About this lesson
Exercise files
Download this lesson’s related exercise files.
- Non-normal Data.xlsx (14.5 KB)
- Non-normal Data - Solution.docx (86.9 KB)
Quick reference
Non-normal Data
Many processes have non-normal variation, which generates non-normal data. Several conditions can cause this. The Central Limit Theorem is presented here as a tool for normalizing non-normal data.
When to use
A process either generates non-normal data or it does not. When non-normal data exists, the underlying cause should be determined. In many cases, the non-normal data can be transformed into normal data and then controlled using SPC.
Instructions
Non-normal data can exist for many reasons. To use SPC with a process, that non-normal data must first be transformed into normal data. Some of the reasons for non-normal data indicate a process that is out of control, while others can occur even when a process is in control. Let’s first consider the reasons and then what can be done with non-normal data; a quick way to check whether data looks normal is sketched after the list below.
- Too many extreme points. This indicates a process that is out of control. The extreme points prevent the ability to predict process performance. In this case, identify and remove the special causes that created the extreme points. You cannot use SPC until this is done.
- Overlap of multiple processes in the data. This will often generate a distribution that is lumpy, with a lump at the center value of each process in the data. The best approach is to stratify and separate the processes. However, you can also use the Central Limit Theorem to create a normal distribution.
- Sorted data. In this case, the process or system automatically sorts the data into a specific order, or the data at the extremes is automatically reworked so that it is closer to the central value. Move upstream in the data collection and use the original data points instead of the “reworked” data. If you cannot do that, the Central Limit Theorem can normalize this data.
- Natural limit. In this case, one of the tails of the bell-shaped curve is truncated because of an equipment or natural limit in the process. You can transform data skewed by a physical limitation either through the Central Limit Theorem or another transformation technique.
- Insufficient data discrimination. In this case, the data can only take on a few values, such as on/off or true/false, so the raw data can never form a normal bell-shaped curve. You may be able to improve the measurement system. Otherwise, use the Central Limit Theorem to transform the data.
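The lesson does not prescribe a specific test for recognizing non-normal data, but a quick statistical check can help before choosing a remedy from the list above. Below is a minimal sketch in Python, assuming NumPy and SciPy are available; the distributions and seed are illustrative choices, and scipy.stats.normaltest (the D'Agostino-Pearson test) is one common option, not the lesson's own method.

```python
# Minimal sketch: a quick normality check before applying SPC.
# Illustrative data only; normaltest is the D'Agostino-Pearson test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
bell = rng.normal(loc=50, scale=5, size=200)     # roughly bell-shaped
limited = rng.exponential(scale=5.0, size=200)   # natural limit at zero

for name, data in (("bell-shaped", bell), ("natural-limit", limited)):
    stat, p = stats.normaltest(data)  # null hypothesis: the data is normal
    verdict = "treat as normal" if p > 0.05 else "non-normal"
    print(f"{name}: p = {p:.4f} -> {verdict}")
```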
The primary technique for transforming non-normal data is the Central Limit Theorem, which states, “The average of many values tends to have a normal distribution.” So instead of plotting the raw data, a small sample or subset of data is collected and a total or average value for the subset is determined. These subset values will likely be normal for all causes except extreme data points. The obvious question is, “How many data points in a subset?” This depends upon the nature of the non-normality. The rule of thumb is that if the data is symmetrical, use at least 5 points. If the data is not symmetrical, such as skewed data or data with a physical limit, use 30 data points. Regardless, the more data points, the more likely the transformed data will be normal.
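As a concrete illustration of this subgrouping, here is a minimal Python sketch, assuming NumPy is available. The exponential raw data stands in for a process with a natural limit at zero, and the subset size of 30 follows the rule of thumb above for skewed data; both are illustrative choices.

```python
# Minimal sketch: Central Limit Theorem transformation by subgrouping.
import numpy as np

rng = np.random.default_rng(seed=2)
raw = rng.exponential(scale=5.0, size=3000)  # skewed: natural limit at zero

subset_size = 30                             # rule of thumb: 5 symmetric, 30 skewed
subset_means = raw.reshape(-1, subset_size).mean(axis=1)

# For a normal distribution the mean and median nearly coincide; the
# subset means should show this even though the raw data does not.
print(f"raw:          mean={raw.mean():.2f}  median={np.median(raw):.2f}")
print(f"subset means: mean={subset_means.mean():.2f}  median={np.median(subset_means):.2f}")
```

These subset means (or totals) are what you would then plot on the control chart in place of the raw data.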
Data can also be transformed through several transformation algorithms. The most popular transformation in the SPC world is the Box-Cox transformation. Most statistical software applications, such as Minitab, can do this transformation with just a few mouse clicks.
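For the Box-Cox route, here is a minimal sketch using SciPy's implementation; scipy.stats.boxcox is an existing SciPy function, while the lognormal sample and seed are illustrative. Note that Box-Cox requires strictly positive data.

```python
# Minimal sketch: Box-Cox transformation of right-skewed, positive data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
skewed = rng.lognormal(mean=1.0, sigma=0.8, size=500)

transformed, fitted_lambda = stats.boxcox(skewed)  # lambda fit by maximum likelihood
print(f"fitted lambda: {fitted_lambda:.3f}")
# A lambda near zero means the transform is essentially a logarithm;
# the transformed values can now be charted with standard SPC tools.
```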
Once the data is transformed either through the Central Limit Theorem or an algorithm, the new normal data can now be evaluated using SPC tools.
Hints & tips
- If the non-normality is due to extreme points, you must first get the process under control by eliminating those causes. Until then, SPC will not add much value.
- If there are multiple processes present, it is best to separate those and put each one under statistical control. Otherwise, it is difficult to know what to fix when SPC indicates a problem.
- When creating the subsets or samples to support a Central Limit Theorem transformation, try to use logical subsets such as all the points in a shift. The key is that they are from the same time period.
- 00:04 Hi, I’m Ray Sheen.
- 00:06 Now we’ve talked about normal random variation and the normal curve, but
- 00:11 sometimes the data doesn’t fit the normal curve.
- 00:14 Now, when that happens, we don’t call it abnormal, we call that non-normal.
- 00:20 But you're probably saying who cares what it’s called, what do we do with it?
- 00:25 The first thing to do is to recognize when it exists,
- 00:28 because we must do something different with it than we do with normal data.
- 00:32 SPC is based upon having normal data.
- 00:35 If it is not normal, it must be treated with special care.
- 00:38 The good news is that we can often transform non-normal data
- 00:42 into normal data.
- 00:43 Then you can use SPC tools.
- 00:46 Often, the easiest way to transform non-normal data to normal data
- 00:49 is to use a sampling approach.
- 00:52 I don't mean to take just one data point out of a bunch.
- 00:55 This sampling method is to take a group of data points, the sample, and
- 00:59 add them all together.
- 01:00 Then take the next group.
- 01:02 Creating aggregate sample data will often give us normal data
- 01:06 even if the original data points are non-normal.
- 01:09 Another approach is to use a transformation algorithm such as
- 01:12 the Box-Cox transformation.
- 01:14 That is usually too complex to be done by hand, but statistical software
- 01:18 applications like Minitab can do this for us with just a few mouse clicks.
- 01:24 Let's look at the reasons that data may not be normal.
- 01:27 The first reason is too many extreme points.
- 01:30 There are all kinds of weird and special things happening in the process.
- 01:34 In this case, what must be done is first to find and
- 01:37 remove the causes of special extreme points.
- 01:40 Right now, the process is not controllable until those factors have been addressed.
- 01:45 There is no need to transform this data, you're not yet ready for SPC.
- 01:50 Another reason could be that there are multiple unique processes represented in
- 01:54 the data itself.
- 01:55 This type of data will typically be lumpy.
- 01:58 A lump of data representing each of the different processes.
- 02:01 Once again, you're not yet ready for SPC.
- 02:03 In this case, you need to separate out the different processes and
- 02:07 eliminate the unwanted ones.
- 02:08 Then finally, manage the others with their own process management.
- 02:12 The third reason is when the data has already been pre-sorted.
- 02:16 It's usually sorted either highest to lowest or lowest to highest.
- 02:20 Either way, the sorting has changed the order of the data points.
- 02:24 And therefore, the shape of the curve is different.
- 02:27 Often, this data is normal in its original form or
- 02:30 sequence, so just go back to the original data set.
- 02:34 The next reason is a common occurrence in physical processes.
- 02:37 There's a natural limit that the process cannot exceed.
- 02:41 This skews the data at one end or the other of the curve.
- 02:44 The temperature of a fluid can't go below its freezing point and still be a fluid.
- 02:49 The distance you drive a car going east or
- 02:51 west is limited by the ocean that you finally reach.
- 02:55 The time it takes to complete an activity can't be less than zero.
- 02:59 When your data has a natural limit, you will want to transform it.
- 03:03 The transformed data will likely be in the shape of a bell shaped curve.
- 03:07 The final reason is when the measurement system does not have
- 03:10 adequate discrimination to allow the curve to develop.
- 03:13 A dataset that tracks whether a switch was on or
- 03:16 off could only have two values, on and off.
- 03:21 Plotting the data would never look like a bell shaped curve since there
- 03:24 are only two possible values.
- 03:26 However, this is an ideal data set to transform with sampling.
- 03:31 The sampling transformation is based upon the Central Limit Theorem.
- 03:35 This theory states that the average of many values tends to have a normal
- 03:39 distribution.
- 03:40 So instead of taking the individual data points of our on off switch,
- 03:44 we aggregate many instances.
- 03:46 So we could aggregate ten instances of checking the switch and
- 03:49 count how many of those times it was on.
- 03:52 This could be as few as none, or zero, and as many as ten.
- 03:55 We plot the value, and if we're taking many of these samples,
- 03:58 the curve is likely to be a normal curve.
- 04:01 One of the characteristics that we see with this approach in normalizing data is
- 04:05 that the larger the sample the more the curve looks like a normal curve.
- 04:09 Of course, the larger the sample the more data points you must collect and
- 04:13 analyze to create a data distribution.
- 04:15 In fact, we can see this in the example on the right.
- 04:18 We're plotting the results of flipping a coin.
- 04:21 Heads is a pass or 1, and tails is a fail or zero.
- 04:24 The first graph is a plot of every coin flip.
- 04:27 This clearly is not normal.
- 04:30 In the second graph we are aggregating four coin flips.
- 04:33 The values range from zero to 4.
- 04:35 But as you can see, 2 is the most common value, and the plot looks like a pyramid.
- 04:41 In the third graph we are using a group with a sample size of seven.
- 04:45 By now, the graph is starting to look like the bell shaped curve of data with
- 04:49 a normal distribution.
- 04:51 And the final graph has a sample of 10, and
- 04:53 continues to look like a bell shaped curve.
- 04:56 By aggregating these values we have transformed non-normal
- 05:00 data into normal data.
- 05:01 So a question you're probably asking
- 05:04 is how to determine the number of data points to include in the sample.
- 05:08 If the data is already symmetric, it will transform with relatively few data points.
- 05:13 But if it is not symmetric, it will take a much larger quantity.
- 05:16 In fact, if the data is not normal,
- 05:18 it will often take 30 data points to create a normal data distribution.
- 05:23 This table is a good guide for sizing your sample.
- 05:26 If the data is normal,
- 05:27 you don't need to sample, the curve will already be a normal curve.
- 05:31 If the data is symmetric, meaning there are approximately the same number of data
- 05:35 points above the mean as there are below the mean, then use a sample size of 5.
- 05:40 And with non-normal data, use a sample size of 30.
- 05:45 Non-normal data must be handled differently than normal data.
- 05:49 Either remove and control the special cause that's causing the non-normality,
- 05:54 or just simply sample the data and get a normal distribution.
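The coin-flip demonstration from the video (roughly 04:15 to 05:01) can be reproduced with a short simulation. Below is a minimal sketch in Python, assuming NumPy; the flip count, seed, and text-histogram scaling are illustrative choices.

```python
# Minimal sketch: the video's coin-flip example. Aggregate flips into
# samples of 1, 4, 7, and 10 and tally the number of heads per sample;
# larger samples produce a shape closer to the bell curve.
import numpy as np

rng = np.random.default_rng(seed=4)
flips = rng.integers(0, 2, size=7000)          # 1 = heads (pass), 0 = tails (fail)

for n in (1, 4, 7, 10):
    usable = len(flips) - len(flips) % n       # trim so the array reshapes evenly
    heads = flips[:usable].reshape(-1, n).sum(axis=1)
    values, counts = np.unique(heads, return_counts=True)
    print(f"sample size {n}:")
    for v, c in zip(values, counts):
        bar = "#" * int(round(40 * c / counts.sum()))
        print(f"  {v:2d} {bar}")
```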