About this lesson
Non-normal data often has a pattern. Knowing that pattern can help to transform the data to normal data and can aid in the selection of an appropriate hypothesis test.
Quick reference
Non-normal Distribution
Non-normal data can occur from stable physical systems. Hypothesis tests can be done using non-normal data.
When to use
Before conducting a hypothesis test on a data set with a continuous “Y” and a discrete “X,” check whether the data is normal or non-normal so that you can choose the correct test.
Instructions
Non-normal data is often created by stable physical systems. The non-normality is often due to constraints in the system or environment. There are hypothesis tests that are structured to accept non-normal data sets. However, different tests are best suited to different types of non-normality. The non-normal hypothesis test lessons describe which method to use for different types of non-normality. It is often desirable to graph or plot the data so as to determine the nature of the non-normality.
Normality is determined using basic descriptive statistics of the data sample. In particular, non-normal tests usually use either the median or the variance. Descriptive statistics can also provide measures of the level of non-normality. When doing a descriptive statistics test, several parameters are determined:
- Median – the midpoint of the data points. This is often used in Hypothesis tests with non-normal data.
- Variance – a measure of the spread or width of the distribution. This measure is calculated by squaring the standard deviation of the data set.
- Skewness – a measure of symmetry. A symmetrical distribution has a skewness value of zero. In this course, the distribution is considered normal as long as the value is between -0.8 and +0.8. Beyond that, it is treated as non-normal.
- Kurtosis – a measure of the tails of the distribution that indicates whether they are “heavy” or “light.” There are three types of kurtosis. Leptokurtic means heavy tails: there are many points near the upper and lower bounds of the data. Mesokurtic is associated with the normal curve. Platykurtic means light tails: they drop rapidly to near zero at the upper and lower edges of the distribution. Kurtosis can be measured in several ways. The method used in Excel is “Sample Excess Kurtosis.” This measure has the advantage that a normal curve scores zero, just like with skewness, and values from -0.8 to +0.8 are still considered normal. Minitab uses the true Kurtosis scale, which places the midpoint at 3.0.
- Multi-modal – this occurs when multiple data sets are combined into the one set being investigated. The combined data set will often have multiple “peaks,” one for each of the constituent data sets.
- Granularity – this occurs when the measurement system resolution is too coarse for the data. All the data is lumped together in just a few slices.
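As a rough illustration of the descriptive statistics listed above, the following Python sketch computes the median, sample variance, skewness, and sample excess kurtosis, then applies the -0.8 to +0.8 rule of thumb from this lesson. The function names (`describe`, `looks_normal`, `to_kurtosis_scale`) are hypothetical and not part of Excel or Minitab. The skewness and kurtosis formulas are the standard bias-adjusted sample formulas, which to my understanding are the ones behind Excel's SKEW and KURT functions; treat that match as an assumption and verify against your own tool.

```python
import statistics


def describe(data):
    """Return median, sample variance, skewness and sample excess kurtosis.

    Hypothetical helper for illustration. Uses the bias-adjusted sample
    formulas (assumed to match Excel's SKEW and KURT).
    """
    n = len(data)
    if n < 4:
        raise ValueError("need at least 4 data points")
    mean = statistics.fmean(data)
    s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    if s == 0:
        raise ValueError("all values are identical; shape is undefined")
    z3 = sum(((x - mean) / s) ** 3 for x in data)
    z4 = sum(((x - mean) / s) ** 4 for x in data)
    skewness = n / ((n - 1) * (n - 2)) * z3
    excess_kurtosis = (
        n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    )
    return {
        "median": statistics.median(data),
        "variance": s ** 2,  # variance is the standard deviation squared
        "skewness": skewness,
        "excess_kurtosis": excess_kurtosis,
    }


def looks_normal(stats, limit=0.8):
    """Apply the lesson's rule of thumb: both shape measures within +/- limit."""
    return (abs(stats["skewness"]) <= limit
            and abs(stats["excess_kurtosis"]) <= limit)


def to_kurtosis_scale(excess_kurtosis):
    """Convert excess kurtosis (normal = 0) to the scale where normal = 3."""
    return excess_kurtosis + 3


if __name__ == "__main__":
    sample = [1, 1, 1, 2, 2, 3, 10]  # made-up data with a long right tail
    result = describe(sample)
    print(result)
    print("treated as normal:", looks_normal(result))
```

The rule of thumb is only a screening step. It does not replace plotting the data, since multi-modal or granular data can still produce skewness and kurtosis values inside the range.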
Hints & tips
- If the Data Analysis menu does not show on your Data ribbon in Excel, you need to add the Analysis ToolPak add-in. Go to the File menu, select Options, then select Add-ins. Enable the Analysis ToolPak add-in. This is a free feature that is already in Excel; you just need to enable it. You may need to close and reopen Excel for the menu to appear.
- If you don’t have Minitab, consider downloading the free trial. Minitab normally has a 30-day free trial period. All the non-normal tests we discuss are available in Minitab.
- If your data is non-normal, place your data points in a column or row of Excel and select the graph function to determine the shape of your data. The selection of a non-normal hypothesis test is based on the nature of the non-normality.
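The graphing tip above does not depend on Excel. As a minimal sketch, the hypothetical `text_histogram` function below bins the data and prints a bar for each bin, which is usually enough to spot skew, multiple peaks, or data lumped into a few slices. The bin count and bar width are arbitrary assumptions, and the sample values are made up.

```python
def text_histogram(data, bins=8, width=40):
    """Bin the data and build one text bar per bin.

    Returns (counts, lines) so the counts can be inspected as well as printed.
    """
    lo, hi = min(data), max(data)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values match
    counts = [0] * bins
    for x in data:
        # The maximum value would land one past the end, so clamp it.
        i = min(int((x - lo) / span * bins), bins - 1)
        counts[i] += 1
    peak = max(counts)
    lines = []
    for i, c in enumerate(counts):
        left_edge = lo + span * i / bins
        bar = "#" * round(c / peak * width)
        lines.append(f"{left_edge:8.2f} | {bar} {c}")
    return counts, lines


if __name__ == "__main__":
    # Made-up data from two mixed processes: expect two separate peaks.
    sample = [4.9, 5.0, 5.1, 5.0, 5.2, 4.8, 9.9, 10.0, 10.1, 10.2, 9.8, 10.0]
    _, rows = text_histogram(sample, bins=6)
    print("\n".join(rows))
```

Empty bins between two clusters of bars suggest a multi-modal data set, while only a handful of occupied bins across a wide range can point to granularity.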
- 00:04 Hi, I'm Ray Sheen.
- 00:06 We just looked at how to determine whether or not data is normal.
- 00:10 Let's take a minute and
- 00:11 talk about the implications of non-normal data when doing hypothesis testing.
- 00:16 >> So what do we mean by non-normal variation?
- 00:20 Sometimes the data is not normally distributed.
- 00:23 That doesn't mean that there is a special cause variation.
- 00:25 It may mean that there is some aspect of the physical system that prevents the data
- 00:30 from being characterized with a normal distribution.
- 00:33 The good news is that the hypothesis testing does not require the data to be
- 00:37 normally distributed.
- 00:39 Now, the statistical analysis with non-normal data is different, and
- 00:43 often the math is much more complex.
- 00:46 Back when the analysis was being done by hand,
- 00:48 we wanted to use normal data because it was easier to do the analysis.
- 00:53 But it turns out that with computers to help us,
- 00:55 we can do the non-normal analysis math without too much difficulty.
- 01:00 It seems that computers are actually pretty good at doing math, so
- 01:03 we'll let them.
- 01:05 Now a recommendation for using the non-normal tests.
- 01:08 Different tests are suited to different types of non-normality.
- 01:12 So I recommend that you graph your data first so
- 01:14 that you can see what type of non-normality you're dealing with.
- 01:18 This will make it easier to select the best test for your application.
- 01:23 One more thing about test selection.
- 01:25 Excel does not have the non-normal tests in the data analysis function.
- 01:29 So you will need to be using Minitab or another statistical software package for
- 01:34 those types of tests.
- 01:36 Obviously, you want to select the test that best suits your data.
- 01:40 So if you're limited to using Excel,
- 01:42 you'll need to transform your non-normal data to normal.
- 01:45 We'll talk about that later.
- 01:47 So let's talk about what we mean when we say data is not normal.
- 01:51 First, let's acknowledge that there are many things in the physical world, or
- 01:55 the process and
- 01:56 product design that prevent a data distribution from being normal.
- 02:00 Examples include extreme points that disrupt edges of a distribution.
- 02:05 Now, granted, the sudden incidence of many of these
- 02:08 points is an indication of a special cause occurring.
- 02:11 But an occasional one can occur, and
- 02:13 it may skew your data if you have a small distribution.
- 02:17 Also, there may be physical limits.
- 02:20 For instance, a parameter may be limited so that it cannot be less than 0.
- 02:24 Another thing that we're often trying to test with our hypothesis is whether or
- 02:29 not we have a combination of two or more processes within
- 02:32 the same dataset that will normally create a non-normal distribution.
- 02:37 Now, that's not an exhaustive list.
- 02:38 It's just an illustrative one.
- 02:40 First, there's skewness.
- 02:41 In this case, the data is not symmetrical.
- 02:44 The data is weighted towards one side or the other.
- 02:48 And typically this occurs when there is a physical limit,
- 02:51 either a natural one such as temperature hitting a boiling or
- 02:54 freezing point that changes how the system performs or an artificial one,
- 02:58 such as a machine limit that saturates a capacitor at a certain level.
- 03:03 Next is kurtosis.
- 03:05 This is the shape measure of the distribution, and
- 03:08 is focused on what happens at the edges or tails.
- 03:12 We have three types of Kurtosis, leptokurtosis, which looks like
- 03:16 heavy tails, many extreme points, and sometimes, has the look of a bathtub.
- 03:22 This is often due to many outliers.
- 03:25 Leptokurtosis in Minitab is a value that is >3.
- 03:29 Excel uses a measure known as Excess Kurtosis which subtracts the value of
- 03:34 3 from the actual Kurtosis number.
- 03:36 So in Excel, leptokurtosis occurs when the number is greater than 0.
- 03:43 Mesokurtosis is the normal curve.
- 03:46 In this case, kurtosis does equal 3 in Minitab, or Excess Kurtosis =0 in Excel.
- 03:52 Now however, you may remember that when we were talking about what is normal
- 03:57 variation that we said we would consider everything from -0.8,
- 04:02 to +0.8, to be normal within the Mesokurtosis range for Excel, and so in
- 04:07 Minitab, that would be from 2.2 to 3.8.
- 04:13 Then finally, Platykurtosis is very short tails.
- 04:17 Instead of heavy tails, they're very short.
- 04:19 And think of the platykurtosis as essentially a flat peak with sharp sides
- 04:23 on the edges.
- 04:25 There are very few outliers.
- 04:27 Kurtosis is < 3 and Excess Kurtosis is < 0.
- 04:31 This often occurs when there are either physical limits or
- 04:35 rework or tampering within the data set so that anything that was outside limits
- 04:40 was reworked to be brought back within the central zone.
- 04:45 Another type of non-normality is when there are multiple modes reflected in
- 04:49 the data.
- 04:50 This can occur when the data as collected,
- 04:52 actually has several processes represented in it.
- 04:56 If the modes are widely separated, it's easy to see,
- 04:59 because there will be several distinct peaks when we plot the data.
- 05:03 When the modes are close together, this can often take on a skewness or
- 05:07 kurtotic effect, and it's a little bit more difficult to find.
- 05:12 Finally, there's the issue of granularity.
- 05:14 This is the case when the variable data is not smooth, but
- 05:19 rather seems to come and go in steps or chunks.
- 05:22 This normally means that you have a measurement system problem.
- 05:26 The resolution is not fine enough to distinguish between the different values
- 05:30 in the distribution.
- 05:31 The other possibility is a machine function with step level changes.
- 05:36 Think of it like a gearbox on a car, and
- 05:39 the data shows that you just shifted from first to third.
- 05:44 Let's wrap this up with some principles of hypothesis testing with non-normality.
- 05:51 First, if your hypothesis test is to compare data from multiple data sets,
- 05:55 and the data in any of them is non-normal,
- 05:57 you must use a non-normal analysis.
- 06:00 Generally speaking, the non-normal tests don't have any trouble with normal
- 06:05 data but the normal tests can have some real problems with non-normal data.
- 06:09 Non-normal data often uses the median for
- 06:12 central tendency rather than the mean that we use with normal tests.
- 06:17 This is because of a skewness effect.
- 06:19 The mean or average value will not be a good indication of central tendency.
- 06:24 Also non-normal data often relies on variance, which is the standard deviation
- 06:28 squared, rather than means or medians, when comparing datasets for similarity.
- 06:34 And finally, when working with skewed data,
- 06:37 you want a few more data points than you would have had with normal data.
- 06:41 The number of data points we need will be based upon the confidence interval,
- 06:45 which we discussed in a different lesson.
- 06:47 Start with the number of data points from the confidence interval calculation,
- 06:51 then divide that by 0.86, or it's actually probably easier to multiply it by 1.16.
- 06:56 This is the minimum number of data points needed with skewed data.
- 07:01 >> Non-normal variation occurs frequently in the real world.
- 07:05 When that happens, determine the nature of the non-normality and
- 07:10 then select the best hypothesis test for that data.
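The sample-size adjustment for skewed data described in the transcript can be sketched as a one-line calculation. The function name `skewed_sample_size` is hypothetical, and the starting count is assumed to come from the confidence interval calculation covered in a different lesson. The result is rounded up because a minimum number of data points has to be a whole number.

```python
import math


def skewed_sample_size(n_normal, factor=0.86):
    """Minimum data points for skewed data, per the lesson's rule of thumb.

    n_normal: data points required if the data were normal (from the
              confidence interval calculation).
    factor:   the 0.86 divisor quoted in the lesson.
    """
    return math.ceil(n_normal / factor)


if __name__ == "__main__":
    for n in (30, 50, 100):
        print(n, "->", skewed_sample_size(n))
```

Dividing by 0.86 is slightly more conservative than multiplying by 1.16, because 1 / 0.86 is about 1.163, so the two shortcuts can differ by one data point for larger samples.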