Retired course
This course has been retired and is no longer supported.
About this lesson
The Central Limit Theorem is a principle that can be used to transform non-normal raw data into a data set that is approximately normal.
Exercise files
Download this lesson’s related exercise files.
Central Limit Theorem.docx (102.3 KB)
Central Limit Theorem - Dice Roll Data.xlsx (10.4 KB)
Central Limit Theorem - Solution.docx (88.6 KB)
Quick reference
Central Limit Theorem
The Central Limit Theorem is a principle that can be used to transform non-normal raw data into a data set that is approximately normal.
When to use
If data is non-normal due to process attributes instead of a special cause, the data can be transformed into a normal data set using the Central Limit Theorem.
Instructions
The normal outputs of some processes do not create a normal curve. The physical characteristics of the process cause the output to be skewed to the high or low side, or there may be a physical limit or stop at the upper or lower end of the output that truncates the shape of a normal curve. This presents a problem, since many of our basic statistical data analyses are based upon the assumption that the data set being analysed is normally distributed.
This type of non-normal data can be transformed into normal data by using the Central Limit Theorem. The theorem states, “When independent random variables are added, their sum tends toward a normal distribution, even if the original variables themselves are not normally distributed.” Using this principle, groups or samples of the process output can be added together, and if the data is truly independent and random, the data set of summed values will be approximately normal.
There are several key points to applying this principle. First, the number of points in each subgroup must be the same every time. Second, the data must be in the same units or measured in the same manner. For instance, if the data was a discrete pass/fail data value, the criteria used for what constitutes a “pass” or a “fail” must be the same for all data points.
A critical determination is how many data points should be in each subgroup. This is determined by the shape of the original raw data: if the raw data is already normal, no subgrouping is needed; if it is symmetric but non-normal, a subgroup size of about five is usually enough; if it is heavily skewed or exponential, use a subgroup size of about 30.
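The short Python sketch below (not part of the original lesson files) illustrates the procedure on made-up data: heavily skewed exponential values are grouped into equal subgroups of 30, each subgroup is summed, and the skewness of the raw data is compared with the skewness of the subgroup sums. The exponential data, the sample size, and the skewness check are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Skewed raw data: 3,000 exponential "cycle time" values (illustrative only).
raw = rng.exponential(scale=5.0, size=3000)

# Central Limit Theorem transform: equal subgroups of 30 consecutive points, summed.
subgroup_size = 30                                   # rule of thumb for heavily skewed data
sums = raw.reshape(-1, subgroup_size).sum(axis=1)    # 100 subgroup sums

# Compare the shape of the raw data and the subgroup sums.
print(f"Raw data skewness:     {stats.skew(raw):+.2f}")
print(f"Subgroup-sum skewness: {stats.skew(sums):+.2f}")
```

The subgroup sums should show a skewness much closer to zero than the raw data, and a histogram of `sums` would show something close to the familiar bell shape.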
Hints & tips
- The Central Limit Theorem is often used when collecting and analysing data, and it is very easy to apply.
- If, after applying the Central Limit Theorem, the data is still not normal, then you know the underlying data is not just experiencing normal random process variation. Rather, there is some special root cause that is affecting the results of the data (a quick way to check this is sketched after this list).
- When deciding on subgroup size, consider normal physical characteristics of the process. If the data is collected sequentially, try to set your subgroup size at a normal stop time for the process such as the end of a shift – even if that means using a subgroup size that is a little more than the minimum for the type of data.
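As a quick way to apply the second hint, here is a minimal sketch (assuming SciPy is available and reusing the same illustrative exponential data as above) that runs a Shapiro-Wilk normality test on the subgroup sums; the 0.05 threshold is a common convention, not a rule from this lesson.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Re-create the illustrative subgroup sums: exponential data, subgroups of 30.
raw = rng.exponential(scale=5.0, size=3000)
sums = raw.reshape(-1, 30).sum(axis=1)

# Shapiro-Wilk normality test on the subgroup sums.
stat, p_value = stats.shapiro(sums)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")

if p_value < 0.05:
    print("Still non-normal: look for a special cause or use a non-normal analysis.")
else:
    print("Subgroup sums are consistent with a normal distribution.")
```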
- 00:04 Hi, I'm Ray Sheen.
- 00:05 There's a principle we rely on when doing statistical analysis known
- 00:09 as the Central Limit Theorem.
- 00:11 This principle can help us turn non-normal data into normal data.
- 00:17 I know I said that we wouldn't be doing heavy-duty mathematical derivations,
- 00:22 and I mean it.
- 00:23 Although this is a theorem, the principle is straightforward and easy to apply.
- 00:27 The theorem states, when independent random variables are added,
- 00:31 their sum tends toward a normal distribution even if
- 00:34 the original variables themselves are not normally distributed.
- 00:37 What this means is that if the distribution is non-normal,
- 00:41 I can separate the distribution into equal sized subgroups,
- 00:45 then add the values of the data points in each subgroup.
- 00:49 A plot of those subgroup sums shows that it is a normal distribution.
- 00:53 By doing this sub-grouping and then summing,
- 00:56 we convert the non-normal distribution into a normal distribution.
- 01:00 But remember several key constraints.
- 01:02 First, the subgroups must contain the same number of data points.
- 01:07 Second, be certain the subgroup data point values were determined using the same
- 01:11 units of measure or standard criteria.
- 01:14 When the data is discrete, the unit is almost always the count of the attribute
- 01:18 being studied.
- 01:19 When the data points are continuous, be certain they're measured on the same scale
- 01:23 or with the same units.
- 01:25 The easiest way to explain how this theorem is used is with an example.
- 01:29 We'll start with a dataset that contains the results of flipping a coin 500 times.
- 01:34 We assign a value of one if the result is heads and zero if the result is tails.
- 01:40 Now, looking at the time series, we can easily tell if the data is normal or not.
- 01:45 So let's put the data into buckets based upon the number of counts in each
- 01:49 subgroup.
- 01:49 Well, if the initial subgroup size is one data point,
- 01:53 we find that we only have two buckets, a 0 and a 1.
- 01:57 And the data is definitely not normal.
- 02:00 For the data I was working with, there were 247 tails or 0s, and 253 heads or 1s.
- 02:07 Now create subgroups of four consecutive flips of the coin and
- 02:11 add up the sum of the values of the subgroup.
- 02:14 If all four were tails, the value would be zero.
- 02:17 If all four were heads, the value is four.
- 02:20 And if there are only two heads and two tails, the value is two.
- 02:23 As you can see,
- 02:24 we're starting to get some of the features we would expect with a normal curve.
- 02:28 It's roughly symmetric and peaked in the middle.
- 02:30 Now I increase the subgroup size to 7.
- 02:33 The smallest value for a subgroup is 0 and the largest is 7.
- 02:37 This data distribution is looking very much like a normal curve.
- 02:41 And finally I'll increase the subgroup size up to ten, and
- 02:44 I still have a data distribution that looks very normal.
- 02:48 By creating subgroups,
- 02:49 I've transformed the non-normal data distribution into a normal distribution.
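To make the coin-flip walkthrough above concrete, here is a minimal Python sketch of the same idea (simulated flips, so the exact counts will differ from the instructor's 247/253 split); it tallies the subgroup sums for subgroup sizes of 1, 4, 7 and 10.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# 500 coin flips: 1 = heads, 0 = tails (simulated, so counts differ from the video).
flips = rng.integers(0, 2, size=500)

# Sum consecutive flips in subgroups of 1, 4, 7 and 10 and tally the resulting sums.
for size in (1, 4, 7, 10):
    usable = (len(flips) // size) * size              # drop flips that don't fill a subgroup
    sums = flips[:usable].reshape(-1, size).sum(axis=1)
    values, counts = np.unique(sums, return_counts=True)
    tally = ", ".join(f"{v}: {c}" for v, c in zip(values, counts))
    print(f"subgroup size {size:>2} -> {tally}")
```

With a subgroup size of 1 there are only two buckets (0 and 1); as the subgroup size grows, the tallies become roughly symmetric and peaked in the middle, just as described above.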
- 02:55 Well, our previous example raises the question: how many data points
- 02:59 do I need in a subgroup in order to get a normal distribution?
- 03:03 The answer will depend upon the nature of the data distribution and
- 03:06 the characteristics of the non-normality.
- 03:09 If the data is already symmetric,
- 03:11 we already have one of the key attributes of a normal curve, and it will transform
- 03:16 with a relatively small number of data points in the subgroup, usually five.
- 03:21 When the data is not symmetric, it may be heavily skewed or exponential.
- 03:27 Then you need many more data points.
- 03:28 The rule of thumb is a subgroup size of 30.
- 03:31 Incidentally, this may be a problem.
- 03:34 If your process only creates one data point per day, and you need 30 for
- 03:39 a subgroup, it would take an entire month just to get one subgroup.
- 03:43 And possibly several years to get enough subgroup data points for a normal
- 03:47 distribution where you'll have confidence in your calculation of
- 03:51 the mean and standard deviation.
- 03:52 When the central limit theorem doesn't work well for your situation,
- 03:56 you'll need to do a non-normal data analysis.
- 03:58 Summarizing our discussion.
- 04:00 If the underlying data is normal, you don't need to create a subgroup to make it
- 04:05 normal.
- 04:05 It already is; just use each data point.
- 04:08 If the underlying data is symmetric,
- 04:10 use subgroup sizes of five to get a normal distribution.
- 04:14 And if the data is skewed, use a subgroup size of 30.
- 04:18 The central limit theorem is a simple and practical technique that can
- 04:23 be used to transform non-normal data into normal data.
- 04:27 Which then opens up the opportunity to conduct many different hypothesis tests.
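As a closing sketch (again using made-up data rather than the lesson's exercise files), the summary rules of thumb can be tried numerically: symmetric but non-normal data (uniform, as an assumption) with subgroups of five, and skewed data (exponential) with subgroups of 30. The script prints the skewness and a Shapiro-Wilk p-value for the subgroup sums of each; the exact numbers will vary with the random seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)

def subgroup_sums(data, size):
    """Sum consecutive values in equal subgroups of `size`, dropping any remainder."""
    usable = (len(data) // size) * size
    return data[:usable].reshape(-1, size).sum(axis=1)

# Symmetric but non-normal data: uniform values, rule-of-thumb subgroup size 5.
symmetric = rng.uniform(0.0, 10.0, size=500)
# Heavily skewed data: exponential values, rule-of-thumb subgroup size 30.
skewed = rng.exponential(scale=2.0, size=3000)

for name, data, size in (("symmetric/uniform", symmetric, 5),
                         ("skewed/exponential", skewed, 30)):
    sums = subgroup_sums(data, size)
    _, p_value = stats.shapiro(sums)
    print(f"{name}, subgroup size {size}: "
          f"skewness {stats.skew(sums):+.2f}, Shapiro-Wilk p = {p_value:.3f}")
```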