Back to course

Central Limit Theorem

Locked lesson.

About this lesson

The Central Limit Theorem is a principle that is used to transform non-normal raw data into a data set that is normal.

Exercise files

Download this lesson’s related exercise files.

Central Limit Theorem.docx
104.4 KB Central Limit Theorem - Dice Roll Data.xlsx
10.4 KB Central Limit Theorem - Solution.docx
88.6 KB

Central Limit Theorem

The Central Limit Theorem is a principle that is used to transform non-normal raw data into a data set that is normal.

When to use

If data is non-normal due to process attributes instead of a special cause, the data can be transformed into a normal data set using the Central Limit Theorem.

Instructions

The normal process outputs of some processes do not create a normal curve. The physical characteristics of the process cause it to be skewed to the high or low side of the output. Or there may be a physical limit or stop to the upper or lower end of the output that truncates the shape of a normal curve. This presents a problem since many of our basic statistical data analyses are based upon the assumption that the data set being analyzed is a normal curve.

This type of non-normal data can be transformed into normal data by using the Central Limit Theorem. The theorem states, “When independent random variables are added, their sum tends toward a normal distribution, even if the original variables themselves are not normally distributed.” Using this principle, groups or samples of the process output can be added together, and if the data is truly independent and random, that data set of summed values will be normal.

There are several key points to applying this principle. First, the number of points in each subgroup must be the same every time. Second, the data must be in the same units or measured in the same manner. For instance, if the data was a discrete pass/fail data value, the criteria used for what constitutes a “pass” or a “fail” must be the same for all data points.

A critical determination is to decide how many data points should be in each subgroup. This is determined by the structure of the original raw data. The table below provides guidance.

Hints & tips

Central Limit Theorem is often used when collecting and analyzing data. It is very easy to apply.
If after applying the Central Limit Theorem the data is still not normal, then you know the underlying data is not just experiencing normal random process variation. Rather, there is some abnormal special root cause that is affecting the results of the data.
When deciding on subgroup size, consider normal physical characteristics of the process. If the data is collected sequentially, try to set your subgroup size at a normal stop time for the process such as the end of a shift – even if that means using a subgroup size that is a little more than the minimum for the type of data.

00:04 Hi, I'm Ray Sheen.
00:06 There's a principle that we rely on when doing statistical analysis
00:09 known as the central limit theorem.
00:12 This principle can help us turn non-normal data into normal data.
00:18 I know I said that we wouldn't be doing heavy duty mathematic derivations, and
00:23 I mean it.
00:24 Although this is a theorem, the principle is straightforward and easy to apply.
00:28 The theorem states, when independent random variables are added,
00:32 their sum tends toward a normal distribution even if the original
00:36 variables themselves are not normally distributed.
00:39 What this means is that if the distribution is non-normal,
00:42 I can separate the distribution into equal-sized subgroups,
00:46 then add the values of the data points in each subgroup.
00:50 The plot of that subgroup data sums shows that it is a normal distribution.
00:54 By doing this subgrouping and then summing,
00:57 we convert the non-normal distribution into a normal distribution.
01:01 But remember several key constraints, first,
01:04 the subgroups must contain the same number of data points.
01:08 Second, be certain the subgroup data point values are determined using the same units
01:13 of measure or standard criteria.
01:15 When the data is discrete,
01:16 the unit is almost always the counts of the attribute being studied.
01:20 When the data points are continuous,
01:22 be certain they're measured on the same scale or with the same units.
01:26 The easiest way to explain how this theorem is used is with an example.
01:30 We'll start with a data set that contains the results of flipping a coin 500 times.
01:35 We assign a value of 1 if the result is heads and 0 If the result is tails.
01:41 Now, looking at the time series, we can easily tell if the data is normal or not.
01:46 So let's put the data into buckets based upon the number of counts in each
01:49 subgroup.
01:50 Well, if the initial subgroup size is one data point,
01:54 we find that we only have two buckets, a 0 and a 1.
01:58 And the data is definitely not normal.
02:01 For the data I was working with, there were 247 tails, or 0s,
02:06 and 253 heads, or 1s.
02:08 Now, create a subgroup of four consecutive flips of the coin and
02:12 add up the sum of the values of the subgroup.
02:15 If all four were tails, the value would be 0.
02:18 If all four were heads, the value is 4.
02:20 And if there are only two heads and two tails, the value is 2.
02:24 As you can see,
02:25 we're starting to get some of the features we would expect with a normal curve.
02:28 It's roughly symmetric and peaked in the middle.
02:31 Now I increase the subgroup size to seven.
02:34 The smallest value for a subgroup is 0 and the largest is 7.
02:38 This data distribution is looking very much like a normal curve.
02:42 And finally, I'll increase the subgroup size up to ten, and
02:46 I still have a data distribution that looks very normal.
02:49 By creating subgroups,
02:51 I have transformed the non-normal data distribution into a normal distribution.
02:57 While our previous example begs the question, how many data points do I need
03:01 in a subgroup in order to get a normal distribution?
03:04 The answer will depend upon the nature of the data distribution and
03:07 the characteristics of the non-normality.
03:10 If the data is already symmetric,
03:12 we already have one of the key attributes of a normal curve, and it will transform
03:17 with a relatively small number of data points in the subgroup, usually five.
03:22 When the data is not symmetric, it may be heavily skewed or exponential.
03:27 Then you need many more data points.
03:29 The rule of thumb is a subgroup size of 30.
03:33 Incidentally, this may be a problem,
03:35 if your process only creates one data point per day and you need 30 for
03:40 a subgroup, it would take an entire month just to get one subgroup.
03:44 And possibly several years to get enough subgroup data points to where you have
03:48 a normal distribution that you can have confidence in your calculation of the mean
03:52 and standard deviation.
03:53 When the central limit theorem doesn't work well for your situation,
03:57 you'll need to do a non-normal data analysis.
04:00 Summarizing our discussion, if the underlying data is normal, you don't need
04:04 to create a subgroup to make it normal, it already is, just use each data point.
04:09 If the underlying data is symmetric,
04:11 use subgroup sizes of five to get a normal distribution.
04:15 And if the data is skewed, use a subgroup size of 30.
04:20 The central limit theorem is a simple and practical technique that can be used to
04:24 transform non-normal data into normal data,
04:27 which then opens up the opportunity to conduct many different hypothesis tests.

Lesson notes are only available for subscribers.

PMI, PMP, CAPM and PMBOK are registered marks of the Project Management Institute, Inc.

Central Limit Theorem

About this lesson

Exercise files

Quick reference

Central Limit Theorem

When to use

Instructions

Hints & tips