About this lesson
Lean Six Sigma methodology relies heavily on statistical analysis of problems and solutions. A single data point is not sufficient; rather, a collection of data is needed for analysis. This collection will have some natural variability within it, and descriptive statistics explain the boundaries of that variability.
Quick reference
Descriptive Statistics
When to use
Descriptive statistics are used whenever there is a data set to be analyzed. This occurs in the Measure, Analyze, Improve, and Control phases.
Instructions
While a single data point is interesting, a set of data values for a process parameter provides a much richer and more complete picture of that aspect of the process. In general, the more data points, the more accurate the picture. However, a large set of numbers is awkward to work with, so the data set is described using a set of standard statistical measures. These descriptive statistics are used throughout the Lean Six Sigma process.
The first three statistics are often used to describe the central tendency of the data set.
Mean – the average value of the data set. It is calculated by adding all the values in the data set and dividing that sum by the number of data values. It is often written as "x bar" and expressed as: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. The mean can be heavily influenced by outlier values.
Median – the middle point in the data set. Order the data set from smallest value to largest value. The center point is the median. If the data set has an even number of data points, the average of the two center points is the median. This is a better measure of central tendency when the data is skewed or there are outliers.
Mode – the most frequently occurring value within the data set. This statistic is seldom used in Lean Six Sigma.
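The three central tendency measures above can be sketched with Python's standard `statistics` module. The data values are hypothetical cycle times invented for illustration; they are not taken from the lesson or its exercise files. The last value is a deliberate outlier, so the mean is pulled upward while the median stays near the bulk of the data.

```python
import statistics

# Hypothetical cycle-time measurements in minutes (illustrative data only).
cycle_times = [12, 15, 11, 15, 14, 13, 48]

mean_value = statistics.mean(cycle_times)      # sum of the values / number of values
median_value = statistics.median(cycle_times)  # middle value of the sorted data
mode_value = statistics.mode(cycle_times)      # most frequently occurring value

# With an even number of points, the median is the average of the two center values.
even_median = statistics.median([11, 12, 13, 14])

print(f"mean={mean_value:.2f}, median={median_value}, mode={mode_value}")
print(f"median of an even-sized set={even_median}")
```

Here the single value of 48 moves the mean well above every other point, which is the outlier sensitivity the definition of the mean warns about.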
The next three statistics are used to describe some aspect of the span or width of the data set.
Range – the span of the data set. Subtract the smallest data value from the largest data value.
Deviation – the span from the average value of the data set to a specific data point. Deviation is always associated with a specific point, not the entire data set.
Standard Deviation – the square root of the average of the squared deviations. This value provides a measure of the width of the data set that accounts for the central tendency and the full range of the data. This statistic is often represented by the Greek symbol, σ.
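$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$
where $x_i$ is each data value, $\bar{x}$ is the mean, and $n$ is the number of data values.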
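As a minimal sketch of the three span measures, the following computes the range, the deviation of each point, and the standard deviation step by step. The data set is hypothetical. The calculation divides by the number of data values, as described in this lesson; that matches `statistics.pstdev` (the population form), whereas `statistics.stdev` divides by n - 1 and would give a slightly larger result.

```python
import math
import statistics

# Hypothetical measurements (illustrative data only).
data = [4, 8, 6, 5, 7]
n = len(data)

mean_value = sum(data) / n

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Deviation: one value per data point, measured from the mean.
deviations = [x - mean_value for x in data]

# Standard deviation: square each deviation, average them, take the square root.
sigma = math.sqrt(sum(d ** 2 for d in deviations) / n)

print(f"range={data_range}, deviations={deviations}, sigma={sigma:.4f}")

# Cross-check against the standard library's population standard deviation.
library_sigma = statistics.pstdev(data)
```

Note that the deviations always sum to zero, which is why they are squared before averaging.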
Hints & tips
- Know these definitions and how to find these values. You will be using them often.
- The number of data points can have a significant effect on descriptive statistics. With a small number, one outlier will have a big effect. For that reason, to improve your statistical confidence in describing a dataset, gather more data points.
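The effect of sample size on outlier sensitivity can be illustrated with a small sketch. Both data sets and the outlier value are hypothetical, chosen only to make the arithmetic easy to follow: the same outlier is added to a set of 4 points and to a set of 40 points, and the shift in the mean is compared.

```python
import statistics

# Hypothetical data: every point is 10 until one outlier of 50 is added.
small_set = [10] * 4
large_set = [10] * 40
outlier = 50

# How far the mean moves when the outlier joins each data set.
small_shift = statistics.mean(small_set + [outlier]) - statistics.mean(small_set)
large_shift = statistics.mean(large_set + [outlier]) - statistics.mean(large_set)

# The median of the small set is unaffected by the same outlier.
small_median_after = statistics.median(small_set + [outlier])

print(f"mean shift with 4 points:  {small_shift:.2f}")
print(f"mean shift with 40 points: {large_shift:.2f}")
print(f"median of small set after outlier: {small_median_after}")
```

With more data points, the same outlier moves the mean far less, which is the reason for gathering more data before drawing conclusions.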
- 00:04 Hi, I'm Ray Sheen, and it's time to lay a foundation,
- 00:08 for the Lean Six Sigma statistics.
- 00:11 Descriptive statistics will be the building blocks that we'll start with.
- 00:14 Statistics are a way to describe a dataset.
- 00:17 So let's start there.
- 00:19 A data set is a collection of data points associated with a problem or
- 00:23 process parameter.
- 00:24 With statistics, we can gain understanding about the data in the data set.
- 00:29 And we can make predictions about additional data that may be collected and
- 00:33 become part of that dataset.
- 00:35 Descriptive statistics summarize existing data with
- 00:38 statistical terms like mean and standard deviation.
- 00:41 They provide insight about the collected data.
- 00:45 If that data set is only a subset of a total data population,
- 00:48 the descriptive statistics may be used to predict the performance of
- 00:53 the rest of the data in the population.
- 00:56 However, the key is that this describes what exists in the data set.
- 01:02 Inferential statistics are a statistical analysis of the data to determine if
- 01:07 the data can support the inference associated with a hypothesis.
- 01:12 This will help us to draw conclusions about the real world that is reflected in
- 01:17 the data set.
- 01:18 And that means both the existing data in the dataset that we have and
- 01:22 potentially the full population of all the data that could be in the dataset.
- 01:27 Descriptive statistics can be used to predict the probability of an event that
- 01:31 is reflected within the dataset.
- 01:33 This provides insight about the characteristics that we would likely
- 01:37 observe in a sample of the data population.
- 01:41 Let's first describe the mean, median, and mode.
- 01:45 All three of these are statistics that tell us something about the central
- 01:49 tendency, or the most common value for the dataset.
- 01:53 First is the mean, or average value.
- 01:56 This is often referred to as an x with a bar over it and it's called, then, x bar.
- 02:03 The mean is very easy to calculate.
- 02:05 Just add all the data values together and then divide by the number of data points.
- 02:10 The mean is our favorite measure for central tendency when we have normal data.
- 02:14 And I'll talk more about that in another lesson.
- 02:17 One caution about the mean.
- 02:19 When there are large outliers in the data set,
- 02:21 that means a point that is extremely high or extremely low,
- 02:25 the mean can become significantly influenced by just a few key points.
- 02:30 Next is the median.
- 02:32 This is the middle point of the data set.
- 02:34 To find the median, you must first take the data set and order it from highest
- 02:39 to lowest, top to bottom, then, select the point that is in the middle.
- 02:45 If there are an odd number of points, the middle point is that median value.
- 02:50 If there are an even number of points, you have to select the two center values and
- 02:54 take the average of those two.
- 02:56 The median is used when we have a non-normal or skewed data set.
- 03:00 In that case,
- 03:01 it provides a better indication of central tendency than the mean.
- 03:06 The last measure is the mode.
- 03:08 This is the easiest to find since it is just the data value that
- 03:11 occurs most frequently.
- 03:14 While it's easy to find, it has very little value for us in Lean Six Sigma.
- 03:18 So let's shift gears now and consider the range, deviation,
- 03:22 standard deviation, and variance.
- 03:26 Whereas the mean value and median value were looking at central tendency of
- 03:30 the distribution, these statistics look at the edges of the distribution.
- 03:35 They tell us something about the span or
- 03:37 width of the data from the lowest to the highest value.
- 03:41 The first one of these is the range.
- 03:43 This is normally easy to determine.
- 03:46 Again, we order the data points from lowest to highest then we subtract
- 03:49 the value of the lowest from the value of the highest and you have the range.
- 03:53 It is the distance between the two extreme values.
- 03:58 Range is the distance from minimum to maximum.
- 04:01 Deviation is also a distance between two values.
- 04:05 However, in this case, it's a difference between the value of a data point and
- 04:09 the mean or average value of the data points in the set.
- 04:13 So deviation is always associated with a specific data point.
- 04:17 And it tells us how close that data point is to the center of the data.
- 04:23 If the deviation was 0,
- 04:24 it would indicate that that point is right smack in the middle of the data set.
- 04:30 Now, information about a single data point, while it may be interesting for
- 04:35 that point, doesn't help us with respect to the data set.
- 04:39 So when we're looking at statistics,
- 04:41 we need to be able to describe the entire data set.
- 04:44 That is why we often use standard deviation in our statistical analysis.
- 04:48 The standard deviation provides a sense of the size of the expected data range.
- 04:54 This is calculated by taking each deviation, which I just described a moment
- 04:59 ago and squaring each of those, that means to multiply it by itself.
- 05:04 Add all those together, then divide by the number of data values.
- 05:08 Now finally, we take the square root of the result.
- 05:12 This is the standard deviation calculation,
- 05:15 and it is normally represented by the Greek symbol sigma.
- 05:19 The standard deviation is an excellent statistic for
- 05:23 estimating the normal range or span of data that will occur in the data set.
- 05:28 In fact,
- 05:29 the Six Sigma method derives its name from the standard deviation value, sigma.
- 05:35 The goal of Six Sigma was to create a process with a standard deviation that
- 05:40 was so small, that the range of -6 sigma to +6 sigma could fall within
- 05:45 the allowable tolerance limits for the process.
- 05:49 The variance will be the last of the descriptive statistic measures that
- 05:52 we'll discuss.
- 05:53 In fact, it's based on the standard deviation and
- 05:55 comes from squaring the standard deviation.
- 05:58 For some analysis,
- 05:59 the variance is much more important than the standard deviation itself.
- 06:03 These basic descriptive statistics will be referred to again and
- 06:08 again throughout this course and they will be covered exhaustively on the IASSC exam.
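The variance and the Six Sigma goal described in the transcript can be sketched as follows. The data set, the tolerance limits, and the helper name `fits_six_sigma` are all hypothetical and introduced only for illustration. The check simply tests whether the mean minus 6 sigma and the mean plus 6 sigma both fall inside the allowed limits, using the population standard deviation (dividing by the number of data values, as in this lesson).

```python
import statistics

# Hypothetical measurements (illustrative data only).
data = [4, 8, 6, 5, 7]

sigma = statistics.pstdev(data)        # population standard deviation
variance = statistics.pvariance(data)  # variance = standard deviation squared


def fits_six_sigma(values, lower_limit, upper_limit):
    """Return True if mean - 6*sigma and mean + 6*sigma lie within the limits.

    Hypothetical helper: the limits stand in for a process's tolerance limits.
    """
    mean_value = statistics.mean(values)
    spread = 6 * statistics.pstdev(values)
    return lower_limit <= mean_value - spread and mean_value + spread <= upper_limit


print(f"sigma={sigma:.4f}, variance={variance}")
print("fits within -3..15:", fits_six_sigma(data, -3, 15))
print("fits within 0..12: ", fits_six_sigma(data, 0, 12))
```

A process that fails the check needs either a smaller standard deviation or wider tolerance limits to meet the goal described above.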