Retired course
This course has been retired and is no longer supported.
About this lesson
Data sets are often displayed in distributions. Different distributions are indicative of different physical phenomena. The ability to recognize a distribution will aid in the identification of process performance issues.
Quick reference
Classes of Distribution
When to use
Visualizing a dataset is often an easier way to explain the characteristics of the data than presenting tables of numbers. In addition, each class of distribution has specific characteristics that dictate which type of hypothesis test is appropriate for that data.
Instructions
Discrete Data Distributions
Discrete data has a limited number of values and there is no meaningful data between the data categories. Therefore, these distributions are always shown as histograms with counts of data elements in each category or bucket that is represented by a vertical bar.
Binomial Distribution
This is a distribution for data that can only take on one of two states, such as pass/fail. It relies on a fixed batch size. The horizontal axis is the count per batch. The vertical axis is normally the percentage of batches in which that count occurred. An example is the number of defective units in a batch.
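As a rough illustration of the binomial description above, the following sketch computes the probability of each failure count for a fixed batch size. The batch size of 20 and the 5% fail rate are assumed values chosen only for the example; they do not come from the lesson.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k failures in a batch of n independent units."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Assumed example values (not from the lesson): 20 units per batch, 5% fail rate.
BATCH_SIZE = 20
FAIL_RATE = 0.05

# One probability per possible count; these are the heights of the histogram bars.
binomial_bars = [binomial_pmf(k, BATCH_SIZE, FAIL_RATE) for k in range(BATCH_SIZE + 1)]

for count, prob in enumerate(binomial_bars[:5]):
    print(f"{count} failures per batch: {prob:.1%}")
```

Because every batch must contain some count between 0 and the batch size, the bar heights add up to 100%.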
Poisson Distribution
This distribution is a count of occurrences of an event. The batch is based upon a time dependency, such as a day. The horizontal axis is the count of the instances that occurred in that time period. An example is the number of phone calls in a day.
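The Poisson description above can be sketched the same way. The only input is the average number of occurrences per time period; the value of 4 calls per day below is an assumption made for the example, not a figure from the lesson.

```python
from math import exp, factorial


def poisson_pmf(k: int, mean_rate: float) -> float:
    """Probability of exactly k occurrences in one time period."""
    return mean_rate**k * exp(-mean_rate) / factorial(k)


# Assumed example value (not from the lesson): an average of 4 phone calls per day.
CALLS_PER_DAY = 4.0

# Counts above 30 are vanishingly unlikely at this rate, so the list stops there.
poisson_bars = [poisson_pmf(k, CALLS_PER_DAY) for k in range(31)]

for count in range(0, 9, 2):
    print(f"{count} calls in a day: {poisson_bars[count]:.1%}")
```

Unlike the binomial case, there is no batch size that caps the count; the bars simply become negligibly small as the count grows.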
Geometric Distribution
This distribution is again used with data items that only have two states. In this case, instead of counting the number in a batch, it is based upon when the state changes. The horizontal axis is a time line. The vertical axis shows the probability of the event changing state during that time period. An example is plotting the day of the month when rainfall or snowfall first occurs in that month.
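A minimal sketch of the geometric case follows, using the rainfall example above. It assumes each day has the same, independent chance of rain; the 20% figure is a made-up value for illustration.

```python
def geometric_pmf(k: int, p: float) -> float:
    """Probability that the first state change happens in period k (k = 1, 2, ...)."""
    return (1 - p) ** (k - 1) * p


# Assumed example value (not from the lesson): a 20% chance of rain on any given day.
DAILY_RAIN_CHANCE = 0.2

# Bar heights for day 1 through day 30 of the month.
geometric_bars = [geometric_pmf(day, DAILY_RAIN_CHANCE) for day in range(1, 31)]

# Chance that the first rain has arrived by day 7: the sum of the first seven bars.
by_day_7 = sum(geometric_bars[:7])

print(f"First rain on day 1: {geometric_bars[0]:.1%}")
print(f"First rain by day 7: {by_day_7:.1%}")
```

The tallest bar is always the first period, and each later bar is shorter than the one before it.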
Continuous Distribution
Continuous data can take on an infinite number of values. Between any two data values, there is another data value that could be detected if the measurement system were able to discriminate at that level of fraction or decimal. The plots are characterized by a smooth curve, not histogram bars. In all these plots, the horizontal axis is the independent variable and the vertical axis is the process performance dependent variable.
Normal Distribution
This is the bell-shaped curve that represents common cause or random variation. It is symmetric, peaked in the center and the tails approach zero. This is normally our desired distribution for analysis because we know that it represents random variation around the process performance.
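The symmetry and the central peak described above can be checked with Python's standard library. This is only a sketch; the process mean of 50 and standard deviation of 2 are assumed values.

```python
from statistics import NormalDist

# Assumed example values (not from the lesson): process mean 50, standard deviation 2.
process = NormalDist(mu=50.0, sigma=2.0)

# Symmetry: points equally far above and below the mean have the same density.
left, right = process.pdf(47.0), process.pdf(53.0)

# Share of output expected within one standard deviation of the mean.
within_one_sigma = process.cdf(52.0) - process.cdf(48.0)

print(f"density at 47 and 53: {left:.4f}, {right:.4f}")
print(f"density at the mean:  {process.pdf(50.0):.4f}")
print(f"within one sigma:     {within_one_sigma:.1%}")
```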
Uniform Distribution
This is a horizontal line or essentially equal vertical value for all horizontal axis values. This represents the case where the process performance does not depend upon the independent variable.
Bi-modal Distribution
This is normally an asymmetric curve. There are two (or more) peaks. This represents the case where there are multiple processes embedded in the data. These need to be separated and each process analysed individually.
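One way to picture the separation step is to group the data by a candidate factor and summarize each group on its own. The sketch below uses invented measurements tagged with a hypothetical machine label ("A" or "B"); neither the data nor the factor comes from the lesson.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical data: (machine label, measured value). Invented for illustration.
measurements = [
    ("A", 9.8), ("A", 10.1), ("A", 10.0), ("A", 9.9), ("A", 10.2),
    ("B", 12.1), ("B", 11.9), ("B", 12.0), ("B", 12.2), ("B", 11.8),
]

# Pooled view: both machines mixed together, which is where two peaks would appear.
pooled = [value for _, value in measurements]

# Separated view: one list of values per machine.
groups = defaultdict(list)
for machine, value in measurements:
    groups[machine].append(value)

summary = {machine: (mean(values), stdev(values)) for machine, values in groups.items()}

print(f"pooled: mean={mean(pooled):.2f} stdev={stdev(pooled):.2f}")
for machine, (avg, spread) in sorted(summary.items()):
    print(f"machine {machine}: mean={avg:.2f} stdev={spread:.2f}")
```

If the chosen factor is the right one, each group shows a much tighter spread than the pooled data and can then be analysed individually.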
Exponential Distribution
This is an asymmetric curve. One end starts at a point on the vertical axis and the other end of the curve approaches, but never reaches, a value of zero. A typical physical phenomenon that follows this pattern is the failure rate of a product or system that is subject to infant mortality.
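The shape described above follows directly from the standard exponential formulas, sketched here with an assumed rate of 0.5 events per year (a value picked for the example only).

```python
from math import exp


def exponential_pdf(x: float, rate: float) -> float:
    """Density at x >= 0; the curve starts at `rate` on the vertical axis."""
    return rate * exp(-rate * x)


def exponential_cdf(x: float, rate: float) -> float:
    """Probability that the event has occurred by x."""
    return 1.0 - exp(-rate * x)


# Assumed example value (not from the lesson): 0.5 failures per year on average.
FAILURE_RATE = 0.5

for years in (0, 1, 2, 5, 10):
    print(f"{years:>2} yr  pdf={exponential_pdf(years, FAILURE_RATE):.3f}"
          f"  cdf={exponential_cdf(years, FAILURE_RATE):.3f}")
```

The density is highest at zero and only ever decreases, while the cumulative probability climbs quickly at first and then levels off toward 1.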
Log-normal Distribution
This is also an asymmetric curve. Both ends of the curve are at zero. However, one end quickly shoots up and then it slowly decays back to zero. This is also a commonly occurring pattern in the real world. For instance, machine down time follows this pattern: it takes a finite amount of time to do a repair, which produces the major spike, and some repairs take longer, which produces the long tail.
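The quick rise and slow decay can be seen in the standard log-normal density, sketched below. The parameters are assumed values for a hypothetical repair time in hours; they are not taken from the lesson.

```python
from math import exp, log, pi, sqrt


def lognormal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density at x > 0 when ln(x) is normal with mean mu and std dev sigma."""
    return exp(-((log(x) - mu) ** 2) / (2 * sigma**2)) / (x * sigma * sqrt(2 * pi))


# Assumed example values (not from the lesson), with repair time in hours.
MU, SIGMA = 1.0, 0.5

mode = exp(MU - SIGMA**2)             # where the spike sits
median = exp(MU)                      # half of all repairs finish by this time
mean_time = exp(MU + SIGMA**2 / 2)    # pulled to the right by the long tail

for label, hours in (("mode", mode), ("median", median), ("mean", mean_time)):
    print(f"{label:>6}: {hours:.2f} h  density={lognormal_pdf(hours, MU, SIGMA):.3f}")
```

The ordering mode < median < mean is the numerical signature of the long right-hand tail.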
Weibull Distribution
The Weibull curve is actually a family of curves that can take on many shapes including an exponential, log-normal, or even normal. The actual shape varies based upon factors or constants in the Weibull equation. This equation has proven very effective at modelling reliability in complex systems. The factors are based upon the system design parameters.
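To show how the constants change the shape, here is a sketch of the common two-parameter form of the Weibull equation. The parameter names `shape` and `scale` and the numeric values are assumptions for the example; the lesson does not state which form of the equation it uses.

```python
from math import exp


def weibull_pdf(x: float, shape: float, scale: float) -> float:
    """Two-parameter Weibull density for x > 0."""
    z = x / scale
    return (shape / scale) * z ** (shape - 1) * exp(-(z**shape))


def weibull_hazard(x: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate at age x > 0."""
    return (shape / scale) * (x / scale) ** (shape - 1)


SCALE = 2.0  # assumed example value

for shape in (0.5, 1.0, 2.0):
    early, late = weibull_hazard(1.0, shape, SCALE), weibull_hazard(4.0, shape, SCALE)
    print(f"shape={shape}: failure rate at age 1 = {early:.3f}, at age 4 = {late:.3f}")
```

With a shape of 1 the formula reduces to the exponential distribution and the failure rate is constant. A shape below 1 gives a failure rate that falls with age, and a shape above 1 gives one that rises with age.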
PDF and CDF
The final topic in distributions is the PDF and CDF displays. PDF stands for Probability Density Function. This is the type of display used in all the curves shown earlier in this reference guide. The height of the curve shows the relative likelihood that a data point will occur at that value of the horizontal axis. The higher the point, the more density at that point of the distribution.
CDF stands for Cumulative Distribution Function and shows the probability that a point in the distribution will have occurred by that level of the horizontal axis. In a CDF, the curve always starts at zero on the left end, because no data points fall below the lowest value, and ends at one on the right end, because all data points have occurred by then and the probability is 100%.
If the distribution is a uniform distribution, the PDF is a flat line (as shown above) and the CDF is a straight diagonal line going from zero to one. If the underlying distribution is a normal curve when shown as a PDF, it is an S curve when shown as a CDF. The slope starts very shallow when small changes are occurring on the left tail of the normal curve. The slope becomes steep in the center of the curve when the normal curve is peaking, and then the slope becomes shallow again as the horizontal axis approaches the right side of the normal curve. The CDF of an exponential distribution starts at zero, rises steeply right away, and then flattens out as it approaches the value of 1.
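The link between the two displays can be checked numerically: the slope of the CDF at any point matches the height of the PDF at that point. The sketch below uses a standard normal distribution (mean 0, standard deviation 1) purely as a convenient example.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1


def cdf_slope(x: float, step: float = 1e-4) -> float:
    """Numerical slope of the CDF at x, using a central difference."""
    return (standard.cdf(x + step) - standard.cdf(x - step)) / (2 * step)


for x in (-3.0, 0.0, 3.0):
    print(f"x={x:+.0f}  pdf={standard.pdf(x):.4f}  cdf={standard.cdf(x):.4f}"
          f"  cdf slope={cdf_slope(x):.4f}")
```

The slope is small in both tails and largest at the center, which is what produces the S shape.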
Hints & tips
- If the graph is a bar graph (histogram), it is discrete data; if it is a smooth curve, it is continuous data.
- PDF and CDF show the same information, just with different ways of expressing the vertical scale values. PDF is for that specific horizontal scale value. CDF is for all the horizontal scale values to the left of that point.
Transcript
- 00:04 Hi, I'm Ray Sheen. It's often very helpful to recognize the type of distribution that
- 00:09 you're working with.
- 00:10 Different physical phenomena create different types of distributions.
- 00:14 When you can recognize the distribution,
- 00:16 it provides insight into the process performance characteristics.
- 00:21 >> Let's start by looking at some of the more common distributions
- 00:25 that we see with discrete data.
- 00:28 Keep in mind that discrete data can only take on a limited number of values.
- 00:32 So often our distribution will be a plot of
- 00:35 a count of different discrete value buckets.
- 00:37 By that I mean, how many times did it occur on Monday?
- 00:40 How many times on Tuesday?
- 00:41 How many on Wednesday?
- 00:43 So let's look at some typical distributions.
- 00:46 The first is the binomial distribution.
- 00:48 In this case, the data is pass/fail type of data;
- 00:52 what is plotted is the count of fails per standard-size subgroup.
- 00:57 Once you've characterized this distribution, you can predict the number
- 01:01 of failures that will occur in the next batch or subgroup.
- 01:04 The next one to consider is the Poisson distribution.
- 01:08 The distribution counts the number of occurrences within a bucket or category.
- 01:13 It is quite often used to count the number of occurrences within a set time period,
- 01:18 such as defects per day or hour.
- 01:21 In this distribution, we can picture what a typical level of process performance
- 01:25 looks like, the normal condition, and the min and max values.
- 01:29 Again, we're dealing with the discrete data because it
- 01:32 is a count of the defects per day.
- 01:35 We can't have 7.4 defects occurring, it's either seven or it's eight.
- 01:39 But we can use this to set expectations about the process performance and
- 01:44 predict the probability of failures or defects.
- 01:47 The geometric distribution shares some similarities with the binomial and
- 01:51 Poisson distribution, but there are some differences.
- 01:54 Like a binomial distribution,
- 01:56 it's often associated with pass/fail data or data that can only take on two states.
- 02:00 But like the Poisson distribution, it is often tied to time increments or buckets.
- 02:05 So in this case, we determine how much time does it take before the item
- 02:09 first changes state from compliant to defective.
- 02:12 It is often used to predict failure rates.
- 02:15 Now let's change our focus from discrete data to continuous data.
- 02:19 Continuous or variable data can take on any value.
- 02:22 By that, I mean that a data point could occur between any two data points.
- 02:26 There's not a finite number of values, rather,
- 02:29 there's a possibility of an infinite number of values.
- 02:31 We can always take a fraction between any two data points,
- 02:34 our only limitation is the discrimination of our measurement system.
- 02:38 The first one is a distribution that we've already seen and discussed, and
- 02:42 that's the normal distribution.
- 02:43 This is characterized by a form that is symmetric, a high peak in the middle, and
- 02:48 tails that approach zero.
- 02:49 As we have often noted in the past,
- 02:52 this represents the common cause of variability within a process.
- 02:56 The next one is a uniform distribution,
- 02:59 this is symmetric because it is essentially a flat or a horizontal line.
- 03:03 This represents no relationship between the variables being plotted.
- 03:07 No matter how you change the factor that's on the x-axis, the y factor is unfazed.
- 03:13 Our next distribution is bi-modal, or, for that matter,
- 03:16 we could have had a tri-modal or even a quad-modal distribution.
- 03:20 This is normally asymmetric, when you see this type of distribution,
- 03:24 it indicates that there are actually multiple independent distributions in your
- 03:29 data set.
- 03:29 You need to analyze this to identify the factor that can separate out the two
- 03:34 distributions.
- 03:35 Once they've been separated, they can each then be analyzed.
- 03:38 This will sometimes occur when you have multiple products,
- 03:42 multiple locations, multiple suppliers, multiple processes,
- 03:47 multiple people, multiple factors that are involved in the process.
- 03:52 Next I would like to discuss the exponential distribution.
- 03:56 This is definitely asymmetric. One end will occur at some point on the y-axis and
- 04:01 the distribution will then trend down, approaching the y value of zero.
- 04:06 There are a number of physical phenomena that behave in an exponential fashion;
- 04:10 one that is commonly found is failure rates.
- 04:13 Now let's look at the log normal distribution.
- 04:16 This is also asymmetric, but it reflects a value of zero on both the left and
- 04:21 the right side of the curve.
- 04:22 However, unlike the normal curve, it is heavily skewed to one side or the other.
- 04:28 It also is a good measure of some physical phenomena such as machine down time.
- 04:32 The last distribution I wanna mention is actually a family
- 04:36 of distributions known as the Weibull curve.
- 04:38 The Weibull curve represents multiple effects, and
- 04:41 therefore, can take on multiple shapes, including exponential or log-normal.
- 04:46 The actual curve is based upon the selection of some parameters in
- 04:49 the formula.
- 04:50 But these will often change over time.
- 04:52 The Weibull distribution is most commonly used for reliability analysis and
- 04:57 predictions.
- 04:58 Let's wrap up our discussion of distributions by looking at two different
- 05:02 ways to represent a distribution.
- 05:04 These different methods are really just different data visualizations of
- 05:09 the same information.
- 05:10 The Probability Density Function or PDF view, is the format of the distributions
- 05:15 that I have been showing you on the previous slides.
- 05:18 The x-axis takes on a value from the lowest to the highest of
- 05:21 that particular variable, and the y value represents the probability
- 05:25 that a given data point will have that value.
- 05:28 In other words, the higher the y value,
- 05:30 the higher the probability that a data point will be occurring at that x value.
- 05:35 We contrast this with a Cumulative Distribution Function, or CDF.
- 05:39 This is a graph where the x-axis shows the range of the x variables just like PDF.
- 05:46 And the y-axis always starts on the left side at 0, and
- 05:50 rises to a value of 1 on the right side.
- 05:52 What is being plotted is the likelihood that the next x value is equal to or
- 05:57 less than the x value on the axis.
- 06:00 So at the far left, it is 0 because there are no other points further to the left.
- 06:05 And at the far right, we're up at 1 or
- 06:07 100% because all of the data points are equal to or less than that value.
- 06:13 Let's look at an example, here's the PDF and CDF for
- 06:17 a normal or Poisson distribution.
- 06:19 The histogram is the Poisson view and the curve is the normal view.
- 06:23 As you can see, the PDF is the form that we have been looking at.
- 06:27 It's symmetric and peaked at the center with the tails approaching zero,
- 06:32 while the CDF is a great s-curve plot.
- 06:35 The curve starts at 0, has a gentle upslope, and then in the center,
- 06:40 where the PDF is heaviest, the curve becomes quite steep.
- 06:44 And as it approaches the right tail of the bell shaped curve,
- 06:48 the CDF begins to flatten out again.
- 06:50 Let's contrast that with the geometric or exponential distribution.
- 06:53 The geometric is the histogram and
- 06:55 the curved line is the exponential distribution.
- 06:58 In either case, the PDF is heavy on the left side and
- 07:02 then tapers down to very small values on the right side.
- 07:05 Now when we look at the CDF, it starts at 0 and when we get to the first value
- 07:11 of the x-axis where the distribution starts, it quickly shoots up.
- 07:16 Then as the level of the PDF gets smaller and smaller, the CDF flattens out with
- 07:22 only very minor changes until it gets to the final value of 1.
- 07:26 Now I tried to put an S curve onto the CDF
- 07:29 plot just to show you that it doesn't fit very well at all.
- 07:33 The S curve shape will occur on CDF with a symmetric distribution,
- 07:38 but definitely does not with an asymmetric one.
- 07:41 >> You definitely need to be familiar with these different types of distributions if
- 07:46 you plan to sit for the IASSC exam.
- 07:48 And it's also helpful to understand them so
- 07:50 as to spot likely physical phenomena in the data.