About this lesson
Data is often displayed visually in distributions. Recognizing the type of display and the nature of the distribution can aid in the selection and analysis of a hypothesis test. This lesson will explain data distributions and review the typical distributions for discrete data.
Quick reference
Distributions and Discrete Data
Datasets are often displayed in graphical distributions. Different distributions are indicative of different physical phenomena. Some distributions are unique to discrete data.
When to use
Visualizations of discrete datasets are often easier to use when explaining characteristics of the data than tables of numbers. In addition, the classes of distributions have specific characteristics which will dictate what type of hypothesis test is appropriate with that data.
Instructions
Discrete Data Distributions
Discrete data has a limited number of values and there is no meaningful data between the data categories. Therefore, these distributions are typically shown as bar charts (histograms), with one vertical bar per category or bucket whose height is the count of data elements in that category.
Bernoulli distribution
This is a distribution for data that can only take on one of two states, such as pass/fail. This distribution shows the percentage of instances in each state. The horizontal axis shows the two states. The vertical axis is the percentage or proportion of each state. An example is the percentage of first-time yield.
Binomial Distribution
This is a distribution for data that can only take on one of two states, such as pass/fail. It relies on using a fixed batch size. The horizontal axis is the count per batch. The vertical axis is normally the percentage of batches with that count. An example is the number of defective items in a batch.
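As a rough illustration of the binomial description above, the following sketch computes the probability of each defect count in a fixed-size batch. The batch size of 10 and the 10% defect rate are assumed values chosen only for illustration, and `binomial_pmf` is a hypothetical helper name. With a batch size of 1 the same formula reduces to the Bernoulli case.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k failures in a batch of n independent items,
    where each item fails with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.1  # assumed batch size and per-item defect probability

# One bar per possible count (0 through n defects in the batch).
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf[:4]):
    print(f"{k} defects in a batch of {n}: {prob:.4f}")
```

The list `pmf` holds the bar heights for the horizontal-axis values 0 through 10, and the heights sum to 1.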
Poisson Distribution
This distribution is a count of occurrences of an event. The batch is based upon a time period, such as a day. The horizontal axis is the count of the instances that occurred in that time period. The vertical axis is the proportion of time periods with that count. An example is the number of phone calls a day.
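The Poisson description above can be sketched in a few lines. The average of 4 calls per day is an assumed value for illustration, and `poisson_pmf` is a hypothetical helper name, not part of the lesson materials.

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    """Probability of exactly k events in one period when the
    average number of events per period is `rate`."""
    return rate**k * exp(-rate) / factorial(k)

rate = 4.0  # assumed average number of phone calls per day

# Bar heights for 0 through 14 calls in a day; higher counts are very unlikely.
pmf = [poisson_pmf(k, rate) for k in range(15)]

for k in range(8):
    print(f"{k} calls: {pmf[k]:.4f}")
```

Only whole-number counts appear on the horizontal axis, which is what makes this a discrete distribution.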
Uniform Distribution
This distribution is the probability of the occurrence of several different outcomes when there is an equal chance of any of the outcomes. The horizontal axis is the outcomes and the vertical axis is percentage. An example is the likelihood of any number on a roll of a die.
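For the die example above, a minimal sketch of the uniform distribution assigns the same probability to every face. A fair six-sided die is assumed.

```python
from fractions import Fraction

faces = range(1, 7)  # assumed: a fair six-sided die

# Every outcome gets the same probability, so all bars have equal height.
pmf = {face: Fraction(1, len(faces)) for face in faces}

for face, prob in pmf.items():
    print(f"face {face}: {prob} ({float(prob):.3f})")
```

Using `Fraction` keeps the probabilities exact, so the six values add up to exactly 1.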
Geometric Distribution
This distribution is again used with data items that only have two states. In this case, instead of counting the number in a batch, it is based on when the state changes. The horizontal axis is a timeline. The vertical axis shows the probability of the event changing state during that time period. An example is plotting the day of the month when rainfall or snowfall first occurs in that month.
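The rainfall example above can be sketched as follows, assuming each day has the same independent 20% chance of rain. Both that probability and the helper name `geometric_pmf` are illustrative assumptions.

```python
def geometric_pmf(k, p):
    """Probability that the first occurrence happens on trial k (k = 1, 2, ...),
    when each trial has an independent probability p of the event."""
    return (1 - p)**(k - 1) * p

p = 0.2  # assumed daily chance of rain

# Probability that the first rainy day of the month is day 1, day 2, ... day 30.
pmf = [geometric_pmf(day, p) for day in range(1, 31)]

for day in range(1, 6):
    print(f"first rain on day {day}: {pmf[day - 1]:.4f}")
```

The tallest bar is always the first period, and each later bar is smaller than the one before it.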
Hypergeometric Distribution
This distribution is again used with data items that only have two states. In this case, the information graphed is the proportional count within a sample. The horizontal axis is a count. The vertical axis shows the probability. An example is the number of defects found in a sample selection of a shipment of products.
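As a sketch of the shipment example above, the code below computes the probability of finding each possible number of defective items in a sample drawn without replacement. The shipment size of 50, the 5 defective items, and the sample size of 10 are assumed values, and `hypergeom_pmf` is a hypothetical helper name.

```python
from math import comb

def hypergeom_pmf(k, population, defective, sample):
    """Probability of exactly k defective items in a sample drawn
    without replacement from a population with `defective` bad items."""
    good = population - defective
    return comb(defective, k) * comb(good, sample - k) / comb(population, sample)

# Assumed values: a shipment of 50 with 5 defective items, and a sample of 10.
population, defective, sample = 50, 5, 10

max_found = min(defective, sample)
pmf = [hypergeom_pmf(k, population, defective, sample) for k in range(max_found + 1)]

for k, prob in enumerate(pmf):
    print(f"{k} defective in sample: {prob:.4f}")
```

Because the sampled items are not replaced, the chance of drawing a defective item changes after every draw, which is what separates this from the binomial case.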
PDF and CDF
The final topic in distributions is the PDF and CDF displays. PDF stands for Probability Density Function (for discrete data, the equivalent display is often called a probability mass function). This is the type of display used in all the curves shown earlier in this reference guide. The height of the curve or bar shows the probability that a data point will occur at that value of the horizontal axis. The higher the point, the more density at that point of the distribution.
CDF stands for Cumulative Distribution Function and shows the probability that a point in the distribution will have occurred by that level of the horizontal axis. In a CDF, the curve always starts at or near zero on the left end, because almost nothing falls below the low-end value, and ends at one on the right end, representing that all data points have occurred and the probability is then 100%.
If the distribution is a uniform distribution, the PDF is a flat line (as shown above) and the CDF is a straight diagonal line going from zero to one. If the underlying distribution is a normal curve when shown as a PDF, it is an S curve when shown as a CDF. The slope starts very shallow, because only small changes occur in the left tail of the normal curve. The slope becomes steep in the center, where the normal curve peaks, and then becomes shallow again as the horizontal axis approaches the right side of the normal curve. For an exponential curve, the CDF starts at zero, rises steeply at first, and then flattens out as it approaches a value of 1.
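The relationship between the two displays can be sketched as a running total: each CDF value is the sum of all PDF values at or to the left of that point. The example below uses an assumed fair six-sided die, so the PDF is flat and the CDF climbs in equal steps.

```python
from itertools import accumulate

# Assumed example: a fair six-sided die, so every face has probability 1/6.
faces = [1, 2, 3, 4, 5, 6]
pdf = [1 / 6] * len(faces)

# The CDF at each face is the running total of the PDF up to and including it.
cdf = list(accumulate(pdf))

for face, p, c in zip(faces, pdf, cdf):
    print(f"face {face}: PDF = {p:.3f}, CDF = {c:.3f}")
```

The same running-total step works for any of the discrete distributions in this guide. Only the shape of the resulting CDF changes.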
Hints & tips
- If the graph is a bar graph with a separate bar for each possible value, it is most likely discrete data.
- PDF and CDF show the same information, just with different ways of expressing the vertical scale values. PDF is for that specific horizontal scale value. CDF is for all the horizontal scale values to the left of that point.
Transcript
- 00:04 Hi, I'm Ray Sheen.
- 00:05 It's often very helpful to recognize what type of distribution you're working with.
- 00:11 Different types of physical phenomena create different types of distributions.
- 00:16 When you can recognize the distribution,
- 00:18 it provides insight into the process performance characteristics.
- 00:21 Let's take a look at some distributions of discrete data.
- 00:26 >> We'll actually start with an overview of PDF and CDF.
- 00:30 The probability density function and cumulative distribution
- 00:33 function are just two ways of graphically displaying the same data.
- 00:37 By the way, PDF and CDF can be used with either discrete or continuous data.
- 00:42 However, in this lesson, we'll be focusing on discrete data.
- 00:46 The probability density function is a graph that shows the likelihood or
- 00:51 percentage of time that an item within the distribution will have a particular
- 00:56 value or be within a particular range.
- 00:58 The number of instances or percentage of items of that value within the range
- 01:03 are plotted with a vertical bar chart.
- 01:05 The height of the bar represents the number of instances of the value.
- 01:09 The cumulative distribution function is the likelihood that an item in
- 01:13 a distribution will be equal to a distribution value or
- 01:17 less than that value.
- 01:18 At the left side of the curve, it always starts at 0.
- 01:21 There is no likelihood that anything could be less than that.
- 01:24 As we proceed through the CDF curve, it grows so that the right side of the curve
- 01:29 is always at a value of 1, or 100% of the distribution is to the left of that.
- 01:35 Let's consider some examples.
- 01:37 First is the normal distribution.
- 01:38 The PDF graph looks similar to a bell-shaped curve,
- 01:42 or, as it's called when created with discrete data, the Poisson distribution.
- 01:47 We see that the highest percentage of values occurs at the mid-point of
- 01:51 the range, and the graph is symmetrical with both the upper and
- 01:55 lower tails dropping down to 0.
- 01:58 The CDF graph of this curve is S-shaped.
- 02:02 Let's compare the PDF and CDF for a geometric or exponential curve.
- 02:07 We see in the PDF, that the highest number of items in the distribution are at
- 02:11 the left edge, and the percentage decreases down to zero at the right edge.
- 02:16 The CDF is still an S-shaped curve, but it has a very sharp rise on the left
- 02:21 side of the graph and then tapers off to near horizontal on the right side.
- 02:27 With our understanding of PDF and CDF in mind, let's look at some of
- 02:31 the common distributions that we see with discrete data.
- 02:34 I will show PDF views of the distributions.
- 02:37 Keep in mind that discrete data can only take on a limited number of values.
- 02:42 So often, our distribution is just a plot of the number of occurrences of those
- 02:46 discrete values.
- 02:48 An example of that would be the number of successes within a subgroup.
- 02:51 So let's look at some typical distributions.
- 02:54 The first is the Bernoulli distribution.
- 02:56 In this case, the data can take on one of two values,
- 03:01 it's either a 0 or 1, a true or false.
- 03:04 The distribution is the number of instances of one value and
- 03:08 the number of instances of the other value.
- 03:10 It shows us a proportional count of the two variables and
- 03:14 it will always be non-normal.
- 03:16 The next one to consider is the Poisson distribution.
- 03:19 This distribution counts the number of occurrences within a bucket or a category.
- 03:24 It's quite often used to count the number of occurrences within a set of
- 03:29 time periods, such as defects per day or per hour.
- 03:32 In this distribution,
- 03:33 we can picture what a typical level of a process performance looks like.
- 03:37 The normal condition and the min and max conditions.
- 03:40 Also, we are dealing with discrete data because there is a count of defects
- 03:45 per day.
- 03:45 We can't have 7.4 defects occurring; it's either 7 or 8.
- 03:51 But we can use this to set an expected level of process performance and
- 03:54 predict the probability of failure or defects.
- 03:57 This distribution is usually normal.
- 04:00 The geometric distribution shows some similarities with the Bernoulli and
- 04:05 Poisson, but there are some obvious differences.
- 04:08 Like the Bernoulli, it's often associated with pass/fail
- 04:12 data, where data can only take on one of two states.
- 04:16 But like the Poisson distribution, it's often tied to time increments or buckets.
- 04:21 So in this case, we determine how much time or what time bucket
- 04:26 did the item first change state from compliant to defective?
- 04:31 It's often used to predict failure rates.
- 04:33 This distribution is always non-normal.
- 04:37 Let's look at a few more typical discrete data distributions.
- 04:41 So we'll continue on with the binomial distribution.
- 04:43 Once again, the data element can only take on two values, such as pass or fail.
- 04:49 However, in this case,
- 04:51 the total data population is separated into subsets of a fixed size.
- 04:56 What is plotted is the number or percentage of subsets that have
- 05:01 a particular count, or fail value within that subset.
- 05:05 This distribution could be normal or non-normal.
- 05:08 The next one is the uniform distribution.
- 05:11 It is the probability of a particular outcome within the data population.
- 05:15 When the probability of any given data value is random,
- 05:19 the result is the uniform distribution.
- 05:21 An example would be rolling a particular value on any single die,
- 05:26 assuming the die isn't loaded. This distribution is not normal.
- 05:30 The last discrete distribution to consider is the hypergeometric distribution.
- 05:35 As with the geometric distribution, the data can take on one of two states, but
- 05:40 where the geometric distribution was counting how long until the data item
- 05:45 changed state, the hypergeometric distribution counts the number of items in
- 05:50 a subset of the data that are at the level of the change state.
- 05:54 And just a small caveat, this is a true value.
- 05:56 So it does not allow the tester to replace bad ones for good ones and
- 06:01 keep on testing.
- 06:02 They must test with the original items from the sample, both bad and good.
- 06:05 This could be either a non-normal or normal distribution.
- 06:09 >> You definitely need to be familiar with the different types of distributions
- 06:14 if you plan to sit for the IASSC Exam.
- 06:16 And it's also helpful to understand them so
- 06:19 that you can recognize physical phenomena in your data.
PMI, PMP, CAPM and PMBOK are registered marks of the Project Management Institute, Inc.