About this lesson
Data is often displayed visually in distributions. Recognizing the type of display and the nature of the distribution can aid in the selection and analysis of a hypothesis test. This lesson will explain data distributions and review the typical distributions for discrete data.
Quick reference
Distributions and Discrete Data
Datasets are often displayed in graphical distributions. Different distributions are indicative of different physical phenomena. Some distributions are unique to discrete data.
When to use
A visualization of a discrete dataset is often an easier way to explain the characteristics of the data than a table of numbers. In addition, each class of distribution has specific characteristics that help determine which type of hypothesis test is appropriate for that data.
Instructions
Discrete Data Distributions
Discrete data can take only a limited number of values, and there is no meaningful data between the data categories. These distributions are therefore shown as histograms, with one vertical bar for each category or bucket and the height of the bar giving the count of data elements in it.
Bernoulli distribution
This is a distribution for data that can only take on one of two states, such as pass/fail. This distribution shows the percentage of instances in each state. The horizontal axis is the two states. The vertical axis is the percentage or proportion of each state. An example is the percentage of first-time yield.
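As a minimal sketch of the two-bar display described above, the following computes the proportion of observations in each state. The inspection results are made-up illustration data, and `state_proportions` is a hypothetical helper name, not something from the lesson.

```python
from collections import Counter

# Hypothetical first-time inspection results: True = pass, False = fail.
results = [True, True, False, True, True, True, False, True, True, True]


def state_proportions(outcomes):
    """Return the proportion of observations that fall in each state."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {state: counts[state] / total for state in counts}


proportions = state_proportions(results)

# Each entry is the height of one of the two bars.
for state, share in proportions.items():
    print(f"{'pass' if state else 'fail'}: {share:.0%}")
```

The proportion for the pass state is the first-time yield of this sample.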
Binomial Distribution
This is also a distribution for data that can only take on one of two states, such as pass/fail. It relies on using a fixed batch size and counts how many items in each batch are in one of the states. The horizontal axis is the count per batch. The vertical axis is normally the percentage of occurrences for that count. An example is the number of defects in a batch.
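The bar heights for a count-per-batch display can be sketched with the standard binomial formula, P(k) = C(n, k) p^k (1 - p)^(n - k). The batch size of 20 and the 5% defect rate below are assumed values for illustration only, and the formula assumes each item is defective independently of the others.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k defects in a batch of n items,
    when each item is defective with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


batch_size = 20     # assumed fixed batch size
defect_rate = 0.05  # assumed chance that any one item is defective

# One probability per possible defect count: 0, 1, ..., batch_size.
distribution = [
    binomial_pmf(k, batch_size, defect_rate) for k in range(batch_size + 1)
]

for k, prob in enumerate(distribution[:5]):
    print(f"{k} defects: {prob:.3f}")
```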
Poisson Distribution
This distribution is a count of occurrences of an event. The batch is based on a period of time, such as a day. The horizontal axis is the count of the instances that occurred in that time period. The vertical axis is the percentage or probability of time periods with that count. An example is the number of phone calls a day.
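A sketch of the probabilities behind such a chart, using the standard Poisson formula P(k) = λ^k e^(-λ) / k!, where λ is the average count per time period. The average of 4 calls per day is an assumed value chosen only for illustration.

```python
from math import exp, factorial


def poisson_pmf(k, mean_rate):
    """Probability of exactly k events in one time period,
    given the average number of events per period."""
    return mean_rate**k * exp(-mean_rate) / factorial(k)


mean_calls_per_day = 4.0  # assumed average, not taken from real data

# Probability of seeing 0 through 10 calls in a single day.
distribution = [poisson_pmf(k, mean_calls_per_day) for k in range(11)]

for k, prob in enumerate(distribution):
    print(f"{k} calls: {prob:.3f}")
```

Unlike the binomial case, there is no upper limit on the count, so the list above is truncated at 10 calls.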
Uniform Distribution
This distribution is the probability of the occurrence of several different outcomes when there is an equal chance of any of the outcomes. The horizontal axis is the outcomes and the vertical axis is the percentage or probability of each outcome. An example is the likelihood of any number on a roll of a die.
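For the die example, every bar has the same height. A minimal sketch, assuming a fair six-sided die and using exact fractions so that the equal probabilities are easy to see:

```python
from fractions import Fraction

faces = range(1, 7)  # a fair six-sided die is assumed

# Every outcome gets the same probability: 1 / number of outcomes.
pmf = {face: Fraction(1, len(faces)) for face in faces}

for face, prob in pmf.items():
    print(f"face {face}: {prob}")
```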
Geometric Distribution
This distribution is again used with data items that only have two states. In this case, instead of counting the number in a batch, it is based on when the state changes. The horizontal axis is a timeline. The vertical axis shows the probability of the event changing state during that time period. An example is plotting the day of the month when rainfall or snowfall first occurs in that month.
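The probability that the first change of state happens in period k can be sketched with the standard geometric formula, P(k) = (1 - p)^(k - 1) p. The 20% daily chance of rain is an assumed value, and the formula also assumes that each day is independent of the others, which real weather may not satisfy.

```python
def geometric_pmf(k, p):
    """Probability that the first occurrence happens on trial k
    (k = 1, 2, 3, ...), when each trial succeeds with probability p."""
    return (1 - p) ** (k - 1) * p


daily_rain_chance = 0.2  # assumed, and treated as independent day to day

# Probability that the first rainfall of the month falls on each of days 1-10.
distribution = [geometric_pmf(day, daily_rain_chance) for day in range(1, 11)]

for day, prob in enumerate(distribution, start=1):
    print(f"day {day}: {prob:.3f}")
```

The bars get steadily shorter from left to right, because a later first occurrence requires every earlier day to have been dry.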
Hypergeometric Distribution
This distribution is again used with data items that only have two states. In this case, the information graphed is the proportional count within a sample. The horizontal axis is a count. The vertical axis shows the probability. An example is the number of defects found in a sample selection of a shipment of products.
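The shipment example can be sketched with the standard hypergeometric formula, P(k) = C(K, k) C(N - K, n - k) / C(N, n), which applies when the sample is drawn without replacement. The shipment size, number of defective units, and sample size below are assumed values for illustration.

```python
from math import comb


def hypergeometric_pmf(k, population, defects, sample):
    """Probability of finding exactly k defective units in a sample drawn
    without replacement from a population containing `defects` bad units."""
    return (
        comb(defects, k)
        * comb(population - defects, sample - k)
        / comb(population, sample)
    )


shipment_size = 50    # assumed number of units in the shipment
defective_units = 5   # assumed number of defective units in the shipment
sample_size = 10      # assumed number of units inspected

# The sample cannot contain more defects than exist or than were sampled.
max_found = min(defective_units, sample_size)
distribution = [
    hypergeometric_pmf(k, shipment_size, defective_units, sample_size)
    for k in range(max_found + 1)
]

for k, prob in enumerate(distribution):
    print(f"{k} defects in sample: {prob:.3f}")
```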
PDF and CDF
The final topic in distributions is the PDF and CDF displays. PDF stands for Probability Density Function. This is the type of display used in all the curves shown earlier in this reference guide. The height on the vertical axis shows the probability that a data point will occur at that value of the horizontal axis. The higher the point, the more density at that point of the distribution. For discrete data, the same display is more precisely called a Probability Mass Function (PMF).
CDF stands for Cumulative Distribution Function and shows the probability that a point in the distribution will have occurred at or below that level of the horizontal axis. In a CDF, the curve always starts at or near zero on the left end, where only the lowest values have been counted, and ends at one on the right end, where all data points have occurred and the probability is 100%.
If the distribution is a uniform distribution, the PDF is a flat line (as shown above) and the CDF is a straight diagonal line going from zero to one. If the underlying distribution is a normal curve when shown as a PDF, it is an S curve when shown as a CDF. The slope starts very shallow, because only small changes occur on the left tail of the normal curve. The slope becomes steep in the center, where the normal curve peaks, and then becomes shallow again as the horizontal axis approaches the right side of the normal curve. For an exponential distribution, the CDF starts at zero, rises steeply at first, and then flattens out as it approaches the value of 1.
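The relationship between the two displays can be sketched as a running total: each CDF value is the sum of the PDF values at that point and everything to its left. The example assumes a fair six-sided die, so the flat probabilities turn into equal steps that end at one.

```python
from fractions import Fraction
from itertools import accumulate

# Flat probabilities for a fair six-sided die (assumed example).
pmf = [Fraction(1, 6)] * 6

# Running total of the probabilities gives the cumulative distribution.
cdf = list(accumulate(pmf))

for face, (point, cumulative) in enumerate(zip(pmf, cdf), start=1):
    print(f"face {face}: probability {point}, cumulative {cumulative}")
```

The same running-total step works for any of the discrete distributions in this guide; only the list of probabilities changes.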
Hints & tips
- If the graph is a bar graph (histogram), it is typically discrete data.
- PDF and CDF show the same information, just with different ways of expressing the vertical scale values. The PDF value is for that specific horizontal scale value. The CDF value is for that point and all the horizontal scale values to its left.