Retired course
This course has been retired and is no longer supported.
About this lesson
Exercise files
Download this lesson’s related exercise files.
Inferential Statistics.docx (63.2 KB)
Inferential Statistics - Solution.docx (63.3 KB)
Quick reference
Inferential Statistics
Inferential statistics relies on the statistical analysis of a subset or sample of an entire population of occurrences to draw conclusions about the entire population. A key to successful inferential statistics is the selection of the sample.
When to use
Inferential statistics are used when data from an entire population of occurrences or iterations of a product or process is not readily available. This may be because the product or process has been in use for a long time, or because access to the product or process is limited.
Instructions
Inferential statistics is a branch of statistical analysis that uses the statistical analysis of a subset, or sample, drawn from a data population to draw inferences about the statistical measures that apply to the entire population. In many cases, the entire data population is not available for measurement. This is particularly true for products or processes that have been in use for a long period of time. The earlier iterations of the product or process either no longer exist or are outside the control of the product or process manager, and therefore cannot be measured as part of the population.
Contrasting descriptive statistics with inferential statistics, there are a few obvious differences. Descriptive statistics analyze a set of data to provide insight into the real world business processes associated with that data. Inferential statistics analyze a sample set of data to provide insights into the larger data population from which the sample was drawn. Descriptive statistics are a mathematical analysis of the existing data. Inferential statistics use the descriptive statistics from the sample data and infer population statistics that will fall within a certain range.
Calculating descriptive statistics for the sample data will provide insight into the statistics applicable to the full population. Terminology that will be used in the hypothesis test discussions will differentiate at times between sample statistics and population statistics.
| | Size (# of points) | Mean | Standard Deviation |
| --- | --- | --- | --- |
| Population | N | μ | σ |
| Sample | n | x-bar | s |
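For example, to see how sample statistics stand in for population parameters, here is a minimal sketch in Python, assuming the built-in statistics module; the sample values are invented purely for illustration and are not taken from the exercise files.

```python
import statistics

# Hypothetical sample of 10 cycle times (minutes) drawn from a much larger process population
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.5, 12.0]

n = len(sample)                   # sample size n (the population size N is unknown or unavailable)
x_bar = statistics.mean(sample)   # sample mean x-bar, used as an estimate of the population mean μ
s = statistics.stdev(sample)      # sample standard deviation s (n-1 denominator), estimate of σ

print(f"n = {n}, x-bar = {x_bar:.2f}, s = {s:.3f}")
```

With a fair sample, x-bar and s become the working estimates of μ and σ in the hypothesis tests discussed later.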
While precise statistical values are only available for the actual data in the subset or sample, if that sample fairly represents the entire population, then those values are excellent surrogates for the statistical measures of the entire population. It is therefore imperative that the sampling approach used to gather the sample data is one that will fairly represent the entire population. There are several principles that must be followed for this to occur (a short random-sampling sketch follows the list). The sample must be:
- Representative: Sample accounts for changes in the process due to fluctuations in the process variables.
- Sufficient: The sample is large enough so that any patterns in the data are likely to be present in the population.
- Contextual: Soft data is collected to indicate what else is happening in the process.
- Reliable: Data collection is repeatable, reproducible, and does not influence the sampling.
- Random: Every member of the population has an equal opportunity of being selected.
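To make the "Random" principle concrete, here is a minimal sketch using Python's built-in random module; the population of record IDs is hypothetical, and in practice you would sample actual process records.

```python
import random

# Hypothetical population: 500 historical process records identified by ID 1..500
population_ids = list(range(1, 501))

random.seed(42)  # fixed seed only so this illustration is repeatable
sample_ids = random.sample(population_ids, k=30)  # simple random sample of 30 records;
                                                  # every record has an equal chance of selection

print(sorted(sample_ids))
```

A simple random sample like this avoids sorting or pre-screening the data toward only good or only bad points.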
Data collection is often relatively easy when there is already a large body of data available from the population. In that case existing data is used, rather than collecting new data. However, the sampling approach must reflect the principles described above. Another key question is the sample size. This will be addressed in a later lesson on confidence intervals. If new data is needed, a measurement systems analysis should be done to ensure the data can be trusted.
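Sample size is treated properly in the later lesson on confidence intervals, but as a rough, hedged preview: a common formula for the sample size needed to estimate a mean is n = (z·s / E)², where z is the z-score for the chosen confidence level, s is an estimate of the standard deviation (for example, from a pilot sample), and E is the acceptable margin of error. The numbers below are illustrative only.

```python
import math

z = 1.96   # z-score for a 95% confidence level (two-sided)
s = 0.8    # estimated standard deviation, e.g. from a small pilot sample (illustrative)
E = 0.25   # acceptable margin of error, in the same units as the data

n_required = math.ceil((z * s / E) ** 2)  # round up: you cannot collect a fraction of a point
print(n_required)  # 40 with these illustrative values
```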
Hints & tips
- If all the data is available, use it. Don’t rely on inferential statistics.
- Carefully consider your sampling plan to ensure it is representative, sufficient, reliable, random, and has appropriate contextual information. Your existing data may not cover all of these. If so, it is best to collect additional data rather than ignoring one or two of the data sample characteristics.
- 00:04 Hi, I'm Ray Sheen.
- 00:06 You know often when we conduct a hypothesis test, all of the data from
- 00:11 a process's population is not available; instead, we only have a sample.
- 00:15 From that sample, you will need to be able to
- 00:18 infer meaningful information about the entire population.
- 00:23 For that, we need inferential statistics.
- 00:26 So what do we mean by the term, inferential statistics?
- 00:30 Inferential statistics is exactly what it says.
- 00:33 It relies on statistical data of a subset or sample data from a data population
- 00:39 to infer or draw conclusions about the entire population or dataset.
- 00:44 We use this approach when the entire population is not available, but
- 00:47 only a subset.
- 00:48 We study that subset in detail and then draw conclusions about the full dataset.
- 00:53 Just to be clear, if the full dataset is available, use it.
- 00:57 Years ago, when the analysis was done by hand,
- 00:59 analyzing large datasets was time consuming and error prone.
- 01:03 Now, with modern data analysis applications, this is not a problem.
- 01:08 So if you have the entire dataset, use it all, but
- 01:11 if not, then we'll work with the sample.
- 01:13 Many times the data set that is available does not represent all of the data that is
- 01:17 applicable.
- 01:18 You may not be able to get all the data from all locations or
- 01:21 all occurrences that have happened throughout all time for that process.
- 01:26 A limitation that we have to deal with is the amount of data that we have in
- 01:30 the data sample.
- 01:31 To be able to extrapolate that data and infer conclusions about the larger
- 01:34 dataset, we need to assess whether that sample data is a good surrogate for
- 01:39 the entire data population.
- 01:41 Before we get into that analysis, let's quickly review descriptive statistics and
- 01:45 compare that to inferential statistics.
- 01:48 Descriptive statistics describes the data that is being studied.
- 01:52 We often calculate things like the mean, median, and standard deviation of that
- 01:56 dataset, and note its size, small n, the number of items in the dataset.
- 02:01 The numerical analysis gives us a mathematical description
- 02:05 of the real world that is represented by that dataset.
- 02:08 Inferential statistics builds on descriptive statistics.
- 02:11 It uses the descriptive statistics from a sample set of data to draw conclusions
- 02:16 about what the descriptive statistics will be for the full population.
- 02:21 By extrapolating the sample data statistics,
- 02:23 we can then draw conclusions about the performance of the larger population.
- 02:27 Even though we don't have the data from that population.
- 02:31 So the sample mean or the sample standard deviation can be extrapolated, and
- 02:35 a full population mean or standard deviation can be determined.
- 02:39 Of course, to do this, we need to carefully consider what is in the sample.
- 02:43 The sampling approach is used to simplify the analysis.
- 02:46 When the cost or complexity of trying to collect all the data is too prohibitive or
- 02:51 some of the items in the population are just not available then sampling is
- 02:54 required.
- 02:55 Sampling is a sound business practice that is used to establish business performance
- 03:00 and estimate process variability.
- 03:02 But there are some ground rules to follow when determining what data is
- 03:05 in your sample.
- 03:06 If the sample is well designed, it can be used as a surrogate for
- 03:09 the entire population.
- 03:11 So some characteristics of the sample are that it must be representative.
- 03:14 The sample stands in for the entire population.
- 03:17 It accounts for changes in the process due to fluctuations in the process variables.
- 03:22 Another characteristic is that it is sufficient.
- 03:24 The sample is large enough so that any patterns in the population data
- 03:28 are also likely to be present in the sample data.
- 03:32 The next characteristic of a good sample is to understand the contextual aspects
- 03:36 of the sample.
- 03:37 This means soft data is collected to indicate what else is happening in
- 03:40 the process.
- 03:41 For instance, the date and time or other business conditions.
- 03:45 Now let's talk about the need for reliable data.
- 03:48 Data collection is repeatable and reproducible.
- 03:51 The collection process does not influence the sampling.
- 03:54 Therefore, the data is a reliable representative of the full population.
- 03:59 Finally, the data is random.
- 04:01 By that, we mean that every member of the population has an equal opportunity of being
- 04:05 selected.
- 04:06 The data wasn't sorted to only get good or only get bad data points.
- 04:10 Now obviously, some of the data points may not be available any longer.
- 04:15 But the points that are selected are not pre-screened for
- 04:18 certain criteria, which leads to a need to carefully create a sampling plan.
- 04:22 So let's go through how to do this.
- 04:24 It starts with the question or problem that is being analyzed.
- 04:27 This will determine the population.
- 04:29 Next, consider what type of data is needed to answer that question.
- 04:33 Often the data already exists.
- 04:35 Once we know what is needed,
- 04:36 we can determine how much of the data we can easily access.
- 04:40 Now determine the sample size.
- 04:42 Of course, if all the data's available, use it.
- 04:45 If not, you must determine how much data is needed to have
- 04:48 a statistically significant sample.
- 04:50 We have a lesson coming up that will discuss how to calculate the sample size
- 04:53 based upon the confidence interval.
- 04:56 This calculation is based upon some of the descriptive statistics of the sample.
- 05:00 So based on those statistics, a second sample may need to be collected.
- 05:04 Once you've determined the size and the data that needs to be collected,
- 05:07 create an approach to ensure that it is random.
- 05:10 For example, take every fifth point, or a representative cut such as all locations,
- 05:14 all product lines, multiple time periods,
- 05:18 all operations, and any other parameters that would routinely vary.
- 05:24 You're now ready to establish your measurement system and
- 05:27 data collection process.
- 05:28 You may need to conduct a measurement system analysis to ensure that
- 05:31 the measurement error is not excessive.
- 05:34 Now start collecting the data or extracting it from the existing databases.
- 05:38 The data collection constraints and restrictions will ultimately
- 05:41 impact the level of inference that you can apply to the sample data.
- 05:44 This level of inference will be quantified using a confidence level that we will
- 05:49 discuss in a future lesson.
- 05:51 Hypothesis testing relies on the principle of inferential statistics.
- 05:56 What we know about the sample will influence what we think we know about
- 06:01 the entire population.
- 06:02 And based upon our data definition and the data collection approach we've used,
- 06:07 we can have high confidence in that inference.