Samples and Sample Selection

About this lesson

Hypothesis testing relies on the use of data samples. However, the power and value of the hypothesis test are based on the size of the sample and the means by which it was selected. In this lesson, we consider factors for selecting sample data points and we determine the size of the sample needed based on the desired accuracy of the answer.

Exercise files

Download this lesson’s related exercise files.

Determine Sample Size Exercise.docx (62.4 KB)
Determine Sample Size Exercise Solution.docx (62.6 KB)

Samples and Sample Selection

The size of a sample data set affects the width of the confidence interval in inferential statistics. For an accurate analysis, the sample must be large enough to give reliable estimates, and its data points must not be biased or skewed.

When to use

When the confidence level has been set and a desired margin of error or confidence interval has been established, use the sample size equation to determine the sample size. In addition, when collecting sample points, give appropriate consideration to which data points are selected so as not to bias the data.

Instructions

Inferential statistics are based on statistically analyzing a data sample and inferring characteristics about the full data population. Sampling is often done to save time and money because collecting all the data points from a population would be nearly impossible. Two obvious concerns are how big the sample is and how it was chosen.

The sample should meet certain characteristics if it is to be used as a surrogate for the entire population.

Representative – it includes data points that capture the fluctuation and changes found in the data population
Sufficient – there are enough data points to detect patterns in the data
Contextual – other system or environmental effects that could influence the data are recorded
Reliable – the measurement system is able to provide data that is precise and accurate
Random – every data point has an equal chance of being selected
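The "Random" criterion above can be illustrated with a minimal sketch of simple random sampling. It assumes the population can be listed in full; the function name `simple_random_sample` and the serial-number population are hypothetical, not taken from the lesson.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size items without replacement, each equally likely."""
    if sample_size > len(population):
        raise ValueError("sample_size cannot exceed the population size")
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, sample_size)


# Hypothetical population: 500 serial numbers from a production run.
population = list(range(1, 501))
sample = simple_random_sample(population, sample_size=25, seed=42)
print(sorted(sample))
```

Random selection only addresses the last criterion. Whether the sample is also representative, sufficient, contextual, and reliable still depends on how the population list was built and how the values are measured.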

Data collection should then be a well-considered and planned activity to ensure the data points will meet these criteria. That means that before data is collected, the problem must be defined so that you can decide what data is needed to conduct the analysis. The sample size can then be calculated using the formula:

$$n = \left(\frac{Z_{\alpha/2}\,\sigma}{E_m}\right)^2$$

Where n is the required sample size, rounded up to the next whole number

Where Z_α/2 is the Z value associated with the confidence level

Where σ is the population standard deviation

Where E_m is the margin of error that is acceptable for the statistics
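As the lesson notes, the sample size equation comes from the confidence interval equation. A short sketch of that rearrangement follows, assuming a confidence interval for the mean of the form x̄ ± Z_α/2 · σ/√n, so that the margin of error E_m is the half-width of the interval:

```latex
\begin{align*}
E_m &= Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \\
\sqrt{n} &= \frac{Z_{\alpha/2}\,\sigma}{E_m} \\
n &= \left(\frac{Z_{\alpha/2}\,\sigma}{E_m}\right)^2
\end{align*}
```

Because n must be a whole number of data points, the result is rounded up.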

Then decide on a sample selection process that ensures the data points are representative and random. Before collecting the data, ensure the measurement system will provide accurate and reliable data values.
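The sample size calculation described in these instructions (Z value times standard deviation, divided by the margin of error, all squared and rounded up) can be sketched as follows. This is a minimal illustration that assumes Python's standard-library `statistics.NormalDist` for the Z value; the function name `required_sample_size` and the example inputs are illustrative and not from the lesson.

```python
import math
from statistics import NormalDist


def required_sample_size(confidence_level, sigma, margin_of_error):
    """Return n = (Z_{alpha/2} * sigma / E_m)^2, rounded up."""
    if not 0 < confidence_level < 1:
        raise ValueError("confidence_level must be between 0 and 1")
    if sigma <= 0 or margin_of_error <= 0:
        raise ValueError("sigma and margin_of_error must be positive")
    alpha = 1 - confidence_level
    # Two-sided Z value: the point with alpha/2 in the upper tail.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * sigma / margin_of_error) ** 2)


# Assumed inputs: 95% confidence, sigma = 10 units, margin of error = 2 units.
n = required_sample_size(0.95, sigma=10.0, margin_of_error=2.0)
print(n)  # 97
```

If the population standard deviation is unknown, the standard deviation of a preliminary sample can be passed as `sigma` instead, as the lesson suggests.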

Hints & tips

The data collection approach, including how the data is collected and how many samples are collected, will limit the accuracy of the statistical analysis. So be certain that your approach will provide enough reliable data to conduct the analysis.
In some cases, there is already a large body of data that has been collected. If the contextual aspects indicate that you can get representative and reliable data from that database, additional data may not need to be collected.
The sample size equation is derived from the confidence interval equation: the half-width of the interval becomes the margin of error term in the sample size equation.
Based on the sample size equation, the required number of data points in the sample goes up when the confidence level is increased, or the standard deviation gets larger, or the margin of error is reduced. By the same token, the number of points in the sample can be reduced if the confidence level goes down, the standard deviation gets smaller or the allowed margin of error in the statistics can increase.
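The effect of each input on the required sample size can be checked numerically. The sketch below uses assumed example values (95% confidence, standard deviation of 10, margin of error of 2) as a baseline and changes one input at a time; the helper name `sample_size` is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(confidence_level, sigma, margin_of_error):
    """n = (Z_{alpha/2} * sigma / E_m)^2, rounded up to a whole data point."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return math.ceil((z * sigma / margin_of_error) ** 2)


baseline = sample_size(0.95, sigma=10.0, margin_of_error=2.0)
higher_confidence = sample_size(0.99, sigma=10.0, margin_of_error=2.0)
larger_sigma = sample_size(0.95, sigma=20.0, margin_of_error=2.0)
tighter_margin = sample_size(0.95, sigma=10.0, margin_of_error=1.0)

print(baseline, higher_confidence, larger_sigma, tighter_margin)
```

Because the standard deviation and the margin of error sit inside the squared term, doubling the standard deviation or halving the margin of error roughly quadruples the required sample size.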

00:04 Hi, I'm Ray Sheen.
00:05 Now many times when doing a Lean Six Sigma project,
00:08 you can decide how many samples will be used in the analysis.
00:12 Now you don't have to guess how many to use.
00:14 You can calculate the exact number that you will need based upon your alpha level.
00:20 >> Let's quickly review the terms of sample and population.
00:24 Through our sampling approach, we obtain a subset of the data points.
00:29 Through our analysis of the subset, we can make inferences about the population.
00:35 So sampling is used whenever we are unable to easily obtain all the data
00:39 points from a data population.
00:42 Sampling can be a very effective manner to get reliable statistical information,
00:47 provided we sample appropriately.
00:50 If the subset that we select, which is often called the sample set,
00:54 is truly representative of the entire population,
00:57 the sample statistics can be used to infer the population statistics.
01:03 So what does that mean?
01:05 Well, representative means that the variation in the sample is
01:08 similar to the variation in the full population.
01:12 Another attribute is that the sample is sufficient.
01:15 That means that there are enough data points in the sample subset to reveal any
01:20 patterns in the data.
01:23 Third, it is normally helpful to have contextual data,
01:26 which means data about what was happening in the environment in
01:30 which the population resides during the time that the sample was taken.
01:36 Fourth is reliable.
01:38 The sample points are measured in such a way that there is high confidence in
01:42 the precision and accuracy of the sample values.
01:46 And finally the sample is random.
01:48 That means that every point in the population had an equal chance of being
01:52 selected.
01:54 The person collecting the data didn't specially pick out certain items in
01:58 order to get the answer they wanted.
02:01 So to achieve all these characteristics just mentioned,
02:05 you need to do some planning about how you will select your sample data points.
02:10 The first step is obvious.
02:11 Decide what process or problem you will be analyzing.
02:15 This helps to establish the boundaries of what can be sampled.
02:19 Next based upon the analysis, consider what types of data you need.
02:24 Keep in mind in today's business environment,
02:26 there are often huge databases available that may already have plenty of
02:31 data points that you can select from.
02:33 You may not need to collect new data.
02:36 Now determine the size of the sample.
02:38 This is an easy calculation that we will discuss on the next slide.
02:42 A word of caution,
02:43 if you don't know anything about the data you may need to collect some preliminary
02:48 data in order to calculate the final number of data points that you will need.
02:53 Now decide how you will ensure that data is representative and random.
02:57 To do this, you will need to take into account such things as location, timing,
03:02 operators, and process performance.
03:05 Before collecting the data,
03:06 make sure you have a measurement system that you can trust.
03:10 And finally I want to highlight that determining the sample size is not
03:14 a trivial or irrelevant topic.
03:17 The sample size will impact the cost and time of the data collection.
03:22 It will also impact the confidence interval of the statistical values that
03:26 you calculate.
03:27 So think about the depth and
03:29 breadth of the data that is outside your sample population.
03:32 Consider if it is fairly represented by the sample.
03:36 So let's take a look at the equation to calculate the sample size.
03:40 We'll derive the sample size equation from the confidence interval equation.
03:45 Now in that equation, the width of the interval is the Z score for
03:49 the selected alpha times the standard deviation divided by the square root of
03:54 the number of data points in the sample subset.
03:58 The confidence interval equation can be transformed to an equation to
04:02 calculate that sample size n.
04:05 We see that if we take the Z value for the alpha level, times the standard
04:10 deviation and divide that by the desired magnitude of the confidence interval,
04:15 and then square all of that, we get the sample size value.
04:19 One rather obvious point: always round up to the next whole number.
04:24 We can't select a fraction of a data point.
04:28 So you see that to calculate the sample size,
04:31 we need the population standard deviation.
04:34 Of course we don't have that.
04:35 But if we have done an earlier preliminary sample, we can let the subset standard
04:40 deviation be a surrogate for the population standard deviation.
04:44 Next we need the confidence level,
04:46 which determines the alpha value used to calculate the Z-alpha term.
04:51 And finally, we need the targeted maximum margin of error,
04:55 which is called the Confidence Interval.
04:59 Looking at the equation, there are some obvious implications.
05:03 If you increase the required confidence level,
05:07 that will increase the Z-alpha value and increase the required number of data points.
05:13 And if the population has a high variability,
05:16 it will have a high standard deviation.
05:19 The standard deviation goes up, the number of samples goes up.
05:23 Finally, as the margin for
05:25 error, also known as the confidence interval, gets smaller, that will increase
05:31 the number of samples since the term is in the denominator of our formula.
05:36 >> Based upon your confidence level, desired margin of error and
05:40 the inherent variability within the data population,
05:44 you can precisely calculate the number of data points that you need for your sample.


