Locked lesson.
About this lesson
When the data is discrete data, but there are more than two samples to be compared, the Chi Square test is used. This test is quickly accomplished using Minitab. It can be done using Excel, but requires several intermediate steps. This lesson explains this test approach.
Exercise files
Download this lesson’s related exercise files.
Chi Square Exercise.xlsx12.1 KB Chi Square Exercise Solution.docx
218.8 KB
Quick reference
Chi Square
The Chi Square test is a hypothesis test that considers categories and counts of discrete data items to determine whether the categories are independent.
When to use
When the data that is being analyzed is discrete data and the actual data is counts of different categories, the Chi Square test is the appropriate hypothesis test to determine if the categories are independent.
Instructions
The Chi Square test is a commonly used test to determine the independence of categories within a data set. This test can be used to separate statistically significant dependent relationships from those that are independent. Chi Square can be used when working across multiple sample sets of data.
The data is normally organized in a table of counts. The columns are the count categories and the rows represent the different samples as shown in the table below.
The Chi Square hypothesis test is:
Null hypothesis: Factors are independent
Alternative hypothesis: Factors are dependent
The P value will be based on whether the Chi Square statistic exceeds a value from the Chi-Square table that is based upon the alpha selected and the number of degrees of freedom.
Excel has a function for conducting a Chi Square test.
- The data is first recorded in a table in the format shown above – this is the “Actuals” table.
- An “Expected” table is created by multiplying the Row Percentage times the Column Percentage times the grand total of the counts for each of the cells in the matrix.
- The total for each row and each column should be the same in both matrices, although the Actual matrix will have whole number counts in each cell and the Expected matrix will have a calculated value that is normally not a whole number in each cell.
- Use the CHISQ.TEST function and provide the range for each matrix.
- Excel provides a P value for independence.
Minitab is able to calculate a Chi Square test.
1. Stat > Tables > Chi Square Test for Association
2. Select the data columns (You do not need to create the “Expected” table, Minitab will do that automatically.)
Hints & tips
- If doing the analysis in Excel, be sure the totals for columns and rows are the same in both the Actual and Expected tables.
- Chi Square will tell you if at least some of the factors are independent, however, if some are and some are not, it will not separate out which factors are independent and which are dependent. You will need to test that by reducing some of the columns in your table.
- 00:04 Hi, I'm Ray Sheen.
- 00:06 Sometimes we have many discrete data sets, and
- 00:08 we want to understand whether there's a relationship between them.
- 00:12 The Chi-Square hypothesis test will answer that question for us.
- 00:18 >> Start with our hypothesis test decision tree.
- 00:21 We go to the discrete X and Y data and
- 00:24 more than two variables which is the Chi-Square test.
- 00:29 The Chi-Square test is ideally suited for multiple attribute data hypotheses.
- 00:34 If you have two or more discrete variables in your sample,
- 00:37 it can be used to determine if they are independent or if they are linked.
- 00:42 Some of you may be aware of other uses of Chi-Square test and basic research,
- 00:46 such as checking for goodness of fit or homogeneity.
- 00:50 We will be focusing on the Lean Six Sigma projects, and
- 00:53 multiple discrete attribute testing.
- 00:56 And it can use more than just true false or on off data.
- 01:00 It's excellent for counts.
- 01:02 If you're only comparing two samples, the two sample tests or proportions is best.
- 01:07 But if you are comparing two or
- 01:08 more variables within one sample then use this test.
- 01:12 This test relies on the number of counts for each variable attribute characteristic
- 01:17 to determine if the different attributes are truly independent or
- 01:20 if they are related to each other or some other factor.
- 01:25 When using this test in Lean Six Sigma, be careful to choose your categories for
- 01:29 real potential root causes.
- 01:31 The Chi-Square test doesn't know if the categories make sense to be tracked
- 01:35 separately.
- 01:36 So if you're trying to determine factors relating to people who have arthritis,
- 01:41 Chi-Square will likely show that there is a relationship between people with gray
- 01:45 hair and arthritis.
- 01:47 But that does not mean that gray hair causes arthritis.
- 01:51 Obviously, much more relevant factor would be age which
- 01:54 increases the likelihood of both gray hair and susceptibility to arthritis.
- 01:59 The null hypothesis for the Chi-Square is always that the factors are all
- 02:03 independent, there is nothing connecting any of them.
- 02:07 The alternative hypothesis is that the factors are dependent.
- 02:11 There is a relationship between at least two of the factors.
- 02:14 Let me explain the Chi-Square statistic.
- 02:17 The Chi-Square test will use both the actual data and
- 02:21 a table of expected values which is referred to as a contingency table.
- 02:26 Two mirror image tables, at least in format, are created.
- 02:30 One table has the actual values, and the other table will have expected values,
- 02:34 if there were no relationships between the factors.
- 02:38 With these two tables, the chi-square statistic is calculated
- 02:42 by comparing the proportions from the actual data and the calculated data.
- 02:46 The formula is shown here.
- 02:49 The calculated Chi-Square value is then compared to the Chi-Square value from this
- 02:53 lookup table.
- 02:54 The lookup table is based upon the selected alpha, and
- 02:57 the number of degrees of freedom.
- 03:00 Let's look at how to do this with Excel.
- 03:02 It's a rather complex process that must be followed.
- 03:06 Excel uses the CHISQ TEST function to compare tables of actual values and
- 03:11 the one of expected values.
- 03:14 The actual value we can get from our data.
- 03:17 The expected value we have to create manually.
- 03:20 Both tables are structured in the same manner.
- 03:23 The columns are the event categories where we have the count data.
- 03:27 The rows are the categories that we are using to create the different proportions.
- 03:32 The actual table is easy to populate.
- 03:35 Create your rows and column categories and
- 03:37 then count the instances for each cell in the table.
- 03:41 The expected table is calculated by using the percentage values for
- 03:45 each total row and total column from the actual table.
- 03:49 The value of a cell in the expected table is the row percentage from the actual
- 03:53 table times the column percentage from the actual table times the total count of all
- 03:58 the items in the actual table.
- 04:01 Let's look at an example.
- 04:03 In this example we want to determine if the types of entertainment venues
- 04:06 people attend, is dependent upon their age.
- 04:09 The null hypothesis is that attendance at the different venues and
- 04:13 a person's age are independent.
- 04:15 The alternative hypothesis is that the type of entertainment venue attended does
- 04:20 depend upon the age of the individual.
- 04:23 This table is the actual table.
- 04:25 Their ages are divided into eight categories and
- 04:28 there are seven different entertainment venues.
- 04:31 The total for each row and each column is added up.
- 04:34 And the total for the entire table is added up in the lower right corner.
- 04:39 A total of 1181 data points are in this table.
- 04:42 Now the percentage for each row and
- 04:44 each column is also determined by dividing the row or column total by 1,181.
- 04:51 Next we will build the expected table.
- 04:53 To do this for each cell, take the column percentage for that cell,
- 04:57 multiply it times the role percentage for that cell, and
- 05:00 then multiply it times the grand total.
- 05:03 So in our upper left cell, we take the column percentage for
- 05:07 movies of 19.31%, multiply it times the row percentage for
- 05:13 age 6-12, which is 5.42%, and then multiply that result times 1,181.
- 05:19 Which is the total number of counts in the table.
- 05:23 This gives us a value of 12.36 for that cell.
- 05:27 Then continue with this process for all of the other 55 cells in the table.
- 05:33 Now you are ready to use the Chi-Square test function.
- 05:36 The argument for that functions are the range of the two tables.
- 05:40 Be sure that you do not use the total and percentage rows and
- 05:43 columns in the actual tables, just the rows and columns with data.
- 05:48 The answer from Excel is 1.59 times E to the minus 30 nights which means there
- 05:53 are 38 zeros to the right of the decimal point before we start to see any numbers.
- 06:00 In Lean Six Sigma, we call that P value 0.
- 06:03 So reject the null hypothesis.
- 06:06 The entertainment venues selected by individuals does depend upon their age.
- 06:10 Well, the Chi-Square test function is much easier to complete in Minitab.
- 06:15 The mini tap version of Chi-Square test uses the same actual table as Excel.
- 06:20 But Minitab will create the expected table for you so you don't need to do that.
- 06:25 To do this test, go to the stat pulldown menu, select tables, which is
- 06:30 near the bottom of the list and then select Chi-Square test for association.
- 06:34 That will bring up this panel.
- 06:37 Assuming your data is in the table format, select summarize data into a table.
- 06:43 Then select the table columns that contain the data.
- 06:47 Now select the columns with your category label.
- 06:50 In our example, it was the Age column.
- 06:53 Minitab now creates the expected value, runs the calculation, and
- 06:57 provides a P-value.
- 06:59 Just like with Excel, the P-value is 0, so reject the null hypothesis.
- 07:03 Age does influence the entertainment venue selection.
- 07:07 >> Chi-Square takes some work to do the analysis in Excel,
- 07:10 but it is much easier in Minitab.
- 07:13 Now whichever approach you use,
- 07:15 this hypothesis test can help us to bring order from confusion.
Lesson notes are only available for subscribers.
PMI, PMP, CAPM and PMBOK are registered marks of the Project Management Institute, Inc.