About this lesson
A statistical analysis or test creates a mathematical model to fit the data in the sample. The real-world data seldom precisely fits the model. The differences between the model and the actual data are known as residuals. This lesson explains how to read residual graphs and analyses.
Quick reference
Residual Analysis
Residuals are the differences between the actual data values and the values predicted by the hypothesis test's model. The analysis of the residuals is a way of assessing the validity of the hypothesis test.
When to use
When the hypothesis test creates a formula or prediction of the data values, residuals can be calculated. Residuals are created for hypothesis tests that use regression analysis and ANOVA.
Instructions
Some hypothesis tests form a “best fit” equation to model the system performance based on the data set. These “best fit” equations should closely approximate the real world, but normally the actual values will be slightly different. When creating the “best fit,” each actual value is compared to its predicted value, and the difference is a residual. The “best fit” solution is determined by a set of calculations on these residuals. For an ordinary least-squares fit that includes an intercept, the mean of the residuals is zero and the sum of the squared residuals is at a minimum.
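The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure Minitab or Excel uses internally: the data values are hypothetical, and the helper names `fit_line` and `residuals` are made up for this example. It fits a straight line by ordinary least squares and then checks the residuals.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residuals(x, y, intercept, slope):
    """Residual = actual value minus the value predicted by the line."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical sample data (an assumption for illustration only)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

a, b = fit_line(x, y)
res = residuals(x, y, a, b)
sse = sum(r ** 2 for r in res)  # sum of squared residuals

print("mean residual:", round(mean(res), 10))
print("sum of squared residuals:", round(sse, 4))
```

Shifting the intercept or slope away from the fitted values and recomputing `sse` is a quick way to see that any other line gives a larger sum of squared residuals.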
The residuals can be plotted, and a review of the plots provides an assessment of whether the “best fit” is truly a good model for the data set. There are several things to consider when reviewing the residuals. The first is whether the residuals are normally distributed. A valid “best fit” should produce normally distributed residuals. That is characterized by a mean of zero, by approximately the same number of points above and below the zero line (the distribution is not skewed), and by a central tendency in the data, meaning appropriate kurtosis. When plotted in a histogram, the residuals should form a bell-shaped curve. When plotted on a normal probability plot, the residuals should fall on or very near the reference line.
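As a rough numerical companion to the visual checks above, the sketch below computes skewness, excess kurtosis, and the points of a normal probability plot for a set of residuals. The residual values are hypothetical, the function names are invented for this example, and the plotting position `(i - 0.5) / n` is only one common convention; statistical packages may use a different one.

```python
from statistics import NormalDist, mean, pstdev

def skew_and_excess_kurtosis(values):
    """Both are near zero for normally distributed data."""
    m, s, n = mean(values), pstdev(values), len(values)
    skew = sum(((v - m) / s) ** 3 for v in values) / n
    kurt = sum(((v - m) / s) ** 4 for v in values) / n - 3.0
    return skew, kurt

def normal_plot_points(values):
    """Pair each sorted residual with its theoretical normal quantile."""
    n = len(values)
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, sorted(values)))

def pearson_r(pairs):
    """Correlation between the two coordinates of the plotted points."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical residuals (an assumption for illustration only)
res = [-1.2, -0.7, -0.4, -0.1, 0.0, 0.1, 0.4, 0.7, 1.2]

skew, kurt = skew_and_excess_kurtosis(res)
points = normal_plot_points(res)
r = pearson_r(points)

print("skewness:", round(skew, 4))
print("excess kurtosis:", round(kurt, 4))
print("normal plot correlation:", round(r, 4))
```

A correlation close to 1 means the points lie close to a straight line on the normal probability plot. With small samples these numbers are only indicative, so the plots themselves should still be reviewed.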
In addition, the value of the residuals should not depend on the time order of the process. That means that neither the mean nor the magnitude (spread) of the residuals should change over time.
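One simple way to put a number on this check is to compare the first half of the residuals, in run order, with the second half. The sketch below is an informal screen under stated assumptions, not a formal test: the residual values are hypothetical and the function name `halves_summary` is invented for this example.

```python
from statistics import mean

def halves_summary(residuals_in_order):
    """Compare the mean and the mean absolute size of early vs. late residuals."""
    mid = len(residuals_in_order) // 2
    first = residuals_in_order[:mid]
    second = residuals_in_order[mid:]
    return {
        "mean_first": mean(first),
        "mean_second": mean(second),
        "spread_first": mean([abs(r) for r in first]),
        "spread_second": mean([abs(r) for r in second]),
    }

# Hypothetical residuals listed in the order the process was run;
# their size grows over time (an assumption for illustration only)
res = [0.1, -0.2, 0.2, -0.3, 0.5, -0.6, 0.9, -1.0, 1.4, -1.5]

summary = halves_summary(res)
for name, value in summary.items():
    print(name, round(value, 3))
```

If the spread in the second half is much larger than in the first, the residuals are widening over time, which suggests that some effect is not captured by the model.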
Finally, when reviewing the residuals plotted against either the order in which they occurred or the fitted value of the response variable, watch for patterns in the data. A strong pattern is an indication that the “best fit” is missing something.
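Patterns in the run order can also be screened numerically. The sketch below computes the Durbin-Watson statistic, a standard measure of first-order correlation between consecutive residuals; it is offered here as a supplement to eyeballing the plot, not as something the lesson itself prescribes. Values near 2 suggest no such correlation, values well below 2 suggest that neighboring residuals tend to have the same sign, and values well above 2 suggest that they tend to alternate. The residual values are hypothetical.

```python
def durbin_watson(residuals_in_order):
    """Sum of squared successive differences divided by sum of squared residuals."""
    diffs = [
        later - earlier
        for earlier, later in zip(residuals_in_order, residuals_in_order[1:])
    ]
    numerator = sum(d ** 2 for d in diffs)
    denominator = sum(r ** 2 for r in residuals_in_order)
    return numerator / denominator

# Hypothetical residuals that change sign every three points
# (an assumption for illustration only)
res = [0.5, 0.8, 0.6, -0.5, -0.9, -0.7, 0.6, 1.0, 0.8, -0.7, -1.1, -0.9]

dw = durbin_watson(res)
print("Durbin-Watson statistic:", round(dw, 3))
```

This statistic only detects correlation between adjacent points, so a slow oscillation or a curve in the versus-fits plot still needs a visual check.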
When the residual analysis indicates a problem with the “best fit,” the solution is normally to switch to a multivariate solution or a non-linear solution. Both of those approaches introduce additional terms into the “best fit” equations that will account for the observed issues. These topics are discussed more in later lessons.
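The effect of a missing non-linear relationship can be seen in a small sketch. The data below are hypothetical and follow an exact exponential decay. A straight line fitted to the raw values leaves a clear pattern in the residuals, while a straight line fitted to the logarithm of the response does not. Note that the log transform is used here only as a simple stand-in for a non-linear model; it is an assumption of this example, not the method covered in the later lessons, and the helper names are invented.

```python
import math
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals(x, y, intercept, slope):
    """Residual = actual value minus the value predicted by the line."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical data following y = 10 * exp(-0.5 * x) exactly
x = list(range(8))
y = [10.0 * math.exp(-0.5 * xi) for xi in x]

# Straight line on the raw response: residuals are positive at both
# ends and negative in the middle, a distinct curved pattern
a_raw, b_raw = fit_line(x, y)
res_raw = residuals(x, y, a_raw, b_raw)

# Straight line on log(y): the curvature is accounted for
log_y = [math.log(yi) for yi in y]
a_log, b_log = fit_line(x, log_y)
res_log = residuals(x, log_y, a_log, b_log)

print("raw-fit residuals:", [round(r, 2) for r in res_raw])
print("largest log-fit residual:", max(abs(r) for r in res_log))
```

Real data will not fall exactly on a curve, so the residuals after the change will not vanish; the point is that the systematic pattern should disappear.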
Hints & tips
- Minitab will create the residual plots when you select the Graphs button and choose which residual graphs to use. I normally select the three-in-one or four-in-one view.
- Excel has an option to create a residual plot when doing the regression analysis.
- 00:04 Hi, I'm Ray Sheen.
- 00:05 I'd like to take a moment now and discuss residual analysis.
- 00:10 We can do residual analysis with many different types of hypothesis tests,
- 00:14 including linear regression.
- 00:16 So this is a good time to look at this aspect of our hypothesis testing and
- 00:20 analysis.
- 00:22 You're probably asking, what is a residual?
- 00:25 Let's use the regression example to illustrate this concept.
- 00:29 Recall that a regression analysis produces a best fit line for
- 00:32 the relationship between the independent variable and the dependent variable.
- 00:37 But many of the data points did not fall precisely on the line, but
- 00:41 were instead a little bit above or a little bit below the line.
- 00:46 A residual is the difference between the actual value of the data point and
- 00:50 the predicted value of that point based upon the regression model and
- 00:54 the value of the independent variable.
- 00:56 Each data point has a residual.
- 00:59 If the data point falls exactly on the line, the residual would be equal to zero.
- 01:03 In fact, the best fit is the line for which the sum of the squares of all
- 01:08 of the residuals is as small as possible.
- 01:12 An analysis of the residuals can be done to determine whether the regression line
- 01:16 was really the best fit.
- 01:17 For instance, a better fit might be a curved line,
- 01:20 which we will discuss in a later lesson, or the analysis might highlight for
- 01:25 us a special cause factor that is otherwise buried in the data.
- 01:30 Many of the hypothesis tests will calculate residuals.
- 01:33 These tests will all do something similar to what we have with the regression best
- 01:37 fit line.
- 01:38 They determine the best fit value or the relationship for
- 01:42 the statistic being used and compare the actual data to the best fit.
- 01:45 The differences are residuals.
- 01:48 Regardless of the hypothesis test, the residuals always take on the same form and
- 01:53 format.
- 01:53 So let's look at some of the forms and analysis associated with these.
- 01:59 First, let's go through a few assumptions about how to find and use residuals.
- 02:05 You can turn on a plot of the residuals to judge whether the hypothesis test
- 02:09 statistic is good.
- 02:11 Excel will only provide residual plots for regression analysis.
- 02:15 Minitab will provide residual plots for almost all of the hypothesis tests.
- 02:20 Just select the Graphs button in Minitab and then select which plots you want.
- 02:24 I normally just do the three-in-one or four-in-one
- 02:28 option that gives me all of the residual plots.
- 02:32 When your statistical analysis did a best fit analysis,
- 02:35 it also plotted the residuals.
- 02:37 In the residuals versus fits graph, the mean value is always zero for
- 02:42 the best fit since that is a property of the best fit.
- 02:46 It is an answer for which the sum of the residuals is zero.
- 02:50 This plot should be normally distributed with respect to the residual values.
- 02:55 That means there are about the same number of points above and below the mean, and
- 03:00 most of the data points are clustered near the center line and
- 03:03 fewer are out at the edges.
- 03:05 The residual values should be random,
- 03:07 which would mean that the standard deviation and variances are stable.
- 03:12 Also, there shouldn't be a pattern in the data points as plotted.
- 03:16 The residuals should just be random with respect to the dependent or
- 03:19 response variable.
- 03:21 Now, when these assumptions are not valid for the residual data as plotted,
- 03:25 then you need to revise your analysis.
- 03:27 So in the case of a regression analysis,
- 03:30 the revision could be to add another independent variable that may be the
- 03:34 underlying cause of the change in both the independent and dependent variables.
- 03:39 Another option is to try a curved or a non-linear line.
- 03:43 For instance, if the relationship is one of exponential decay,
- 03:48 the residuals will have a distinct pattern.
- 03:51 Another residual plot found in Minitab is the normal probability plot of the residuals.
- 03:56 The plot of the actual values of residuals should always have a mean of zero for
- 04:01 the best fit line.
- 04:02 However, that does not mean that the residual data is normal.
- 04:06 Minitab will check it for us.
- 04:08 We can use a plot for normality.
- 04:11 In this plot,
- 04:11 the residuals are graphed against a line representing the normal curve probability.
- 04:15 If the actual residual data points closely follow the normal curve line,
- 04:19 then we can say that the residuals are normally distributed.
- 04:23 If normality is not achieved, try adding higher order terms to the regression equation.
- 04:27 This will turn it into a nonlinear regression equation but
- 04:31 Minitab can handle that and we will discuss it more in another lesson.
- 04:36 Another perspective on residual analysis is to consider patterns in
- 04:40 the residual variation.
- 04:41 Minitab will create a time-wise plot of the residual values called the versus
- 04:46 order plot.
- 04:47 The sequence or time order is based upon the order of the data in the column.
- 04:52 If you suspect this may be a problem,
- 04:54 make sure the data in the column is in the order in which the process was operated.
- 04:59 The plot allows us to see if the residuals are changing over time,
- 05:04 indicating there's another effect at work.
- 05:06 Our example on this slide shows that we clearly have something else going on.
- 05:12 The values of the residuals are getting larger over time.
- 05:15 By the way, Minitab will not draw the blue lines that I added to show that
- 05:19 the funnel effect was opening up.
- 05:22 You'll just have to eyeball this one.
- 05:24 When you see this, it is almost always because there's some other term that needs
- 05:29 to be added to the analysis, turning this into a multilinear regression analysis.
- 05:35 If that doesn't work, try higher order terms, especially an exponential or
- 05:39 a logarithmic term.
- 05:40 Finally, we need to check for independence in the residuals.
- 05:44 That means looking at any of the plots for an obvious pattern that can be found.
- 05:49 Once again, I'll eyeball this effect.
- 05:51 On this chart, we can see some type of oscillation is occurring.
- 05:55 Every three or four points, the residuals change sign and increase in value.
- 06:00 It's like a teenager learning to drive and overcompensating and
- 06:04 bouncing off of one curb then the other.
- 06:06 In this case, I'm only using the plot versus order to check for
- 06:11 this pattern, but some patterns will be visible on the normality graph and
- 06:16 some on the basic plot versus fit graph.
- 06:19 So be sure to check all of them.
- 06:20 Again, that's why I normally do the 3 in 1 or 4 in 1 option for
- 06:25 residual graphs in Minitab.
- 06:27 Depending upon the effect you see, you can try another independent variable or
- 06:31 try a higher order variable.
- 06:33 I'll talk about both of these options in the next lesson.
- 06:37 Residual analysis is an excellent check for
- 06:40 the goodness of the fit of a regression curve.
- 06:43 Keep in mind many of the other hypothesis tests also have residual analysis.
- 06:48 And these same principles of analysis will apply to them.