Retired course
This course has been retired and is no longer supported.
About this lesson
A statistical analysis or test creates a mathematical model to fit the data in the sample. Real-world data seldom fits the model precisely. The differences between the actual data and the model's predictions are known as residuals. The residuals in any analysis, whether a regression analysis or another statistical analysis, indicate how well the statistical model fits the data. When the residuals indicate a bad fit, a different analytical approach should be selected. This lesson explains how to read residual graphs and analysis.
Quick reference
Residual Analysis
Residuals are the differences between the actual data values and the values predicted by the hypothesis test solution. Analyzing the residuals is a way of assessing the validity of the hypothesis test.
When to use
When the hypothesis test creates a formula or prediction of the data values, residuals can be calculated. Residuals are created for hypothesis tests that use regression analysis and ANOVA.
Instructions
Some hypothesis tests form a “best fit” equation to model the system performance based upon the data set. These “best fit” equations should closely approximate the real world, but the actual values will normally be slightly different. When creating the “best fit,” each actual value is compared to the predicted value, and the difference is a residual. The “best fit” solution is determined by a set of calculations on these residuals. For the usual least-squares fit (with an intercept term), the mean of the residuals is zero and the sum of the squared residuals is at a minimum.
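The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure Minitab or Excel uses internally: the data values and the helper names `fit_line` and `residuals` are hypothetical, and the fit assumed here is an ordinary least-squares straight line.

```python
# Hypothetical data: x is the independent variable, y the measured response.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]


def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Residual = actual value minus predicted value, one per data point."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(xs, ys)]


intercept, slope = fit_line(x, y)
res = residuals(x, y, intercept, slope)
sse = sum(r ** 2 for r in res)  # sum of squared residuals

print("slope:", round(slope, 4), "intercept:", round(intercept, 4))
print("mean of residuals:", round(sum(res) / len(res), 10))
print("sum of squared residuals:", round(sse, 4))
```

Shifting the intercept or slope away from the fitted values makes the sum of squared residuals larger, which is what "best fit" means in the least-squares sense.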
The residuals can be plotted, and a review of the plots will provide an assessment of whether the “best fit” is truly a good model for the data set. There are several things to consider when reviewing the residuals. The first is whether the residuals are normally distributed. A valid “best fit” should result in normally distributed residuals. That, of course, is characterized by a mean of zero. In addition, there are approximately the same number of points above and below the zero line, so the distribution is not skewed, and there is a central tendency to the data, meaning appropriate kurtosis. When plotted in a histogram, the residuals should have a bell-shaped curve. When plotted on a normal probability plot, the residuals should fall on the normal line or very near it.
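As a rough numerical counterpart to eyeballing the normal plot, the sketch below pairs each sorted residual with the normal quantile it would be plotted against and measures how straight the resulting points are. The residual values, the helper names, and the plotting positions `(i - 0.5) / n` are assumptions made for illustration; statistical packages may use slightly different conventions and a formal normality test.

```python
from statistics import NormalDist, mean


def normal_plot_points(res):
    """Pair each sorted residual with its theoretical normal quantile."""
    n = len(res)
    ordered = sorted(res)
    # Plotting positions (i - 0.5) / n are one common convention (assumed here).
    quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


def straightness(points):
    """Pearson correlation of the plot points; values near 1 suggest normality."""
    qs = [q for q, _ in points]
    rs = [r for _, r in points]
    mq, mr = mean(qs), mean(rs)
    cov = sum((q - mq) * (r - mr) for q, r in points)
    var_q = sum((q - mq) ** 2 for q in qs)
    var_r = sum((r - mr) ** 2 for r in rs)
    return cov / (var_q * var_r) ** 0.5


# Hypothetical residuals: one roughly symmetric set, one strongly skewed set.
symmetric = [-1.9, -1.1, -0.7, -0.3, -0.1, 0.1, 0.3, 0.7, 1.1, 1.9]
skewed = [-0.5, -0.5, -0.4, -0.4, -0.3, -0.3, -0.2, -0.1, 0.2, 2.5]

sym_r = straightness(normal_plot_points(symmetric))
skew_r = straightness(normal_plot_points(skewed))
print("symmetric set:", round(sym_r, 3))
print("skewed set:   ", round(skew_r, 3))
```

The symmetric set hugs the normal line much more closely than the skewed set, whose single large positive residual pulls the points away from a straight line.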
In addition, the value of the residuals should not depend upon the time-wise nature of the process. That means that neither the mean nor the absolute value of the residuals is time dependent. In the example shown below, the value is time dependent, which indicates that the “best fit” equation is missing a term that would capture this effect.
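One simple way to put a number on a time-dependent pattern is to compare the early and late halves of the residuals in run order. The sketch below is only an illustration: the residual values and the helper name `halves_summary` are hypothetical, and splitting the run in half is an assumed shortcut rather than a formal test.

```python
from statistics import mean, pstdev


def halves_summary(res_in_run_order):
    """Compare mean and spread of the early and late halves of the run."""
    mid = len(res_in_run_order) // 2
    early, late = res_in_run_order[:mid], res_in_run_order[mid:]
    return {
        "early_mean": mean(early),
        "late_mean": mean(late),
        "early_spread": pstdev(early),
        "late_spread": pstdev(late),
    }


# Hypothetical residuals listed in the order the process was run.
# The swings grow over time, which gives an opening-funnel shape.
funnel = [0.1, -0.1, 0.2, -0.2, 0.3, -0.4, 0.6, -0.7, 0.9, -1.0, 1.3, -1.0]

summary = halves_summary(funnel)
spread_ratio = summary["late_spread"] / summary["early_spread"]
print(summary)
print("late/early spread ratio:", round(spread_ratio, 2))
```

A ratio well above 1 means the residuals are getting larger as the run progresses, even though their mean stays close to zero in both halves.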
Finally, when considering a residual plot that is based upon either the order in which the residuals occurred or the value of the response variable, watch for patterns in the data. Again, a strong pattern is an indication that the “best fit” is missing something. The graph below illustrates this point.
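The lesson relies on eyeballing the plot for patterns. As a supplementary check that is not part of the lesson itself, the Durbin-Watson statistic is a standard measure of whether consecutive residuals are related: values near 2 suggest independence, while values well below 2 suggest that neighboring residuals move together. The residual values and helper names below are hypothetical.

```python
def durbin_watson(res_in_run_order):
    """Durbin-Watson statistic: near 2 suggests no lag-1 autocorrelation."""
    diffs = [b - a for a, b in zip(res_in_run_order, res_in_run_order[1:])]
    return sum(d ** 2 for d in diffs) / sum(r ** 2 for r in res_in_run_order)


def sign_changes(res_in_run_order):
    """Count how often consecutive residuals switch sign."""
    pairs = zip(res_in_run_order, res_in_run_order[1:])
    return sum(1 for a, b in pairs if a * b < 0)


# Hypothetical residuals in run order: a slow oscillation that changes
# sign every three points and grows in size.
oscillating = [0.2, 0.5, 0.4, -0.3, -0.6, -0.5, 0.4, 0.8, 0.6, -0.5, -0.9, -0.7]

dw = durbin_watson(oscillating)
changes = sign_changes(oscillating)
print("Durbin-Watson:", round(dw, 2))
print("sign changes:", changes, "out of", len(oscillating) - 1, "steps")
```

Random residuals would be expected to change sign on roughly half of the steps; far fewer changes, together with a low Durbin-Watson value, points to a pattern the model has not captured.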
When the residual analysis indicates a problem with the “best fit,” the solution is normally to switch to a multivariate solution or a non-linear solution. Both of those approaches introduce additional terms into the “best fit” equations that will account for the observed issues. These topics are discussed more in later lessons.
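To see how a non-linear term can remove a residual pattern, the sketch below fits a straight line to hypothetical exponential-decay data, then fits a line to the logarithm of the response instead. The data are generated without noise from an assumed model, y = 10·exp(-0.5·x), purely for illustration, and the log transform is just one possible way to introduce a non-linear term.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical, noise-free exponential decay: y = 10 * exp(-0.5 * x).
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [10.0 * math.exp(-0.5 * xi) for xi in x]

# Straight line fitted to the raw data: residuals show a curved pattern.
b0, b1 = fit_line(x, y)
linear_res = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Straight line fitted to ln(y): equivalent to an exponential model for y.
c0, c1 = fit_line(x, [math.log(yi) for yi in y])
exp_res = [yi - math.exp(c0 + c1 * xi) for xi, yi in zip(x, y)]

print("linear-fit residuals:", [round(r, 2) for r in linear_res])
print("exponential-fit residuals:", [round(r, 6) for r in exp_res])
```

The straight-line residuals are positive at both ends and negative in the middle, a distinct pattern, while the residuals from the exponential model are essentially zero for this idealized data.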
Hints & tips
- Minitab will create the residual plots when you select the Graphs button and choose which residual graphs to use. I normally select the three-in-one or four-in-one views.
- 00:04 Hi, I'm Ray Sheen, I'd like to introduce us now to the topic of residual analysis.
- 00:10 We can do residual analysis with many different types of hypothesis testing,
- 00:14 including linear regression.
- 00:16 So this is a good time to introduce this topic and
- 00:19 to understand this aspect of our hypothesis test analysis.
- 00:25 You're probably asking what is a residual?
- 00:28 Let's use the regression example to illustrate this concept.
- 00:31 Recall that the regression equation was the best fit equation for
- 00:34 the relationship between the independent variable and the dependent variable.
- 00:38 But many of the data points did not fall precisely on the line, but
- 00:41 were instead a bit above or below the line.
- 00:44 A residual is the difference between the actual value of your data point and
- 00:48 the predicted value of that point based upon the value of the independent variable.
- 00:53 Every data point has a residual.
- 00:55 If a data point is exactly on the line, its residual will be equal to zero.
- 00:59 In fact, the best fit line
- 01:01 is the line where the sum of all the squared residuals is as small as possible.
- 01:07 An analysis of the residuals can be done to determine whether the regression line
- 01:11 was really the best fit.
- 01:13 For instance a better fit might be a curved line
- 01:16 which we will discuss in a later lesson.
- 01:18 Or the analysis may highlight for
- 01:20 us a special cause factor that is otherwise buried in the data.
- 01:24 Many of the hypothesis tests will calculate residuals.
- 01:28 These tests do something similar to the regression best fit line.
- 01:32 They determine the best fit or relationship for
- 01:34 the statistic being used, and compare the actual data to that best fit.
- 01:39 The differences are residuals.
- 01:41 Regardless of the hypothesis test,
- 01:43 the residuals always take on the same form and format.
- 01:46 So let's look at some of the form and analysis associated with those.
- 01:50 First, let's go through a few assumptions about how to find and use the residuals.
- 01:55 You can turn on the plot of the residuals to judge whether the hypothesis test
- 01:59 statistic is good.
- 02:01 Excel will only provide a residual plot for the regression analysis,
- 02:05 Minitab will provide plots for almost all of the hypothesis tests.
- 02:08 Just select the Graphs button in Minitab, and then select which plot you want.
- 02:13 I normally just do the three-in-one, or four-in-one option, and
- 02:16 that gives me all of the residual plots.
- 02:19 When your statistical analysis did a best fit analysis, it can plot the residuals.
- 02:24 In the plot versus fit graph, the mean value is always zero for
- 02:27 a best fit plot, since that is the definition of the best fit.
- 02:31 It is the answer for which the sum of residuals is zero.
- 02:35 We hope that this plot will be normally distributed with respect to the residual
- 02:39 values.
- 02:40 That means that there are the same number of points above and below the mean, and
- 02:43 most of the data points are clustered near the center line, and
- 02:46 fewer are found at the edges.
- 02:48 The residual values should be random,
- 02:50 which would mean that the standard deviation and variance are stable.
- 02:54 Also, there shouldn't be a pattern in the data as plotted.
- 02:58 When these assumptions are not valid for
- 03:00 the residual data as plotted, then you need to revise your analysis.
- 03:05 So in the case of regression analysis, the revision could be to add another
- 03:09 independent variable that could be the underlying cause of the changes in both
- 03:13 the independent and dependent variables.
- 03:15 Another option is to try a curved, or non-linear plot.
- 03:19 For instance, if the relationship is one of exponential decay,
- 03:22 the residuals will have a distinct pattern.
- 03:25 A residual plot that's also found in Minitab is the normal
- 03:29 distribution of the residuals.
- 03:31 The plot of the actual values of the residuals should always have
- 03:34 a mean of zero.
- 03:35 However, that does not mean the residual set is normal; Minitab will check it for us.
- 03:40 We can use the plot of normality; in this plot, the residuals are graphed against
- 03:44 the line representing the normal curve probability.
- 03:48 If the actual residual value points closely follow the normal curve line,
- 03:52 then we can say that residuals are normally distributed.
- 03:55 If normality is not achieved, try additional higher order terms for
- 03:59 your regression equation.
- 04:01 That will turn it into a non-linear regression equation, but
- 04:04 Minitab can handle that, and we'll discuss it more in another lesson.
- 04:08 Another perspective on residual analysis is to consider the pattern in the residual
- 04:12 variation.
- 04:14 Minitab will create a time-wise plot of the residual values
- 04:18 called the versus order plot.
- 04:20 The sequence or time order is based upon the order of the data in the column.
- 04:24 If you suspect this is a problem, make sure your data is recorded in the column
- 04:29 in the order in which the process was operated.
- 04:32 This plot allows us to see if the residuals are changing over time,
- 04:35 indicating that there is another effect at work.
- 04:38 Our example on this slide shows that we clearly have something else going on.
- 04:43 The value of the residuals is getting larger over time.
- 04:46 By the way, Minitab will not draw the blue lines;
- 04:49 I added those to show that the funnel is opening up.
- 04:52 You will just have to eyeball this one.
- 04:54 When you see this, it is almost always because
- 04:57 there is some other term that needs to be added to the analysis.
- 05:00 Turn this into a multiple linear regression analysis.
- 05:04 If that doesn't work, try a higher order term, especially an exponential or
- 05:08 logarithm term.
- 05:10 Finally, we need to check for independence in the residuals;
- 05:13 that means looking at any of the plots where an obvious pattern can be found.
- 05:17 Once again, I will eyeball this effect.
- 05:19 On this chart, we see some type of oscillation: every 3 or
- 05:23 4 points the residual changes sign and increases in value.
- 05:27 It's like a teenager learning to drive and overcontrolling the vehicle.
- 05:32 In this case I'm using the plot versus order to check for this pattern.
- 05:36 But some patterns will be visible on the normality graph and
- 05:39 some on the basic plot versus fit graph, be sure to check all of them.
- 05:43 Again, that's why I do the three in one, or four in one option for residual graphs.
- 05:48 Depending upon the effect you see, you can try another independent variable or
- 05:52 try a higher order variable.
- 05:54 I will talk about both of those options in the next few lessons.
- 05:58 Residual analysis is an excellent check for
- 06:01 the goodness of the fit of the regression curve.
- 06:04 Now keep in mind many other hypothesis tests will also use residual analysis, and
- 06:09 these same principles will apply with each of them.
PMI, PMP, CAPM and PMBOK are registered marks of the Project Management Institute, Inc.