2012

STATISTICS ROUNDTABLE

The Reality of Residual Analysis

It's easy to overlook this important technique when evaluating an analysis model

by Roger W. Hoerl

In the world of statistics textbooks, independent random samples of size 30 from a normal distribution are a dime a dozen—the norm rather than the exception. You simply perform the correct analysis (perhaps the one you just studied in the most recent chapter), everything goes like clockwork, and you move on to the next problem.

Unfortunately, the real world stubbornly refuses to conform to this alternative reality of statistics textbooks. In the real world, you rarely find normal distributions, much less independence. In addition, the variable of interest is typically impacted by variables not in your data set, as unfair as this might sound.

Therefore, you don't have the luxury of simply doing the obvious correct analysis and moving on. Rather, to avoid the potholes in the road to statistical insight, you should always take a critical look at your own analysis to ensure you haven't missed something important.

Statistics, not futures

One way to do this—certainly not the only way—is through residual analysis. No, we're not talking about the futures market. These are statistical residuals, not financial residuals.

Residuals from any model are helpful in evaluating the adequacy of the model itself relative to the data and any assumptions you might make in the analysis. A residual is simply the difference between the observed value of y (the response variable of interest) and the value of y predicted by the model:

Residual = y observed - y predicted

There is one residual for each observation. Statistical software typically standardizes residuals to put them on a common scale. How this gets done goes beyond the scope of this article. Obviously, if you had a model that fit the data perfectly, the residuals would all be zero.
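As a rough illustration of the definition above, the sketch below fits a straight line to a small made-up data set and computes one residual per observation. The data values and the helper name fit_line are hypothetical, and the standardization shown (dividing by the residual standard deviation) is only one simple choice. Statistical packages may use more refined versions.

```python
# Hypothetical data: x is a predictor, y is the observed response.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]

# Residual = y observed - y predicted, one per observation.
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Simplified standardization (an assumption, not a package's exact method):
# divide by the residual standard deviation, using n - 2 degrees of freedom
# because the line has two estimated parameters.
n = len(y)
s = (sum(r ** 2 for r in residuals) / (n - 2)) ** 0.5
standardized = [r / s for r in residuals]

for yi, pi, ri, zi in zip(y, predicted, residuals, standardized):
    print(f"observed={yi:6.2f}  predicted={pi:6.2f}  "
          f"residual={ri:6.2f}  standardized={zi:6.2f}")
```

For a least-squares line with an intercept, the residuals sum to zero, which is a quick sanity check on the arithmetic.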

As noted earlier, analysis of the residuals is an effective method for assessing the fit of the model to the data and determining whether the model is useful. The recommended approach is to study a variety of residual plots and look for patterns and trends.

Generally speaking, the absence of any patterns and trends (for example, a random scatter of points) indicates the model is adequate. In other words, the residuals will be "boring"—with no noteworthy patterns or trends—because an adequate model accounts for the systematic (predictable) behavior in y. Therefore, what remains in the residuals will be just random variation or noise.

If the model is not adequate, however, such as a model that does not account for curvature in y, then the residuals will contain both random variation and curvature. So, counter to your intuition, you are hoping to see nothing in residual plots and are concerned about seeing something interesting.

This interesting pattern, however, might alert you to something important that you overlooked. In general, you should not consider the modeling process as complete until the residuals have been thoroughly evaluated.

When estimating a statistical model, the analyst often assumes the residuals are sampled from a common normal distribution with constant variance and no correlation between different residuals. These assumptions are not made in every analysis, but they are frequently made.

There are numerous residual tests and plots you can use to evaluate these assumptions. Here are four plots to consider:

  • Residuals versus predicted values.
  • Histogram of the residuals.
  • Residuals versus normal probability scale (normal probability chart).
  • Residuals versus observation sequence (run chart or individuals control chart).

These plots are not independent of each other and should be interpreted as a set. It is not uncommon for a problem or issue to show up in more than one plot. It is also important to recognize there is a lot of judgment used in the interpretation of residual plots. Your ability to properly interpret these plots grows with experience.

Residuals vs. predicted values

If the model gives an adequate fit to the data and the typical assumption of independent normally distributed residuals is satisfied, the plot of the residuals versus predicted values should be boring and show no pattern or trend.

When you observe a nonrandom pattern, you should consider changing the form of the model. One pattern commonly observed is an increasing variation in the residuals as the predicted values increase (V-shaped pattern or megaphone shape).

This indicates the assumption of equal variance of the residuals is not satisfied by the data. The V-shaped pattern suggests the variation increases with the average. In this case, a transformation, possibly a log transformation (for example, analyzing log y in place of y), or weighted regression should be used. As a rule of thumb, the log transformation is often helpful if the ratio of maximum y to minimum y is greater than about 10.
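The two checks in this section can be sketched numerically. The first compares the residual spread in the upper half of the predicted values with the spread in the lower half, and the second applies the max-to-min rule of thumb. The function names and data are hypothetical, and the half-and-half split is an assumption made for illustration, not a formal test.

```python
import math


def max_min_ratio(y):
    """Ratio of the largest to the smallest response (needs positive y)."""
    if min(y) <= 0:
        raise ValueError("ratio and log transformation need positive y values")
    return max(y) / min(y)


def spread_ratio(predicted, residuals):
    """Residual standard deviation for the upper half of the predicted
    values divided by that for the lower half. Values well above 1
    hint at a megaphone shape."""
    def sd(values):
        m = sum(values) / len(values)
        return (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5

    pairs = sorted(zip(predicted, residuals))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return sd(high) / sd(low)


# Hypothetical predicted values and residuals whose scatter grows
# with the size of the prediction.
predicted = [10, 20, 40, 80, 160, 320, 640, 1280]
residuals = [1, -2, 3, -7, 15, -30, 66, -120]
y = [p + r for p, r in zip(predicted, residuals)]

print("spread ratio (high/low):", round(spread_ratio(predicted, residuals), 1))
print("max y / min y:", round(max_min_ratio(y), 1))

if max_min_ratio(y) > 10:
    # Rule of thumb from the text: try modeling log y in place of y.
    log_y = [math.log(v) for v in y]
    print("log y:", [round(v, 2) for v in log_y])
```

Neither number replaces looking at the plot. They only put a rough figure on what the megaphone shape shows.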

A curved relationship between the residuals and predicted values suggests that curvilinear terms (for example, x²) should be included in the model. With several predictor (x) variables, it often helps to plot the residuals versus each x variable to pinpoint the source of the curvature. A transformation of y, such as the log transformation mentioned earlier, might help remove the curvature and the need for additional terms to be added to the model.

Histogram of residuals

A histogram provides a view of the overall distribution of the residuals. Do they appear bell-shaped as you often expect? Are there any outliers? Outlier residuals result when the model does not adequately fit one or more observations.

When outliers exist, they typically show up in several, if not all, of the residual plots. Outliers suggest the observation is due to a special cause (perhaps a measurement error or keying mistake) or the model is inadequate (missing variables and missing model terms).
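If you want a numeric companion to the histogram, one simple option is to flag residuals that sit far from the rest on a standardized scale. The cutoff of 3 standard deviations used below is an assumption chosen for illustration and the residual values are made up. A flagged point is a prompt to investigate, not proof of a special cause.

```python
from statistics import mean, stdev

# Hypothetical residuals: nineteen unremarkable values and one large one.
residuals = [0.4, -0.6, 0.1, -0.3, 0.8, -0.9, 0.2, -0.1, 0.5, -0.4,
             0.3, -0.7, 0.6, -0.2, 0.0, 0.7, -0.5, 0.1, -0.8, 6.0]

CUTOFF = 3.0  # assumed threshold, in standard deviations

center = mean(residuals)
spread = stdev(residuals)
scaled = [(r - center) / spread for r in residuals]

# Indices (0-based observation numbers) of residuals beyond the cutoff.
flagged = [i for i, z in enumerate(scaled) if abs(z) > CUTOFF]

for i in flagged:
    print(f"observation {i}: residual={residuals[i]:.2f}, scaled={scaled[i]:.2f}")
```

Note that with very small samples a single value cannot reach 3 on this scale no matter how extreme it is, so the plots remain the primary tool.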

Residuals vs. normal probability scale

The normal probability plot of the residuals is constructed to check on the normal distribution assumption of the residuals.

This technique is better to use than the histogram because the linearity pattern you are looking for is easier for people to perceive than a bell-shaped pattern. The histogram, however, can help you pick up other abnormalities, and it is still a useful plot.

In the normal probability plot, the residuals are ordered by size and plotted versus the normal probability scale, which is, fortunately, calculated by the computer. If the plot approximates a straight line, the assumption that the residuals follow a normal distribution is reasonably satisfied.

Keep in mind that no real data will ever follow a normal (or any other) distribution exactly. So you are not looking for a perfect line in this plot, just a general linear trend.
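The construction just described can be sketched in a few lines: order the residuals, pair them with standard normal quantiles and see how close the pairs come to a straight line. The plotting position (i - 0.5) / n is one common convention and is an assumption here, since software packages differ. The function names and both data sets are hypothetical.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Return (normal quantile, ordered residual) pairs for plotting."""
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()
    # Assumed plotting position: (i - 0.5) / n for i = 1..n.
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


def straightness(points):
    """Correlation between quantiles and ordered residuals.
    Values near 1 mean the plot is roughly a straight line."""
    n = len(points)
    q_bar = sum(q for q, _ in points) / n
    r_bar = sum(r for _, r in points) / n
    num = sum((q - q_bar) * (r - r_bar) for q, r in points)
    den = (sum((q - q_bar) ** 2 for q, _ in points)
           * sum((r - r_bar) ** 2 for _, r in points)) ** 0.5
    return num / den


# Hypothetical residuals: one roughly bell-shaped set, one strongly skewed.
roughly_normal = [-1.6, -1.0, -0.7, -0.4, -0.1, 0.1, 0.4, 0.7, 1.0, 1.6]
skewed = [0.1, 0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.8, 2.0, 9.0]

print("roughly normal:", round(straightness(normal_plot_points(roughly_normal)), 3))
print("skewed:        ", round(straightness(normal_plot_points(skewed)), 3))
```

The correlation is only a summary. As the text says, you are judging a general linear trend, so the plot itself should still be examined.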

The presence of a curvilinear relationship in the normal probability plot of the residuals suggests the normal distribution assumption is not satisfied. The curvature might be because the relationship between y and one or more of the x variables is not modeled properly.

For example, you might need to add an x² term to the model. In this situation, as noted earlier, it might also be appropriate to consider transformations for y and one or more of the predictor variables (x's). These plots are diagnostic. Again, considerable judgment is required to use them properly.
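To see what adding an x² term does to the residuals, the sketch below fits a straight line and then a quadratic to the same made-up curved data and compares the results. The data and the helper names solve, least_squares and residuals_for are hypothetical. The fit uses plain normal equations, which is adequate for a tiny example but not how production software usually does it.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (m[r][n] - tail) / m[r][r]
    return sol


def least_squares(columns, y):
    """Coefficients for y ~ sum(coef_j * column_j) via normal equations."""
    k = len(columns)
    xtx = [[sum(a * b for a, b in zip(columns[i], columns[j]))
            for j in range(k)] for i in range(k)]
    xty = [sum(a * yi for a, yi in zip(columns[i], y)) for i in range(k)]
    return solve(xtx, xty)


def residuals_for(columns, y):
    """Observed minus fitted values for the given model columns."""
    coefs = least_squares(columns, y)
    fitted = [sum(c * col[i] for c, col in zip(coefs, columns))
              for i in range(len(y))]
    return [yi - fi for yi, fi in zip(y, fitted)]


# Hypothetical curved data: a quadratic trend plus small fixed "noise".
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
noise = [0.1, -0.2, 0.15, -0.1, 0.05, 0.2, -0.15, 0.1, -0.05]
y = [1.0 + 0.5 * xi + 0.3 * xi ** 2 + e for xi, e in zip(x, noise)]

ones = [1.0] * len(x)
linear_resid = residuals_for([ones, x], y)
quad_resid = residuals_for([ones, x, [xi ** 2 for xi in x]], y)

print("line only :", [round(r, 2) for r in linear_resid])
print("with x^2  :", [round(r, 2) for r in quad_resid])
```

With the line alone, the residuals are positive at both ends and negative in the middle, which is the curved pattern described above. With the x² term included, they shrink to roughly the size of the noise.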

Residuals vs. observation sequence

This plot of residuals versus time sequence should show no trends if the model is adequate and, if you are using the individuals control chart, no special causes. A trend suggests that one or more variables not currently in the model changed during the time spanned by the data, and that adding those variables would improve the model's predictions.
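Both checks in this section have simple numeric counterparts. The sketch below computes individuals control chart limits from the average moving range (using the standard constant 2.66, which is 3 divided by 1.128) and a least-squares slope of the residuals against observation order. The residual values and function names are hypothetical.

```python
def individuals_limits(values):
    """Lower limit, center line and upper limit for an individuals chart,
    based on the average moving range of consecutive points."""
    center = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    # 2.66 = 3 / 1.128, the usual constant for moving ranges of size 2.
    return center - 2.66 * mr_bar, center, center + 2.66 * mr_bar


def sequence_slope(values):
    """Least-squares slope of the values against observation order."""
    n = len(values)
    t = list(range(n))
    t_bar = sum(t) / n
    v_bar = sum(values) / n
    num = sum((ti - t_bar) * (vi - v_bar) for ti, vi in zip(t, values))
    den = sum((ti - t_bar) ** 2 for ti in t)
    return num / den


# Hypothetical residuals in time order, drifting upward.
residuals = [-1.2, -0.9, -1.1, -0.6, -0.4, -0.5, 0.0,
             0.2, 0.1, 0.6, 0.8, 0.7, 1.1, 1.2]

lower, center, upper = individuals_limits(residuals)
signals = [i for i, r in enumerate(residuals) if r < lower or r > upper]
slope = sequence_slope(residuals)

print(f"limits: {lower:.2f} to {upper:.2f} (center {center:.2f})")
print("points outside limits:", signals)
print(f"slope per observation: {slope:.3f}")
```

A steady drift like this one shows up both as a nonzero slope and as points outside the limits at the start and end of the sequence, which is the kind of signal that suggests a missing variable.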

Table 1 summarizes some patterns and trends found in residual plots, along with some likely problem sources.

Table 1: Residual plots, patterns and trends

An example

Figures 1 and 2 are sets of residual plots using Minitab software.¹ They come from a regression analysis of a financial accounting variable: goodwill.

In Figure 1, you see a number of problems, including outlier residuals in the individuals (I) chart, curvature in the normal plot, a very peaked histogram and less variation at low values of y than at high values of y (residuals versus predicted values or fits).

Figure 1: Minitab residual plots for goodwill

Based on this residual analysis, the analysts tried a log transformation of y and one x variable, and repeated the analysis, producing the residual plots in Figure 2. You can see these look much more random and boring, although there is still one outlier.

Figure 2: Minitab residual plots for log goodwill

Rarely do we find perfect residuals in real applications.

Final warning

Some Six Sigma practitioners with limited backgrounds in theoretical statistics have recommended doing normality or other tests of the y variable before modeling. This is not good practice, as the theoretical assumptions are almost always for the residuals, not the original y variable. The y variable will generally contain both random and systematic variation.
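A small simulation can make this point concrete. In the sketch below, the random variation is normal by construction, but x takes two widely separated values, so y itself is strongly bimodal. The data-generating model, the seed and the function names are all assumptions made for illustration. The normal-plot correlation used as a yardstick is a rough summary, not a formal normality test.

```python
import random
from statistics import NormalDist


def normal_plot_correlation(values):
    """Correlation between ordered values and standard normal quantiles.
    Values near 1 mean the normal probability plot is close to straight."""
    ordered = sorted(values)
    n = len(ordered)
    z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    z_bar = sum(z) / n
    v_bar = sum(ordered) / n
    num = sum((a - z_bar) * (b - v_bar) for a, b in zip(z, ordered))
    den = (sum((a - z_bar) ** 2 for a in z)
           * sum((b - v_bar) ** 2 for b in ordered)) ** 0.5
    return num / den


rng = random.Random(2012)  # fixed seed so the sketch is repeatable

# Assumed model: y = 5 + 3x + normal noise, with x at two separated levels.
x = [0.0] * 100 + [10.0] * 100
y = [5.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Fit a straight line by least squares and compute the residuals.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print("normal-plot correlation of y:        ",
      round(normal_plot_correlation(y), 3))
print("normal-plot correlation of residuals:",
      round(normal_plot_correlation(residuals), 3))
```

Here y looks far from normal because it carries the systematic effect of x, while the residuals, with that effect removed, look fine. A normality check on y alone would have raised a false alarm.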


Reference

  1. Roger W. Hoerl and Ronald D. Snee, Statistical Thinking: Improving Business Performance, Duxbury Press, 2002.

Bibliography

Box, G.E.P., W.G. Hunter and J.S. Hunter, Statistics for Experimenters, Wiley Interscience, John Wiley & Sons, Inc., 1978.


ROGER W. HOERL is manager of General Electric's global research applied statistics lab. He has a doctorate in applied statistics from the University of Delaware. He is an ASQ fellow, an ASQ Brumbaugh Award recipient and a GE Global Research College fellow.