Why Statisticians Model Data

How this intrinsic behavior helps build process knowledge

by Lynne B. Hare

What follows is an apology. Not an apology as in "I’m sorry," but an apology as a justification or explanation.

You might be thinking it shouldn’t be necessary to explain why statisticians model data. It is just their nature. Why do dogs sniff bushes? Why do cats rub up against legs? Why do bears … well, you get the idea.

The statistician’s behavior may have been learned and then embedded during graduate school and throughout his or her subsequent career—you’ll get no argument from me there. But it is not nearly as mindless or knee-jerk as you might think.

There are at least two main reasons for data modeling behavior. The more obvious is to learn about variables or factors that influence responses of interest.

Take as examples the decay of active drug ingredients and the efficacy of certain surfactants in body washes. In the first example, the goal is to learn the effects of time, storage conditions and packaging materials on stability.

In the second, the goal is to determine which surfactants, acting alone or in combination and at various proportions of the mix, will be responsible for the greatest efficacy.

Neither of these goals can be met without modeling data. Both goals are important to the businesses involved, and no one is better positioned to meet them than statisticians.

But a second reason for data modeling behavior is as important as the first, and it is applicable to all types of processes. A key element of process improvement is the quantification of process capability. I’m not talking about your father’s Cp or Cpk, but rather about getting down to the inherent, intrinsic variation due to the interface between equipment and raw materials in manufacturing or due to common causes of variation in the service sector.

Modeling can help strip away most—if not all—of the assignable causes, leaving a residual variation that represents capability. After capability is estimated, anything that departs from it should be met with intolerance because it gets in the way of performance at its best.

Modeling benefits

Here is an example that does both. It points to potential causal relationships, and it whittles down the total variation to unveil the inherent, intrinsic variation. 

There is an inline homogenization device—with multiple input and output streams—which mixes ingredients (homogeneously) into hand cream. Some really capable engineers have gotten hold of it as evidenced by its ability to output the amount passed through the streams in millisecond intervals. That’s fast, and it is tempting to turn this monster on and see what it does.

But a statistical red flag goes up: Do we really need data in millisecond intervals, and wouldn’t they be highly correlated if we had them? That is, wouldn’t high amounts be followed by other high amounts, and wouldn’t lows be followed by lows?

A local engineer mentioned the reporting interval could be scaled back to one second. As a result, a one-second interval was chosen instead of the default interval, even though the same kind of correlation, called autocorrelation (see note 1), would be expected (although to a lesser extent).

Throwing caution to the wind, we started the homogenizer and let it run until an expert said it had reached equilibrium. I’m not sure what that means, but he had a wrench in his hand, so I didn’t ask. We let it run for a little more than 10 minutes, output the homogenizer’s data into a laptop and plotted the data (see note 2) in a simple run chart, as shown in Figure 1.

Figure 1

It seems pretty clear that the amounts delivered by the homogenizer are not stable over time. For example, look at the final group of observations beginning at about the 475th observation. Why are they so much more scattered than the earlier observations? We don’t know, of course, but clearly something happened. An increase in variation such as that could not have happened by chance alone.

If we had failed to plot the data and instead had taken them on blind faith (heaven forbid), we would have calculated a standard deviation of 3.02. Eliminating the observations beyond the 475th brings the standard deviation down to 2.14. That’s a considerable step toward getting to the inherent, intrinsic variation. We could eliminate other segments of the data as well, but the more we do that, the more we risk making arbitrary decisions.
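The comparison above can be sketched in a few lines. The homogenizer data are not reproduced in this column, so the series below is a synthetic stand-in (an assumption): 475 steady readings followed by a noisier final stretch. The numbers it prints will not match 3.02 and 2.14; only the pattern is meant to carry over.

```python
import random
import statistics

# Synthetic stand-in for the homogenizer output (NOT the article's data):
# 475 stable one-second readings, then a more scattered final segment.
random.seed(1)
stable = [random.gauss(100.0, 2.0) for _ in range(475)]
noisy = [random.gauss(100.0, 6.0) for _ in range(150)]
series = stable + noisy

# Standard deviation taken on blind faith, using every observation.
sd_all = statistics.stdev(series)

# Standard deviation after dropping everything beyond the 475th observation.
sd_trimmed = statistics.stdev(series[:475])

print(f"SD, all observations: {sd_all:.2f}")
print(f"SD, first 475 only:   {sd_trimmed:.2f}")
```

Plotting the series first is still what tells you where to cut; the arithmetic only quantifies what the run chart already showed.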

Autoregressive model

Digging a little deeper, we can examine the autocorrelations representing the data up to observation 475. The correlogram is shown in Figure 2. Notice a few high peaks at the early lags, then a damping oscillation. For those in the know about time-series modeling, that is a tipoff about the kind of modeling that should be done.

Figure 2
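For readers who want to build a correlogram by hand, here is a minimal sketch of the shifted-column calculation described in note 1. The series is synthetic (an assumption, since the homogenizer data are not available here), and the helper names `pearson` and `autocorrelation` are made up for illustration.

```python
import math
import random

def pearson(x, y):
    """Ordinary correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def autocorrelation(series, lag):
    """Correlate the series with a copy of itself shifted down `lag` rows.

    Rows where either column would be blank are dropped, as in note 1.
    """
    return pearson(series[lag:], series[:-lag])

# Synthetic series in which each value leans on the one before it.
random.seed(2)
series = [0.0]
for _ in range(499):
    series.append(0.7 * series[-1] + random.gauss(0.0, 1.0))

# Correlogram: autocorrelation at lags 1 through 10.
correlogram = [autocorrelation(series, lag) for lag in range(1, 11)]
for lag, r in enumerate(correlogram, start=1):
    print(f"lag {lag:2d}: {r:+.2f}")
```

Statistical packages usually compute autocorrelations with a slightly different formula (one overall mean and variance for the whole series), so their values can differ a little from this column-shifting version, especially at long lags.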

The message is that we should look at a model of the form:

X_t = c_0 + c_1 X_{t-1} + c_2 X_{t-2} + … + ε_t

in which:

  • X_t is the observed value at time t; the model’s prediction for it is the right-hand side without the error term.
  • X_{t-1}, X_{t-2}, … are observed values at times t - 1, t - 2 and so on.
  • c_0 is a constant tied to the mean of the series (for a stable series, the mean equals c_0 divided by 1 - c_1 - c_2 - …).
  • c_1, c_2, … are coefficients estimated from the data.
  • ε_t is the error at time t (the difference between the observed and the predicted values).

This is called an autoregressive model because it regresses observations on their predecessors. When the model is fit to the data, the autoregressive coefficients up to and including the fourth order (lag) are found to be statistically significant, while those above that order are not, as shown in Table 1.

Table 1
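A table of this kind can be produced by ordinary least squares: regress each observation on a constant and its predecessors, then compare each coefficient with its standard error. The sketch below is one way to do that with nothing but the Python standard library. It is not the JMP analysis used for this column, the data are synthetic with a true "memory" of two lags (an assumption), and `invert` and `fit_ar` are hypothetical helper names.

```python
import math
import random

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_ar(series, order):
    """Least-squares fit of X_t on a constant and its `order` predecessors.

    Returns (coefficients, standard errors, residual SD); coefficients[0] is c_0.
    """
    rows = [[1.0] + [series[t - k] for k in range(1, order + 1)]
            for t in range(order, len(series))]
    y = series[order:]
    m = order + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(m)]
    inv = invert(xtx)
    coef = [sum(inv[i][j] * xty[j] for j in range(m)) for i in range(m)]
    resid = [v - sum(c * x for c, x in zip(coef, r)) for r, v in zip(rows, y)]
    s2 = sum(e * e for e in resid) / (len(y) - m)
    se = [math.sqrt(s2 * inv[i][i]) for i in range(m)]
    return coef, se, math.sqrt(s2)

# Synthetic series whose true memory is two lags (NOT the homogenizer data).
random.seed(3)
series = [7.0, 7.0]
for _ in range(998):
    series.append(5.0 + 0.6 * series[-1] - 0.3 * series[-2]
                  + random.gauss(0.0, 1.0))

# Deliberately fit more lags than needed, then see which ones matter.
coef, se, resid_sd = fit_ar(series, order=4)
t_ratio = [c / s for c, s in zip(coef, se)]
for k in range(1, 5):
    print(f"lag {k}: coefficient {coef[k]:+.3f}, t ratio {t_ratio[k]:+.1f}")
print(f"residual SD: {resid_sd:.2f}")
```

A common rule of thumb treats a t ratio beyond roughly 2 in absolute value as statistically significant. In a run like this one, the first two lags should stand out clearly, while the extra lags usually do not.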

The suggestion is that the homogenizer’s "memory" lasts about four seconds. That information gives rise to the question, "What part of the process upstream of the homogenizer might cause such memory?" Is there a pump with a pulse of about four seconds? A mixer with four components to it? We don’t know, but we have some clues to fuel detective work.

Another key finding from this time-series modeling effort is that the residual standard deviation is 1.90. This estimate of inherent, intrinsic capability is down from the passive, initial estimate of 3.02 and from the reduced estimate of 2.14 obtained by eliminating the observations that followed an obvious process change. It points to an opportunity for improvement of a process initially thought to be stable.
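To see why the residual standard deviation comes out smaller than the raw one, consider a minimal sketch. For brevity it fits only a first-order model (one lag) rather than the fourth-order model in Table 1, and it uses a synthetic series (an assumption). The part of the variation that each observation inherits from its predecessor is explained by the model; what remains is the residual.

```python
import math
import random
import statistics

# Synthetic autocorrelated series (NOT the homogenizer data).
random.seed(4)
series = [0.0]
for _ in range(599):
    series.append(0.7 * series[-1] + random.gauss(0.0, 1.0))

# Regress each observation (y) on its immediate predecessor (x).
x = series[:-1]
y = series[1:]
mean_x = statistics.fmean(x)
mean_y = statistics.fmean(y)
sxx = sum((a - mean_x) ** 2 for a in x)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals: observed minus predicted. Two parameters were estimated,
# so the residual SD uses n - 2 degrees of freedom.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
resid_sd = math.sqrt(sum(e * e for e in residuals) / (len(y) - 2))

raw_sd = statistics.stdev(series)
print(f"raw SD:      {raw_sd:.2f}")
print(f"residual SD: {resid_sd:.2f}")
```

The gap between the two numbers is the variation attributable to the series' memory, which is the portion a process owner can go hunting for upstream.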

The point is that modeling helps identify causality, determine underlying process capability or both. Any way you look at it, it builds process knowledge.

And that’s what will make you smarter than your average bear.


  1. To calculate an autocorrelation, write down the output numbers in one column, then rewrite them in a second column, but shifted down one row, leaving a blank in the first row of the second column. If you calculate the correlation coefficient of the numbers in the first and second columns, ignoring blanks, you will have the autocorrelation of lag 1. Do it again, but move the new column down two rows, and you will have an autocorrelation of lag 2. You can continue this for many lags, assuming your data set is long enough. If you plot the value of the correlation on the y axis against the lag on the x axis, you have a correlogram.
  2. Refer to Lynne B. Hare’s first law of data analysis: "Always, always, always, without exception plot the data—and look at the plot." But you know that already.


Thanks to JMP for use of its software and Keith Eberhardt and Mark Vandeven for careful reading and tactfully placed suggestions.

Lynne B. Hare is a statistical consultant. He holds a doctorate in statistics from Rutgers University in New Brunswick, NJ. Hare is past chairman of the ASQ Statistics Division and a fellow of both ASQ and the American Statistical Association.
