Avoiding the pitfalls of automated systems
by Lynne B. Hare
Much ink and training time have been devoted to measurement systems analysis (MSA)—the adopted child of Six Sigma—and other statistical training programs that continue to add great value through improvements in quality and productivity.
MSA embraces sample-to-sample (or incident-to-incident) variation together with accuracy and precision. Precision, in turn, is often thought to consist of repeatability and reproducibility. Repeatability is the variation among repeated measures of the same sample by the same operator. Reproducibility is the variation brought about by operator-to-operator differences and by operator-by-sample interaction, that is, the failure of differences among samples to be the same from operator to operator. See Figure 1.
Most impactful quality education programs cover the precision portion of MSA by teaching gauge repeatability and reproducibility, and several statistical software packages contain special point-and-click features for combined tabular and graphical analyses of data stemming from these studies. Carried out properly, they provide great value by isolating and quantifying major sources of process variation.
The portion of MSA that seems to get less attention is accuracy. This could be because it’s easy to assume that an instrument is accurate, because it seems best to take the equipment vendor at its word, because assessing accuracy is difficult and time consuming, or for a variety of other reasons.
A new reason, offered to me during a consulting session, was that the equipment contained a black box routine for self-calibration, meaning one whose computer algorithm is unknown to the user. "What could possibly go wrong?" I thought, so I asked to take a closer look. As part of its morning exercises, the equipment’s computer requests submission of its "knowns" (K) at three levels. It provides assessments (A) for each of the three. Table 1 shows an example. Next, it calculates a regression equation relating the assessments to the knowns.
With these data come the slope and intercept estimates of the least squares fit together with a correlation coefficient. The slope is 0.9040, the intercept is 0.0211 and the correlation coefficient is 0.9997. The equation relating assessment to known is:
A = 0.0211 + 0.9040K (equation one).
In addition, we get predictions at each known observation, the difference between the predicted and the observed, and that difference expressed as a percentage of the corresponding known. As a bonus, there is a printer plot showing each of the three plotted observations.
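To make the black box output less mysterious, here is a minimal sketch of the same kind of calculation: an ordinary least squares fit of assessment on known, the correlation coefficient, and a per-point report. The three data pairs are hypothetical stand-ins, not the values in Table 1, and the sign convention for the difference (predicted minus observed) is an assumption, since the software's convention is not stated.

```python
from math import sqrt

def fit_line(knowns, assessments):
    """Least squares fit of assessment on known: A = intercept + slope * K."""
    n = len(knowns)
    k_mean = sum(knowns) / n
    a_mean = sum(assessments) / n
    s_kk = sum((k - k_mean) ** 2 for k in knowns)
    s_aa = sum((a - a_mean) ** 2 for a in assessments)
    s_ka = sum((k - k_mean) * (a - a_mean) for k, a in zip(knowns, assessments))
    slope = s_ka / s_kk
    intercept = a_mean - slope * k_mean
    r = s_ka / sqrt(s_kk * s_aa)  # Pearson correlation coefficient
    return intercept, slope, r

def calibration_report(knowns, assessments):
    """Return the fit plus, for each known, the prediction, the difference
    (predicted minus observed; an assumed convention) and that difference
    as a percentage of the known."""
    intercept, slope, r = fit_line(knowns, assessments)
    rows = []
    for k, a in zip(knowns, assessments):
        predicted = intercept + slope * k
        diff = predicted - a
        rows.append((k, a, predicted, diff, 100.0 * diff / k))
    return intercept, slope, r, rows

# Hypothetical knowns and assessments, NOT the article's Table 1.
knowns = [0.10, 0.30, 0.50]
assessments = [0.11, 0.30, 0.47]

intercept, slope, r, rows = calibration_report(knowns, assessments)
print(f"A = {intercept:.4f} + {slope:.4f}K, r = {r:.4f}")
for k, a, predicted, diff, pct in rows:
    print(f"K={k:.2f}  A={a:.2f}  predicted={predicted:.4f}  "
          f"diff={diff:+.4f}  ({pct:+.2f}% of known)")
```

Nothing here requires a black box; any spreadsheet or statistics package will give the same three estimates.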
The default is "linear/linear," presumably meaning linear in both K and A, with the suggestion that other transformations are available to the user. You are also told that this is a linear regression data reduction, so maybe other model forms are available as well. You would have to read the manual to find out. Does anyone read the manual anymore?
You can check the calculations, too. When I did, I agreed closely with the software’s estimates of intercept, slope and correlation, all assuming a linear model. See Table 2. Incidentally, I was delighted to see the plot of the data. It agrees with my advice: Always, always, always, without exception, plot the data and look at the plot.
The question is how best to use the plotted data and the slope, intercept and correlation estimates to tell if you should be satisfied with the accuracy of the device. Granted, from the plot, the data appear to follow a straight line, and the fact that the slope is close to one suggests a sensitivity of assessment to known, while the intercept being near zero conforms to intuition about the relationship. But how far from one must the slope be, and how far from zero must the intercept be, to cause concern?
Perhaps the software author tried to help answer these questions by listing the correlation coefficient. The Pearson correlation coefficient measures proximity to linearity, so the closer to one, the greater the assurance of no curvature. But again, where do you draw the line? If the correlation coefficient were 0.90 instead of 0.9997, should we worry about curvature?
Some assistance may come from examining confidence intervals about the estimates.
As can be seen from Table 2, confidence bounds about the intercept and slope estimates appear to support the anticipated values of zero and one, respectively. But you are unable to produce a confidence interval about the correlation coefficient with only three observations.
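As a rough illustration of how such confidence bounds can be computed, the sketch below uses the textbook standard errors for a straight-line fit with n - 2 degrees of freedom. The data are hypothetical (not Table 1), and the t critical values are taken from a standard table rather than computed. Whether the article's Table 2 was produced this way is an assumption.

```python
from math import sqrt

# Two-sided 95% Student t critical values, keyed by degrees of freedom
# (standard table values; only the entries needed here).
T_975 = {1: 12.706, 8: 2.306}

def coefficient_intervals(knowns, assessments, t_crit):
    """95% confidence intervals for intercept and slope of A = b0 + b1*K."""
    n = len(knowns)
    k_mean = sum(knowns) / n
    a_mean = sum(assessments) / n
    s_kk = sum((k - k_mean) ** 2 for k in knowns)
    s_ka = sum((k - k_mean) * (a - a_mean) for k, a in zip(knowns, assessments))
    slope = s_ka / s_kk
    intercept = a_mean - slope * k_mean
    sse = sum((a - (intercept + slope * k)) ** 2
              for k, a in zip(knowns, assessments))
    s = sqrt(sse / (n - 2))  # residual standard deviation
    se_slope = s / sqrt(s_kk)
    se_intercept = s * sqrt(1.0 / n + k_mean ** 2 / s_kk)
    return {
        "intercept": (intercept - t_crit * se_intercept,
                      intercept + t_crit * se_intercept),
        "slope": (slope - t_crit * se_slope, slope + t_crit * se_slope),
    }

# Hypothetical knowns and assessments, NOT the article's Table 1.
knowns = [0.10, 0.30, 0.50]
assessments = [0.11, 0.30, 0.47]

ci = coefficient_intervals(knowns, assessments, T_975[len(knowns) - 2])
int_lo, int_hi = ci["intercept"]
slope_lo, slope_hi = ci["slope"]
print(f"intercept: ({int_lo:.4f}, {int_hi:.4f})  contains 0: {int_lo <= 0 <= int_hi}")
print(f"slope:     ({slope_lo:.4f}, {slope_hi:.4f})  contains 1: {slope_lo <= 1 <= slope_hi}")
```

With only three points there is a single degree of freedom for error, so the t multiplier is 12.706 and the intervals are very wide. Containing zero and one is weak evidence in that situation.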
The likely intent of the black box output is for the analyst to use the regression line, equation one, in reverse to arrive at the "calibrated" or adjusted known value from the assessed value. Indeed, in this case, this was the practice. The reverse equation is:
K = (A - 0.0211) / 0.9040 (equation two).
How good is the estimate of the known, given the assessment? Inverse prediction shows that the 95% confidence limits for the known—given an assessed value of 0.2, for example—are 0.0678 and 0.3170. This interval occupies almost 50% of the known data range, so I wonder about the worth of the estimate to the analyst and the ultimate customer of the result.
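For readers who want to try inverse prediction themselves, here is a minimal sketch. The point estimate uses equation two directly. The interval uses a common approximation for calibration problems, half-width = t × (s / |b1|) × sqrt(1 + 1/n + (K̂ − K̄)² / S_KK), which is an assumption on my part and not necessarily the method behind the limits quoted above. The data are hypothetical, so the printed limits will not match the article's.

```python
from math import sqrt

def inverse_prediction(knowns, assessments, a0, t_crit):
    """Estimate the known for a new assessment a0, with approximate limits.

    Uses the common approximation
        half-width = t * (s / |slope|) * sqrt(1 + 1/n + (k0 - k_mean)^2 / S_kk),
    which is reasonable when the slope is well determined.
    """
    n = len(knowns)
    k_mean = sum(knowns) / n
    a_mean = sum(assessments) / n
    s_kk = sum((k - k_mean) ** 2 for k in knowns)
    s_ka = sum((k - k_mean) * (a - a_mean) for k, a in zip(knowns, assessments))
    slope = s_ka / s_kk
    intercept = a_mean - slope * k_mean
    sse = sum((a - (intercept + slope * k)) ** 2
              for k, a in zip(knowns, assessments))
    s = sqrt(sse / (n - 2))  # residual standard deviation
    k0 = (a0 - intercept) / slope  # same form as equation two
    half = t_crit * (s / abs(slope)) * sqrt(1 + 1 / n + (k0 - k_mean) ** 2 / s_kk)
    return k0, k0 - half, k0 + half

# Point estimate from equation two, using the coefficients quoted in the text.
k_from_eq2 = (0.2 - 0.0211) / 0.9040
print(f"equation two at A = 0.2: K = {k_from_eq2:.4f}")

# Hypothetical knowns and assessments, NOT the article's Table 1.
knowns = [0.10, 0.30, 0.50]
assessments = [0.11, 0.30, 0.47]
T_975_DF1 = 12.706  # two-sided 95% t critical value, 1 degree of freedom

k0, lower, upper = inverse_prediction(knowns, assessments, 0.2, T_975_DF1)
data_range = max(knowns) - min(knowns)
print(f"K = {k0:.4f}, approximate 95% limits ({lower:.4f}, {upper:.4f})")
print(f"interval width as a share of the known range: {(upper - lower) / data_range:.0%}")
```

The last line mirrors the comparison made above: express the interval width as a share of the range of knowns, and ask whether an estimate that loose is worth reporting.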
What can be done to lower the range of uncertainty? Run more knowns. It turns out that if 10 knowns were submitted, say from 0.05 to 0.50 in steps of 0.05, and if the same relationship as in equation one were found, the range of uncertainty would be narrowed to roughly 8% of the data range.
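One way to see why more knowns help so much is to look at the multiplier that sits in front of the residual standard deviation in an approximate inverse prediction interval: t × sqrt(1 + 1/n + (K0 − K̄)² / S_KK). The sketch below compares that multiplier for a three-level design and for the ten-level design described above. The three levels used here are hypothetical, the interval formula is an assumed approximation, and the comparison presumes the residual standard deviation and slope stay the same.

```python
from math import sqrt

def design_factor(knowns, k0, t_crit):
    """Multiplier t * sqrt(1 + 1/n + (k0 - k_mean)^2 / S_kk) for a design.

    The approximate half-width of the inverse prediction interval at k0 is
    this factor times (residual standard deviation / |slope|).
    """
    n = len(knowns)
    k_mean = sum(knowns) / n
    s_kk = sum((k - k_mean) ** 2 for k in knowns)
    return t_crit * sqrt(1 + 1 / n + (k0 - k_mean) ** 2 / s_kk)

# Two-sided 95% t critical values from a standard table.
T_975_DF1 = 12.706  # 3 knowns -> 1 degree of freedom
T_975_DF8 = 2.306   # 10 knowns -> 8 degrees of freedom

three_knowns = [0.10, 0.30, 0.50]                # hypothetical levels
ten_knowns = [0.05 * i for i in range(1, 11)]    # 0.05 to 0.50 in steps of 0.05

k0 = 0.2  # known value at which the interval is evaluated
factor_3 = design_factor(three_knowns, k0, T_975_DF1)
factor_10 = design_factor(ten_knowns, k0, T_975_DF8)
ratio = factor_10 / factor_3

print(f"factor with 3 knowns:  {factor_3:.3f}")
print(f"factor with 10 knowns: {factor_10:.3f}")
print(f"ratio (10 vs. 3):      {ratio:.3f}")
```

Under these assumptions the multiplier shrinks by roughly a factor of six. Most of the gain comes from the t critical value, which drops sharply once there are more than a couple of degrees of freedom for error.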
A better approach
But wait. We’ve gotten way ahead of ourselves. Look back at Table 2 and Figure 2. Is there anything to persuade us that the intercept and slope of the regression of assessment on known are different from zero and one, respectively? No. Then why should you adjust the assessment values you read? There is nothing to suggest anything unusual in the direct readings. Therefore, the inverse regression calculations are unnecessary. Moreover, their use results in increased variation of the outcome.
How might you do better to test and assure accuracy? For starters, it may not be necessary to calibrate with every analytical run. You should experiment to learn how long the equipment holds its calibration in the face of repeated use. Secondly, you should give yourself a fighting chance at finding departures from expectation by running about 10 knowns, instead of only three, that span the range of routine operation. Examine the slope and intercept together with their respective confidence intervals. It is OK to look at the correlation coefficient to get a rough idea of linearity, but you might be better off fitting a second order model—after plotting the data, of course—to learn if there is significant curvature in the relationship between assessment and known.
Table 3 shows how the expanded data set might appear, and Figure 3 shows a plot of the data.
The plot pleads for a linear fit, but just to satisfy your curiosity, you can fit a second order term as in the model:
A = b₀ + b₁K + b₂K² (equation three),
in which:
b₀ represents the intercept.
b₁ represents the linear slope.
b₂ represents the second order coefficient, which is a measure of curvature.
Relevant statistics are shown in Table 4.
Notice that in the second order model, the K² term’s 95% confidence interval contains zero. This suggests there is no real evidence of curvature, so the second order term is unnecessary for these data. Instead, you should look at the first order model’s statistics in the lower half of Table 4. There, you see that the confidence intervals contain the anticipated zero intercept and slope of one. The correlation coefficients for both models are very close to one, but as stated earlier, they don’t really provide the best information about linearity in this case.
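The curvature check can be sketched in a few lines: fit the second order model of equation three by least squares and form a 95% confidence interval for the second order coefficient with n - 3 degrees of freedom. The ten data pairs below are hypothetical, not the values in Table 3, and the helper names are my own. A statistics package would do the same job with less code.

```python
from math import sqrt

def invert_3x3(m):
    """Invert a 3x3 matrix by Gauss-Jordan elimination with partial pivoting."""
    n = 3
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_quadratic(knowns, assessments, t_crit):
    """Fit A = b0 + b1*K + b2*K^2; return coefficients and a CI for b2."""
    n = len(knowns)
    x = [[1.0, k, k * k] for k in knowns]
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * a for row, a in zip(x, assessments)) for i in range(3)]
    inv = invert_3x3(xtx)
    b = [sum(inv[i][j] * xty[j] for j in range(3)) for i in range(3)]
    sse = sum((a - sum(bi * xi for bi, xi in zip(b, row))) ** 2
              for row, a in zip(x, assessments))
    s2 = sse / (n - 3)           # residual variance, n - 3 degrees of freedom
    se_b2 = sqrt(s2 * inv[2][2])  # standard error of the curvature term
    return b, (b[2] - t_crit * se_b2, b[2] + t_crit * se_b2)

# Hypothetical knowns and assessments, NOT the article's Table 3.
knowns = [0.05 * i for i in range(1, 11)]
assessments = [0.068, 0.108, 0.157, 0.199, 0.247,
               0.288, 0.337, 0.378, 0.427, 0.468]
T_975_DF7 = 2.365  # two-sided 95% t critical value, 7 degrees of freedom

coefs, (b2_lo, b2_hi) = fit_quadratic(knowns, assessments, T_975_DF7)
print(f"b0 = {coefs[0]:.4f}, b1 = {coefs[1]:.4f}, b2 = {coefs[2]:.4f}")
print(f"95% CI for b2: ({b2_lo:.4f}, {b2_hi:.4f})")
print("evidence of curvature" if not (b2_lo <= 0 <= b2_hi) else "no evidence of curvature")
```

If the interval for b2 contains zero, drop the second order term and go back to the straight-line fit and its confidence intervals.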
You might speculate that the equipment manufacturer has two departments developing the analytical machinery. One department produces a machine that works very hard to provide highly accurate results, and the other department creates parallel software to check on the accuracy and guide the analyst toward correction when the accuracy doesn’t match expectations. This is all well intended, but the departments don’t know that they are sometimes working at cross purposes. Even when the machine gets results that are well within expectation, given the inherent variation, the software tells the analyst to modify them.
Now, in defense of those who produce and sell analytical equipment and its accompanying black box software, you must remember that both workload and speed of results are of the essence. Technicians cannot be expected to spend hours on calibration. So the morning’s scanty exercise of only three calibration points might not be such a bad thing as a matter of routine.
Such a check might portend disaster, but I couldn’t recommend it as the only routine or as the basis for inverse prediction. Rather, there should be a more extensive calibration study involving more levels of the known.
Automated equipment or not, and black box software or not, the point is that a plan for the routine assessment and maintenance of accuracy is essential. Without it, you’re just busy with numbers.
- The author thanks J. Richard Trout for his helpful improvement ideas on this column.
Lynne B. Hare is a statistical consultant. He holds a doctorate in statistics from Rutgers University in New Brunswick, NJ. He is past chairman of the ASQ Statistics Division and a fellow of both ASQ and the American Statistical Association.