Gage R&R Reminders
Running gage repeatability and reproducibility studies properly
by Lynne B. Hare
The specification limits for a major characteristic of a popular brand were 3% and 4%. Product below 3% lacked consumer appeal, and product above 4% had reduced shelf life. The analytical method variation, expressed as plus or minus two standard deviations, occupied 40% of the specification range.
As a result of this high method variability, some in-specification batches were rejected, and some out-of-specification batches were released to the marketplace. Batch rejection resulted in scrap and rework, while releasing out-of-specification product sparked consumer complaints and probably loss of sales.
"What to do?" wondered the technical specialist. The statistician suggested averaging the results of multiple samples, thereby reducing the standard deviation by the inverse square root of the sample size.
For instance, the mean of four samples would have a standard deviation equal to the original standard deviation divided by two. It follows that decisions based on averages would be correct more often than those based on individual observations.
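The statistician's suggestion can be checked with a few lines of arithmetic. The sketch below uses a hypothetical single-measurement standard deviation of 0.1 (not a figure from the column) and shows how the standard deviation of the mean shrinks by the inverse square root of the sample size:

```python
import math

sigma = 0.1  # hypothetical single-measurement standard deviation
for n in (1, 4, 9, 16):
    se = sigma / math.sqrt(n)  # standard deviation of the mean of n samples
    print(f"n = {n:2d}: standard deviation of the mean = {se:.4f}")
```

With four samples the standard deviation of the mean is half the original, as the column states; quadrupling the sample size again is needed to halve it once more.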
The technical specialist took the averaging suggestion to manufacturing personnel, who pushed back, saying it was all they could do to take single measurements. A suggestion that manufacturing take samples less frequently but in larger numbers was rejected because the characteristic might change between sampling times. "With such large data variation, how would anyone ever know?" the statistician wondered.
A familiar story
This is one of many such stories that result from a misunderstanding of the presence and nature of variation inherent in all processes. Fortunately, programs such as lean Six Sigma have been on the rise. One of their usual features is measurement systems analysis (MSA), which teaches that variation can be broken down into component parts:
- Process or product variation.
- Measurement system variation.
Measurement system variation is broken down into accuracy and precision, and precision can be further broken down into repeatability and reproducibility, as shown in Figure 1. Within a measurement system, accuracy is the proximity of the mean measured value to the true value, and precision is the closeness of replicated observations. Replication can take the form of the same person measuring the same sample with the same instrument (repeatability) or different people measuring the same sample with the same instrument (reproducibility).
The measurement of accuracy usually takes place in an analytical laboratory and is often carried out by choosing samples that span the range of interest and using a regression analysis to find a calibration curve that relates the measurements to a "true," or reference, standard value. Multiple samples at each level of the reference standard values are taken, and analytical bias is evaluated as the departure from a 45° line defining the ideal relationship between measured and reference standard values.
In practice, the true value may never be known exactly, but a substitute for the true value can be found. While calibrating rapid moisture measurement equipment, for example, vacuum oven moisture is considered an excellent surrogate for the absolute truth. Because vacuum oven moisture is time consuming, quicker methods are regressed on it as a standard and then used for day-to-day operations.
Gage R&R study
Precision, with its component parts, is frequently measured using a gage repeatability and reproducibility (R&R) study. By the way, the word "gage" is a modernization of "gauge" from late Middle English (1375–1425). Feel free to use that tidbit at your next party.
A typical gage R&R study might be a crossed design in samples by operators and with replicated measures by each operator. In other words, each sample is tested at least in duplicate by each of at least two operators.
Sources of variation in this case are "samples" to measure production variation, "operators" to determine if they show any glaring differences, "samples by operators" to determine if operator differences depend on which sample you have in mind and "residual variation," which is taken as repeatability.
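For a balanced crossed design like this, the variance components can be recovered from the ANOVA mean squares via the expected-mean-square equations for the random-effects model. The sketch below is illustrative only: the mean squares are hypothetical numbers, not values from Table 1, and negative estimates are truncated at zero, a common convention:

```python
# Hypothetical mean squares from a balanced crossed samples-by-operators
# ANOVA; the numbers below are illustrative, not from Table 1.
p, o, r = 10, 3, 2  # samples, operators, replicates per cell

MS_sample, MS_operator, MS_interaction, MS_error = 0.0120, 0.0008, 0.0004, 0.0002

# Expected-mean-square equations for the crossed random-effects model
var_repeatability = MS_error
var_interaction = max((MS_interaction - MS_error) / r, 0.0)
var_operator = max((MS_operator - MS_interaction) / (p * r), 0.0)
var_sample = max((MS_sample - MS_interaction) / (o * r), 0.0)

var_reproducibility = var_operator + var_interaction
var_total = var_sample + var_reproducibility + var_repeatability

for name, v in [("sample", var_sample),
                ("reproducibility", var_reproducibility),
                ("repeatability", var_repeatability)]:
    print(f"{name:15s} variance = {v:.6f} ({100 * v / var_total:.1f}% of total)")
```

Gage R&R software performs this decomposition automatically; the point of spelling it out is that every component rests on the same mean squares, so a poor estimate of one contaminates the percentages reported for all.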
One purpose of a gage R&R study is to quantify variation due to these sources early in a Six Sigma project so the Green Belt (GB) or Black Belt (BB) is assured of tracking the real problem instead of mistakenly chasing measurement variation.
Candidate GBs and BBs learn this technology during training, which is almost always too short, and are then set loose from the classroom into the real world—some timidly and some with great bravado—to solve its problems.
What could possibly go wrong?
Consider this sample problem. Table 1 shows data resulting from a gage R&R study of water activity of a food product. Water activity is a measure of the degree of association of water with other substances. The lower the water activity of a food product, the less inclined that food is to support the growth of microorganisms.
Water activity, therefore, can be a critical measurement in determining food safety because it measures the susceptibility of a food product to be attacked by bacteria before being attacked by you. There’s another one-liner for your next party.
Notice that 10 food samples are evaluated in duplicate by three operators. An examination of output from one highly reliable software vendor shows the repeatability component of variation, in standard deviation terms, to be 0.0049. So what? Well, the software also indicates that it constitutes 48.6% of the sum of the standard deviations corresponding to the variance components from the analysis of variance.
Wait a minute. If almost 49% of the variation is due to the measurement system, what good is the measurement system? Should you call the chemists? Find a better measurement system?
Not so fast. The limits of the 95% confidence interval about the repeatability percentage standard deviation are 13.1% (lower) and 68.4% (upper). This speaks volumes about the uncertainty associated with the estimate. Look further: The 49% is 49% of the total, which is a measure of the combined variation due to samples, operators, the sample-by-operator interaction and replicates. For the 49% to be valid as a means of judging the measurement repeatability, the rest of the numbers must be valid, too.
Let’s start with the samples. Are they truly representative of production—short and long-term? If they are chosen from a single batch or as one from each of the most recent batches, they are probably not truly representative of production. As a matter of fact, it is likely their variation is an underestimate of the actual production variation. If that is the case, the estimated percentage of repeatability is overstated.
Second, do you have enough samples in your gage R&R data set to give you a reasonable level of confidence in the estimate of variation? Many of the training data sets stop at 10, as does the one in Table 1. That may not be sufficient for the desired level of confidence.
Figure 2 shows 95% confidence intervals about a standard deviation of two with sample sizes running from five to 30. Notice how the upper bound of the intervals descends rapidly as the sample size increases, and it begins to level off between sample sizes of 15 and 20.
This suggests that to avoid pitfalls of uncertainty, those planning gage R&R studies should consider samples numbering at least 15 to 20, with some level of assurance that these samples are, indeed, representative of actual production.
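The behavior shown in Figure 2 follows from the chi-square sampling distribution of the sample variance. A minimal sketch, assuming normally distributed data and using SciPy for the chi-square quantiles, reproduces the pattern for a standard deviation of two:

```python
import math
from scipy.stats import chi2  # assumed available for chi-square quantiles

s, alpha = 2.0, 0.05  # standard deviation and interval level, as in Figure 2
for n in (5, 10, 15, 20, 30):
    df = n - 1
    lower = s * math.sqrt(df / chi2.ppf(1 - alpha / 2, df))
    upper = s * math.sqrt(df / chi2.ppf(alpha / 2, df))
    print(f"n = {n:2d}: 95% CI for sigma = ({lower:.2f}, {upper:.2f})")
```

Running this shows the upper bound dropping steeply at first and flattening out between 15 and 20 samples, which is the basis for the planning advice above.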
The lesson of Figure 2 also applies to other sources of variation. But practical constraints limit the number of operators. Planners may not have a choice in this regard. It must be pointed out that these operators are human, and they talk to each other, asking questions such as: "What did you get for sample seven?"
To avoid cross-talk-induced bias, code the samples so they are blind to operators. A next level of coding sophistication is the coding of samples so they are also blind to the person presenting them to the operators. Double-blind testing of this kind is often used in clinical trials in the pharmaceutical industry.
The operator-by-sample interaction term measures differences among operators from sample to sample. It should not be large, but if it is, it serves as a warning of sample dependent operator results. Some operator training or retraining may be in order.
Software gage R&R output often gives probabilities of misclassification when specifications are input. This includes the probability of declaring a result out of specification given that it is actually in and the probability of declaring a result in specification when it is actually out. Naturally, these should both be small.
Lastly, the definitions of R&R should be remembered. Repeatability is the variability resulting from the same operator reading the same sample, while reproducibility is the variability resulting from multiple operators reading the same sample.
For some kinds of testing, the measurement process is destructive so a sample cannot be measured more than once. The best you can do is blend a large sample and then subdivide it into near-uniform parts.
If this is your situation, or if you are concerned about some of the other pitfalls of gage R&R, you might want to discuss them with your local friendly statistician. Maybe at the same party where you discuss gage, gauge and water activity.
The author thanks JMP and Minitab for the use of their software and Keith Eberhardt for reviewing the column.
Lynne B. Hare is a statistical consultant. He holds a doctorate in statistics from Rutgers University in New Brunswick, NJ. He is past chairman of the ASQ Statistics Division and a fellow of ASQ and the American Statistical Association.