Dependent Univariate Observations And Statistical Control
by Robert L. Mason and John C. Young
Have you ever wondered why your process control procedure doesn’t work just right? It might be that the underlying assumptions are not completely valid.
For example, the assumption that the sample observations be collected from a normal distribution is often fundamental to the derivation of the distribution of the control statistic. In turn, the control statistic distribution provides information necessary to determine the control limits associated with the charting procedure. If the assumption of normality is invalid, the control limits could be incorrect.
Another underlying assumption common in developing a control procedure is that each sampled observation (or its transform) is statistically independent of the others. In terms of probability, this assumption translates into the need for independently and identically distributed observations. Using such observations allows you to avoid many problems in the distribution derivation process.
For example, with independent observations you don’t have to be concerned about mixtures of distributions or the correlation between observations. They also allow you to interpret systematic patterns in a control chart as signals rather than common process variation.
Many industries produce data streams in which the observations are highly dependent. The dependency usually arises from a sustained pattern of variation that is present even in the absence of special causes, and it often appears as a form of autocorrelation. Examples of this type of variation include:
- Process operations such as the ramping of a reactor
- Step-changes induced by load changes created by a demand factor
- Depletion of a critical component necessary for the operation of a production unit
Dependent observations are characteristic of these processes. However, you have to make sure that the dependency is not created by a reoccurring special cause.
Reoccurring special causes would create a dependency representing a process problem. Dependencies among the process observations will render a control procedure ineffective unless alternative analysis approaches are chosen. Thus, it is important to know what procedures can be used to detect such data dependencies.
Given a consecutive sequence of n observations, how do you determine whether they are independent of one another? One method for detecting a time dependency among the observations is to observe their variation in a time-sequence plot.
For example, the time-sequence plot of the variable x1 presented in the upper half of Figure 1 demonstrates random variation. The haphazard scatter in the plotted points indicates the observations are uncorrelated.
No pattern is discernible in the plot, and the observations vary at random around the mean value, which is represented by the horizontal line in the plot. Due to the randomness of the data, you would find it difficult to predict a future value from a past observation using this sample.
In contrast, the time-sequence plot of the variable x2 presented in the lower half of Figure 1 displays a time (serial) correlation among the observations. The data are observations on a process coolant in which the contributing factor in the correlation is the cyclical time trend in the ambient temperature.
Notice the observations are related: For an extended time period, observations above the mean value (represented by the horizontal line in the plot) tend to be followed by observations above the mean.
Similarly, observations below the mean are followed by other observations below the mean for an extended time period. Thus, we have several runs of observations above and below the mean value. In this case, we say the data are autocorrelated.
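The runs described above can also be counted directly rather than read off the plot. The following is a minimal sketch, not part of the original analysis: the function name `runs_about_mean` and the twelve data values are hypothetical, chosen only to illustrate the idea. For independent observations the number of runs is typically close to half the sample size, so a much smaller count points toward serial correlation.

```python
from itertools import groupby


def runs_about_mean(x):
    """Return the runs above/below the sample mean as (is_above, length) pairs.

    Observations exactly equal to the mean are skipped, since they fall on
    neither side of the center line.
    """
    mean = sum(x) / len(x)
    sides = [value > mean for value in x if value != mean]
    # groupby collapses consecutive equal flags into one run each.
    return [(is_above, len(list(group))) for is_above, group in groupby(sides)]


# Hypothetical readings that drift up and then down, as a cyclical series would.
readings = [71, 74, 78, 80, 77, 73, 69, 66, 65, 68, 72, 76]
runs = runs_about_mean(readings)
print(len(runs), "runs:", runs)
```

Only four runs appear in these twelve hypothetical readings, which is the kind of clustering visible in the lower half of Figure 1.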
A major diagnostic tool for detecting autocorrelation is the sample autocorrelation coefficient [1]. This value can be computed from the data and used to test the hypothesis that the population autocorrelation coefficient is zero. Lag autocorrelation values using different lag times can also be plotted in a correlogram [2].
An example of a correlogram for lag times ranging from one to 20 for the observations on x2 is presented in Figure 2.
For those unfamiliar with a lag correlation, it is obtained by pairing each sample observation with an observation at a fixed previous time (lag) and computing an ordinary correlation coefficient between the resulting paired data.
For example, a lag 1 correlation would correspond to the ordinary correlation between each sample observation and its immediate past value. From the correlogram in Figure 2, the lag 1 correlation has a value of 0.967. This is exceptionally high and indicates a strong time correlation between the adjacent observations in the sample.
For a lag 2 correlation, we would compute the ordinary correlation coefficient obtained by pairing each sample observation with the observation taken two time periods before it. The lag 2 correlation in Figure 2 has a value of 0.945. We could continue in this fashion to obtain all the lag correlation values given in the plot. The declining size of the lag correlations in Figure 2 as the lag time increases between observations confirms that the strongest time correlations exist between adjacent observations.
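The pairing procedure just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the software used to produce Figure 2: the names `pearson`, `lag_correlation` and `correlogram` are hypothetical, and the sine wave stands in for the coolant data, which are not reproduced here. Many statistical packages compute the sample autocorrelation with a slightly different formula that uses the overall mean and variance, so their values can differ a little from this pairwise version.

```python
import math


def pearson(a, b):
    """Ordinary correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / math.sqrt(var_a * var_b)


def lag_correlation(x, lag):
    """Correlate each observation with the one taken `lag` periods before it."""
    x = list(x)
    if lag < 1 or lag > len(x) - 2:
        raise ValueError("lag must be between 1 and len(x) - 2")
    current = x[lag:]      # observations at times lag+1, ..., n
    previous = x[:-lag]    # their partners at times 1, ..., n-lag
    return pearson(current, previous)


def correlogram(x, max_lag=20):
    """Lag correlations for lags 1 through max_lag, ready to be plotted."""
    return [lag_correlation(x, k) for k in range(1, max_lag + 1)]


# Stand-in data: one slow cycle over 300 time periods.
wave = [math.sin(2 * math.pi * t / 300) for t in range(300)]
print([round(value, 3) for value in correlogram(wave, max_lag=5)])
```

For a slowly cycling series like this one, the lag 1 value is close to one, much as it is for x2 in Figure 2.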
Another approach for detecting nonrandom variability in the data is to examine the variation of the process variables. For example, the plot of x1 in Figure 1 suggests only random variation is present and that this variation is only due to common causes. Thus, the variance of x1 is strictly a function of the variation of the random measurement error in the data.
However, when the observations appear related, as in the plot of x2 in Figure 1, there is an additional source, a special cause, contributing to the variation. The variance of x2 would be a function of both the measurement-error variation in the data and the variation due to the special cause. Because these latter two variances are positive, the variation among the observations on x2 is larger than the variation exhibited among the observations on x1.
When estimating the size of the total variation in a process, we can use the common sample variance estimator $s^2$ that is given by:

$$ s^2 = \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})^2 $$

in which n is the sample size, $x_j$ represents the value of the jth observation and $\bar{x}$ is the sample mean. The statistic $s^2$ measures the total variation of a variable, including both the random variation inherent in the process and the special cause variation, such as that induced in variable x2 due to the cyclical temperature.
Another estimator of variation is known as the mean square successive difference (MSSD) estimator [3]. This estimator, which we will label $q^2$, is given by:

$$ q^2 = \frac{1}{2(n-1)} \sum_{j=1}^{n-1} (x_{j+1} - x_j)^2 $$

in which $x_j$ and $x_{j+1}$ are successive observations. Because this estimator uses the successive differences of consecutive observations, it is useful in characterizing the variation when there is a gradual data trend causing a shift in the process mean. However, when there is only randomness in the data, the value of $q^2$ will be very similar to the value of $s^2$.
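Both estimators are simple to compute. The sketch below assumes the usual form of the MSSD estimator, in which the sum of squared successive differences is divided by 2(n - 1); the function names and the small data sets are hypothetical and serve only to show how the two estimates behave.

```python
def sample_variance(x):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(x)
    mean = sum(x) / n
    return sum((value - mean) ** 2 for value in x) / (n - 1)


def mssd_variance(x):
    """Mean square successive difference estimate of the variance.

    Assumes the usual divisor 2 * (n - 1), which makes the estimate
    comparable to the sample variance when the data are purely random.
    """
    n = len(x)
    squared_steps = sum((x[j + 1] - x[j]) ** 2 for j in range(n - 1))
    return squared_steps / (2 * (n - 1))


# A steady upward drift: the mean keeps shifting, the steps stay small.
drifting = [1, 2, 3, 4, 5]
print(sample_variance(drifting), mssd_variance(drifting))  # 2.5 0.5
```

With a steady drift, the successive differences stay small while the deviations from the overall mean grow, so the MSSD estimate falls well below the sample variance.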
A useful statistic [4] for comparing the two estimators above is given by:

$$ z = \frac{1 - r}{\sqrt{\dfrac{n-2}{(n-1)(n+1)}}} $$

in which $r = q^2/s^2$. When the process data are normally distributed and the sample size is 20 or more, the z statistic approximately follows a standard normal distribution when the observations are independent. Thus, we can compare the z score to a standard normal value to determine whether there is a trend in the data. Large values of z, such as those exceeding two or three, would indicate there is a trend in the data, and thus the data are not independent. Otherwise, the data could be considered independent.
For example, consider Figure 1, in which the observations on x1 are assumed to contain only random fluctuation. The computed sample variance of x1 is $s^2 = 0.901$, while the value of the successive difference variance estimator is $q^2 = 0.830$. For the n = 300 observations, r = 0.922 and z = 1.36. Because z is less than two, we could conclude there is little difference between these two estimates, and the data can be accepted as being independent.
In contrast, consider the observations on x2 in Figure 1. For these data, the values are $s^2 = 124.18$ and $q^2 = 4.10$. With n = 300 observations, r = 0.033 and z = 16.81. Because z is much greater than two, there is a clear indication the data are dependent. The range of coolant temperatures during the one-year period is 52 to 94°F with an average temperature of 73°F, and there is an obvious cycle in the data due to the ambient conditions.
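The two z values quoted above can be checked with a short calculation. The sketch assumes the test statistic has the form z = (1 - r) / sqrt((n - 2) / ((n - 1)(n + 1))), with r the ratio of the MSSD estimate to the sample variance. That form is an assumption here, although it reproduces the reported values to within the rounding of r. The function name `trend_z` is hypothetical.

```python
import math


def trend_z(r, n):
    """z score for the ratio r of the MSSD estimate to the sample variance.

    Assumed form: (1 - r) divided by sqrt((n - 2) / ((n - 1) * (n + 1))).
    """
    if n < 3:
        raise ValueError("at least three observations are required")
    standard_error = math.sqrt((n - 2) / ((n - 1) * (n + 1)))
    return (1.0 - r) / standard_error


# Ratios and sample size as reported for the two variables.
z_x1 = trend_z(0.922, 300)   # about 1.36
z_x2 = trend_z(0.033, 300)   # about 16.8

for name, z in (("x1", z_x1), ("x2", z_x2)):
    verdict = "dependent" if z > 2 else "no evidence of dependence"
    print(f"{name}: z = {z:.2f} -> {verdict}")
```

The small gap between the computed 16.80 and the reported 16.81 is what would be expected from r having been rounded to three decimal places.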
The total variation of the coolant temperature, as given by $s^2$, considers the cycle created by these ambient conditions. This occurs because the deviations used in $s^2$ are taken from the overall average and are greatly influenced by the cyclical rise and fall of the observations above and below the mean. As might be expected from such a temperature cycle, $s^2$ is large in magnitude and indicates there is much variation in the data. In contrast, the estimate provided by $q^2$ is small in size. This is because it ignores the cycle and considers only the consecutive differences between the adjoining observations. Consequently, $q^2$ is small in magnitude and underestimates the true variation.
An interval extending ± two standard deviations about the mean value of x2, computed as (50.624, 95.196) when based on s, contains 99% of the observations, whereas a similar interval, computed as (68.862, 76.958) when based on q, contains only 21% of the observations.
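The two intervals can be rebuilt from the summary values given earlier. In the sketch below, the center of 72.91 is not stated in the text; it is inferred as the midpoint of the reported interval and should be read as an assumption. The function names are hypothetical, and the coverage helper is included only to show how the 99% and 21% figures would be obtained from the raw observations, which are not reproduced here.

```python
import math


def sigma_interval(center, variance, k=2.0):
    """Interval extending k standard deviations on either side of the center."""
    half_width = k * math.sqrt(variance)
    return center - half_width, center + half_width


def coverage(x, center, variance, k=2.0):
    """Fraction of the observations in x that fall inside the interval."""
    low, high = sigma_interval(center, variance, k)
    return sum(low <= value <= high for value in x) / len(x)


# Assumed center: midpoint of the reported interval (50.624, 95.196).
center = (50.624 + 95.196) / 2

interval_s = sigma_interval(center, 124.18)  # based on the sample variance
interval_q = sigma_interval(center, 4.10)    # based on the MSSD estimate
print(interval_s, interval_q)
```

The endpoints agree with the reported ones to within a few thousandths, which is consistent with the variances having been rounded to two decimal places.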
- W.J. Kolarik, Creating Quality: Process Design for Results, WCB/McGraw-Hill, 1999.
- H.R. Bellinson, John von Neumann, R.H. Kent and B.I. Hart, “The Mean Square Successive Difference,” Annals of Mathematical Statistics, Vol. 12, 1941, pp. 153-162.
- D.S. Holmes and A.E. Mergen, “An Alternative Method to Test for Randomness of a Process,” Quality and Reliability Engineering International, Vol. 11, 1995, pp. 171-174.
ROBERT L. MASON is an institute analyst at Southwest Research Institute in San Antonio. He received a doctorate in statistics from Southern Methodist University in Dallas and is a fellow of both ASQ and the American Statistical Assn.
JOHN C. YOUNG is president of InControl Technologies and a professor of statistics at McNeese State University in Lake Charles, LA. He received a doctorate in statistics from Southern Methodist University.