Detecting Dependent Observations In Multivariate Statistical Process Control
by Robert L. Mason and John C. Young
A multivariate process consists of two or more process variables that interact and are considered a simultaneous group. In a typical multivariate process, a change in one variable can create a change in another. For example, turn up the temperature, and the pressure goes up. Such relationships among variables are helpful in understanding the process, and many control chart procedures are designed to extract information about the process from these variable correlations.
In constructing most control charts used with a multivariate process, the analyst makes an inherent assumption that the observation vectors, unlike the variables being measured, are not dependent on one another. Since the validity of this data assumption is often a major concern, it would be useful to examine how we might detect a data dependency among the observations of a multivariate process.
One simple solution would be to examine the various components of the observation vectors using univariate statistics.1 For example, we might examine the variation contained in the time-sequence plot of the observations corresponding to each variable. Although this task could become very time consuming when the number of variables to consider is large, such an examination of the individual variables provides useful insight into the behavior of the process.
For example, consider two variables: the reactor inlet temperature and inlet pressure taken from an industrial chemical process. Figure 1 shows a stacked time-sequence plot of the two variables. Observe the data runs (successions of consecutive points falling entirely above or entirely below the mean value line) for both variables. These runs indicate a lack of randomness in the data and show that both variables contain some form of data dependency. This is not an unusual situation for process data taken from the chemical industry.
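The run behavior described above is easy to quantify. The following is a minimal sketch (using simulated data, since the article's temperature and pressure readings are not reproduced here) that counts the runs of points above and below a series mean; unusually few, long runs signal the kind of dependency seen in Figure 1.

```python
import numpy as np

def count_runs_about_mean(x):
    """Count runs of consecutive points on the same side of the series mean.

    Independent data produce many short runs; autocorrelated or trending
    data produce a few long ones, as in the Figure 1 traces.
    """
    x = np.asarray(x, dtype=float)
    signs = np.sign(x - x.mean())
    signs = signs[signs != 0]  # drop points exactly at the mean
    # a new run starts wherever the sign changes
    return int(1 + np.sum(signs[1:] != signs[:-1]))

# Independent noise tends to produce many short runs...
rng = np.random.default_rng(0)
print(count_runs_about_mean(rng.normal(size=200)))

# ...while a trending series collapses to very few long runs.
print(count_runs_about_mean(np.linspace(0.0, 1.0, 200)))
```

A formal runs test compares the observed count with the count expected under randomness, but even the raw number is often revealing.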
Detecting Data Dependencies In Multivariate Observations
A more informative way to determine when multivariate observations are dependent would be to use statistical tools. One method of detecting data dependencies in a multivariate process is to closely examine the plots and graphs of the control chart statistics being used.
For example, the plot in Figure 2 contains the values of the T2 statistic for each observation vector taken from a multivariate process. Similar to univariate control charting, if a set of multivariate observations taken from a process is independent and the process is in control, there should be no systematic patterns in the plot of the T2 values. This is not the case with the plot in Figure 2. Note the definite bowl-shaped pattern that occurs in the plot. This bowl shape is due to a strong linear trend in at least one of the variables.2
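For readers who want to reproduce this kind of plot, a minimal sketch of the T2 computation follows, using simulated in-control data (the article's process data are not reproduced). Each T2 value is the squared Mahalanobis distance of an observation vector from the sample mean.

```python
import numpy as np

def t2_values(X):
    """Hotelling's T2 for each observation vector in X (n x p).

    T2_i = (x_i - xbar)' S^{-1} (x_i - xbar), where xbar and S are the
    sample mean vector and covariance matrix.  For independent,
    in-control observations, a time-sequence plot of these values
    should show no systematic pattern (no bowl shape, trend, or runs).
    """
    X = np.asarray(X, dtype=float)
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    S_inv = np.linalg.inv(S)
    centered = X - xbar
    # quadratic form, evaluated row by row
    return np.einsum('ij,jk,ik->i', centered, S_inv, centered)

# Simulated in-control data: two correlated variables, no time trend.
rng = np.random.default_rng(42)
X = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=300)
t2 = t2_values(X)
```

Adding a linear trend to one column of X and replotting t2 against time reproduces the bowl-shaped pattern the article describes.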
Another example in Figure 3 shows the time-sequence plot of the residuals associated with a multiple regression model that relates one process variable to several others. If these residuals were independent, there would be no systematic pattern in this plot. Obviously, this is not the case.
Observe the presence of autocorrelation (for example, time correlation) among the residuals by noting the extended runs above and below the mean value of zero. The Durbin-Watson statistic,3 which can be used to test for the presence of a time correlation among the data points in a regression analysis, would be significant for these data.
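The Durbin-Watson statistic has a simple closed form. Below is a small sketch computing it on simulated residuals (the article's regression residuals are not reproduced); values near 2 are consistent with uncorrelated residuals, while values well below 2 indicate positive autocorrelation of the kind visible in Figure 3.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: d = sum((e_t - e_{t-1})^2) / sum(e_t^2).

    d is near 2 for uncorrelated residuals, well below 2 for positive
    autocorrelation and near 4 for negative autocorrelation.
    """
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

rng = np.random.default_rng(7)
white = rng.normal(size=500)
print(durbin_watson(white))   # typically close to 2

# Strongly positively autocorrelated residuals (AR(1) with phi = 0.9)
ar = np.empty(500)
ar[0] = white[0]
for t in range(1, 500):
    ar[t] = 0.9 * ar[t - 1] + white[t]
print(durbin_watson(ar))      # well below 2
```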
Because data dependencies occur between observations, it is often difficult to detect a data dependency such as autocorrelation in the scatterplot of two variables. For example, consider a scatterplot of the data for the two variables—temperature and pressure—used in Figure 1. This plot is presented as Figure 4. Notice the plot is devoid of the systematic patterns present in the two time-sequence plots in Figure 1. A multivariate statistical procedure can be used to help detect these data dependencies and test the independence of a set of multivariate observations. A description of one such method follows.
Estimation of Variation Using Multivariate Data
One of the most frequently used estimators of variation in a p-variable multivariate process is the square matrix S that represents the common covariance estimator.4 The p diagonal elements of S are the individual sample variances of the p process variables under consideration.
Each element is equivalent to the univariate sample variance value frequently used to measure the total variation in a sample for an individual variable. The off-diagonal elements of the matrix S are the sample covariances between a pair of the variables, and these measure the covariation between two variables.
As an example, consider the 2,062 observations on the reactor temperature and pressure variables plotted in Figure 4. We compute the sample covariance matrix S as well as the corresponding sample correlation matrix R for comparison. These are given as

    S = | 1177.2   53.4 |        R = | 1.000   0.203 |
        |   53.4   58.8 |            | 0.203   1.000 |

The diagonal elements of S are 1177.2 and 58.8, and these two values represent the sample variances of the temperature and pressure variables, respectively, in the sample. The off-diagonal element of S equals 53.4, and it is the sample covariance between these two variables. Using the R matrix, the sample correlation between the two variables is 0.203, which is obtained by dividing 53.4 by the product of the square roots of 1177.2 and 58.8.
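These computations are routine in any statistical package. The sketch below uses simulated stand-in data (the 2,062 actual readings are not reproduced; the assumed mean vector is hypothetical, while the covariance values come from the text) and verifies the covariance-to-correlation arithmetic just described.

```python
import numpy as np

# Hypothetical stand-in for the 2,062 temperature/pressure readings;
# the mean vector is assumed, the covariance values are from the text.
rng = np.random.default_rng(1)
data = rng.multivariate_normal(
    mean=[480.0, 95.0],
    cov=[[1177.2, 53.4], [53.4, 58.8]],
    size=2062,
)

S = np.cov(data, rowvar=False)       # sample covariance matrix
R = np.corrcoef(data, rowvar=False)  # sample correlation matrix

# The correlation is the covariance divided by the product of the
# two standard deviations (the square roots of the variances).
r12 = S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])
```

With the article's reported values, 53.4 divided by the square roots of 1177.2 and 58.8 gives 0.203, matching the R matrix.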
A multivariate counterpart to S is the successive-difference sample covariance matrix,5 labeled SD. This matrix has values similar to those given in the matrix S, except they are computed using the successive differences of consecutive (in the order of occurrence) observation vectors. The motivation for using SD comes from the univariate variance estimator known as the mean square successive difference (MSSD) estimator.6
Each diagonal variance component in SD is the same as the MSSD estimator of the variance for the corresponding variable. An off-diagonal covariance component in SD is a measure of the covariance between consecutive differences for the corresponding variable components of the observation vector. For our example data, the MSSD sample estimate SD and the corresponding sample correlation matrix RD are computed in the same manner; their key elements are examined below.
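The successive-difference estimator is straightforward to compute. The sketch below, using simulated independent data, forms SD from the differences of consecutive observation vectors and scales it to a correlation matrix; for independent observations SD should track the usual estimator S closely.

```python
import numpy as np

def mssd_covariance(X):
    """Successive-difference covariance estimator SD.

    SD = sum_t d_t d_t' / (2 * (n - 1)), where d_t = x_{t+1} - x_t are
    the differences of consecutive observation vectors.  Each diagonal
    element is the univariate MSSD variance estimate; the off-diagonal
    elements estimate the covariances from the consecutive differences.
    """
    D = np.diff(np.asarray(X, dtype=float), axis=0)  # (n-1) x p differences
    return (D.T @ D) / (2.0 * D.shape[0])

def to_correlation(C):
    """Scale a covariance matrix to the corresponding correlation matrix."""
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)

# For independent observations, SD stays close to the usual estimator S.
rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=4000)
SD = mssd_covariance(X)
RD = to_correlation(SD)
```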
Similar to a univariate process, little difference exists between the common estimator S and the MSSD estimator SD for a multivariate process when the observations are independent (and identically distributed). However, when there is a dependency among these observations, these two matrices will differ. A statistical test procedure exists for testing this hypothesis, and it is based on comparing S and SD.7
Returning to our example data, note the large differences between the corresponding elements of S and SD. For example, using S, the sample variance for the first variable is 1177.2, which is more than 10 times larger than the variance estimate, 109.1, obtained using the successive difference estimate given in SD. Also, note that the correlation between the temperature and pressure variables is only 0.203 using the values for the variances and covariance given in S, but the correlation increases to 0.904 when based on the corresponding estimates contained in SD.
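This kind of divergence between the two estimators is easy to demonstrate. The sketch below (a univariate illustration on a simulated autocorrelated series, not the article's data) shows the usual variance estimate inflating far beyond the MSSD estimate, mirroring the 1177.2 versus 109.1 gap reported above.

```python
import numpy as np

# With strongly autocorrelated observations, the usual variance
# estimator inflates relative to the MSSD estimator.
rng = np.random.default_rng(5)
n = 2000
e = rng.normal(size=n)
x = np.empty(n)
x[0] = e[0]
for t in range(1, n):             # AR(1) series with phi = 0.95
    x[t] = 0.95 * x[t - 1] + e[t]

s2 = np.var(x, ddof=1)                            # usual estimate
mssd = np.sum(np.diff(x) ** 2) / (2 * (n - 1))    # MSSD estimate
print(s2 / mssd)                  # far greater than 1
```

For independent data this ratio stays near 1; a large ratio is the univariate fingerprint of the dependency that the S versus SD comparison detects in the multivariate case.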
It is interesting that the strong correlation noted when using SD is not supported by the configuration of the points given in the scatterplot of the two variables presented in Figure 4. Examining the data using the test procedure for comparing S and SD yields a significant value and indicates that these two matrices are not statistically equivalent.
These results suggest that some type of nonrandom variation is present in the data. This conclusion confirms what we saw in the time-sequence plots of the two variables given in Figure 1. In this set of data, the observations are clearly not independent.
Since we must consider many variables at the same time when monitoring a multivariate process, detection of data dependencies between and among the observation vectors is not straightforward. The test procedures that exist for checking for the nonrandomness of the observation vectors are more complicated than those used with the corresponding univariate procedures and require knowledge of multivariate analysis.
The best approach for checking for data dependencies is to use multivariate statistical software that will perform the necessary computations. If you lack such tools, we recommend, at a minimum, you apply the graphic methods of univariate analysis (for example, time-sequence plots) to the individual variables of the multivariate observation vector to detect the various forms of autocorrelation that denote data dependencies. This simple visual approach should be of great value to you.
1. R.L. Mason and J.C. Young, "Dependent Univariate Observations and Statistical Control," Quality Progress, Vol. 40, 2007, pp. 62-64.
2. R.L. Mason, Y.M. Chou, J.H. Sullivan, Z.G. Stoumbos and J.C. Young, "Systematic Patterns in T2 Charts," Journal of Quality Technology, Vol. 35, 2003, pp. 47-55.
3. Bovas Abraham and Johannes Ledolter, Introduction to Regression Modeling, Thomson Brooks/Cole, 2006.
4. R.A. Johnson and D.W. Wichern, Applied Multivariate Statistical Analysis, fifth edition, Prentice Hall, 2002.
5. D.S. Holmes and A.E. Mergen, "An Alternative Method to Test for Randomness of a Process," Quality and Reliability Engineering International, Vol. 11, 1995, pp. 171-174.
6. H.R. Bellinson, J. von Neumann, R.H. Kent and B.I. Hart, "The Mean Square Successive Difference," Annals of Mathematical Statistics, Vol. 12, 1941, pp. 153-162.
7. D.S. Holmes and A.E. Mergen, "A Multivariate Test for Randomness," Quality Engineering, Vol. 10, 1998, pp. 505-508.
ROBERT L. MASON is an institute analyst at Southwest Research Institute in San Antonio. He received a doctorate in statistics from Southern Methodist University in Dallas and is a fellow of both ASQ and the American Statistical Assn.
JOHN C. YOUNG is a retired professor of statistics from McNeese State University in Lake Charles, LA. He received a doctorate in statistics from Southern Methodist University.