Implementing Multivariate Statistical Process Control Using Hotelling's T2 Statistic
by Robert L. Mason and John C. Young
Multivariate statistical process control (MVSPC) can be defined as the application of multivariate statistical procedures for the purpose of increasing the quality and productivity of a business. These techniques have found application in many areas of both the service and manufacturing industries. For example, MVSPC has been used to monitor manufacturing processes, patient care, disease outbreaks and customer satisfaction.
One of the most popular multivariate control procedures is based on Hotelling's T2 statistic, the multivariate analogue of the univariate Shewhart statistic [1]. The T2 statistic allows you to monitor many process variables at once because it treats them simultaneously, as a group of variables that interact with one another.
A control procedure based on the T2 statistic takes note of the fact that a change in one variable can cause a rippling effect throughout an entire system. Because it considers the interrelationships among the variables, the T2 statistic produces a powerful tool that is useful in detecting subtle system changes.
Many ask how to apply the T2 statistic when implementing a full-scale MVSPC. In particular, users want to know where the control procedure should be applied in the process. These questions, among other issues, are the topics of this article.
Implementation of a univariate Shewhart control procedure is straightforward. After the variable to be charted is selected, a preliminary data set consisting of observations from an in-control process is obtained. The major purpose of this data set is to provide estimates of the mean and standard deviation of the charting variable. These estimates are used in establishing a preliminary control procedure that is further used to clean the data of atypical observations.
When this purging of bad observations is accomplished, the resulting data set, labeled the historical data set (HDS), is used to obtain the sample estimates of the mean and standard deviation needed for establishing the control procedure. New observations are then monitored using a control procedure based on these estimates. The construction of the HDS is referred to as a Phase I operation, and the monitoring of new observations is termed a Phase II operation.
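The univariate Phase I and Phase II operations just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-standard-deviation limits, the function names and the data values are hypothetical choices, not part of any particular software package.

```python
from statistics import mean, stdev

def shewhart_limits(data, k=3.0):
    """Return (lower, center, upper): the mean plus or minus k standard deviations."""
    center = mean(data)
    spread = stdev(data)  # sample standard deviation
    return center - k * spread, center, center + k * spread

def build_hds(preliminary, k=3.0):
    """Phase I: drop points outside the limits, re-estimate, repeat until none remain."""
    data = list(preliminary)
    while True:
        lower, _, upper = shewhart_limits(data, k)
        kept = [x for x in data if lower <= x <= upper]
        if len(kept) == len(data):
            return data
        data = kept

# Hypothetical preliminary data with one atypical reading (25.0).
preliminary = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7,
               10.0, 10.1, 9.9, 10.2, 9.8, 25.0]
hds = build_hds(preliminary)
lower, center, upper = shewhart_limits(hds)

# Phase II: flag new observations that fall outside the HDS-based limits.
new_obs = [10.05, 9.2, 10.4]
signals = [x for x in new_obs if not (lower <= x <= upper)]
```

In practice a flagged point would be investigated for cause before it is removed; the loop above only mimics the arithmetic.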
Due to the many variables involved in a multivariate process, a Phase I operation for a T2 statistic involves several more steps than one for a univariate Shewhart procedure. In general, processes have three components: input, processing and output. This is depicted for a hospital system in Figure 1.
Many variables are associated with each process component. In turn, the practitioner must determine where to locate the control procedure. We suggest placing it in the area where problems may already exist, or where a problem would have serious consequences. For example, a control procedure for an industrial process might be established to detect inconsistencies in the feed stock, to track movement of the processing variables and to maintain the quality of production on the output component.
The steps involved in the planning stage of a Phase I operation are listed in Figure 2. They involve establishing goals, studying and mapping the process, and obtaining information on the variable relationships. The personnel who actually operate the system are excellent sources for this information.
Data collection stage
Once the plan is completed, it's time to evaluate the preliminary data set. This stage consists of verifying the data quality by examining the data for human or electronic recording errors.
This can be accomplished most easily by utilizing the graphical tools of a statistical computer package. Outlying data points can be identified and, if necessary, removed, and relationships between variables can be more carefully examined.
In order to achieve a better linear relationship among the variables, it may be necessary to re-express some variables in more appropriate forms such as logarithms or power functions. Theoretical knowledge of these relationships is helpful in this effort, but if such information is unavailable or does not exist, then decisions should be based on the empirical evidence provided by the HDS.
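As a small illustration of re-expressing a variable, the sketch below compares the linear correlation before and after a logarithmic transformation. The data are hypothetical (y is constructed to grow exponentially with x), and the `pearson` helper is written out only to keep the example self-contained.

```python
import math
from statistics import mean

def pearson(x, y):
    """Sample correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y grows exponentially with x, so the raw relationship is curved.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [math.exp(0.5 * v) for v in x]

r_raw = pearson(x, y)                          # correlation on the original scale
r_log = pearson(x, [math.log(v) for v in y])   # correlation after taking logarithms
```

Here the logarithm makes the relationship exactly linear, so `r_log` exceeds `r_raw`. Real process data will rarely be this clean, and a scatter plot remains the best first check.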
Any remaining problems also must be considered and addressed. When information is missing, for example, one might substitute an estimate for the missing components, or simply delete the observations containing the missing information. The operations involved in the data collection stage are presented in Figure 3.
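The two ways of handling missing information can be sketched as follows. The data are hypothetical, `None` is assumed to mark a missing value, and the column mean is used as the substitute estimate only because it is the simplest choice; it ignores the relationships among the variables, so a more careful estimate may be preferable.

```python
from statistics import mean

# Hypothetical observations on three variables; None marks a missing value.
rows = [
    [5.1, 20.0, 1.2],
    [4.9, None, 1.1],
    [5.3, 21.0, 1.3],
    [5.0, 19.0, None],
    [5.2, 20.0, 1.2],
]

# Option 1: delete every observation that has a missing component.
complete_rows = [r for r in rows if None not in r]

# Option 2: substitute the mean of the observed values of that variable.
col_means = [mean(v for v in col if v is not None) for col in zip(*rows)]
imputed_rows = [
    [m if v is None else v for v, m in zip(r, col_means)]
    for r in rows
]
```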
Detecting data problems
The next stage in implementing a multivariate control procedure consists of detecting data problems. Unlike the previously mentioned data collection problems, these data problems can affect the use and performance of the T2 statistic and must be thoroughly investigated.
The T2 statistic for an observation on p variables, $X' = (x_1, x_2, \ldots, x_p)$, is given by $T^2 = (X - \bar{X})' S^{-1} (X - \bar{X})$, where the sample mean vector $\bar{X}$ represents a measure of the process center.
The sample covariance matrix S provides information on individual variable variation and on the correlation between the components of the observation vector. As in univariate control, both estimates are obtained from the preliminary data.
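A minimal sketch of this calculation, assuming NumPy is available, is given below. The function name `t2_values` and the data are hypothetical. The sample mean vector and covariance matrix are estimated from the data unless they are supplied.

```python
import numpy as np

def t2_values(data, mean=None, cov=None):
    """Return the T2 value of each row of `data` (one row per observation)."""
    data = np.atleast_2d(np.asarray(data, dtype=float))
    if mean is None:
        mean = data.mean(axis=0)           # sample mean vector
    if cov is None:
        cov = np.cov(data, rowvar=False)   # sample covariance matrix S (divisor n - 1)
    diff = data - mean
    # Solve S z = (x - mean) for each row instead of forming the inverse of S.
    solved = np.linalg.solve(cov, diff.T).T
    return np.einsum("ij,ij->i", diff, solved)

# Hypothetical preliminary data: n = 8 observations on p = 2 correlated variables.
preliminary = np.array([
    [10.0, 20.1], [10.5, 21.2], [9.8, 19.5], [10.2, 20.3],
    [10.9, 21.9], [9.5, 19.2], [10.1, 20.4], [10.4, 20.6],
])
t2 = t2_values(preliminary)
```

A handy check on any implementation: when the mean vector and S are estimated from the same n observations, the T2 values sum to p(n - 1), which is 14 for this example.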
The use of the T2 statistic requires that the covariance estimate S contain no redundancies among the process variables, because S must be inverted. A redundancy occurs when two variables are perfectly (or nearly perfectly) correlated. A data redundancy can usually be removed by deleting one of the variables from the study. Several software packages, such as QualStat and SAS, which are used in MVSPC, offer procedures for locating and resolving these problems.
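One simple way to look for such redundancies is to scan the sample correlation matrix for pairs of variables whose correlation is essentially plus or minus one. The sketch below assumes NumPy; the threshold of 0.999, the function name and the data are hypothetical choices, and this is not a description of the procedures offered by the packages named above.

```python
import numpy as np

def near_perfect_pairs(data, threshold=0.999):
    """Return index pairs (i, j), i < j, whose absolute correlation meets the threshold."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    p = corr.shape[0]
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(corr[i, j]) >= threshold]

# Hypothetical data: column 2 is column 0 recorded in different units.
rng = np.random.default_rng(0)
x0 = rng.normal(50.0, 2.0, size=30)
x1 = rng.normal(10.0, 1.0, size=30)
x2 = 1.8 * x0 + 32.0
data = np.column_stack([x0, x1, x2])

pairs = near_perfect_pairs(data)

# Resolve each redundancy by deleting the second variable of the pair.
drop = sorted({j for _, j in pairs})
reduced = np.delete(data, drop, axis=1)

# The condition number of S shows how close the matrix is to being non-invertible.
cond_before = np.linalg.cond(np.cov(data, rowvar=False))
cond_after = np.linalg.cond(np.cov(reduced, rowvar=False))
```

A pairwise scan will not catch a variable that is a combination of several others; a very large condition number for S is a broader warning sign.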
The use of T2 as a control statistic for MVSPC requires the observations to be independent. In most applications, this requirement is easily satisfied. However, in certain industrial applications a time dependency may exist between the observations. This is usually labeled as a form of autocorrelation.
Numerous statistical procedures are available for detecting autocorrelation. Time-sequence plots, as presented in Figure 4, are one such popular tool.
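A numerical companion to the time-sequence plot is the lag-one sample autocorrelation of each variable. The sketch below is one common way to compute it; the function name and the two series are hypothetical, and the rough guide that values beyond about 2 divided by the square root of n deserve a closer look is an approximation, not a formal test.

```python
import math
from statistics import mean

def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation of a sequence at the given lag."""
    m = mean(series)
    denom = sum((v - m) ** 2 for v in series)
    num = sum((series[t] - m) * (series[t - lag] - m)
              for t in range(lag, len(series)))
    return num / denom

# Hypothetical series: a steady drift and a strictly alternating pattern.
drifting = [float(t) for t in range(20)]
alternating = [1.0, -1.0] * 10

r_drift = lag_autocorrelation(drifting)      # strongly positive
r_alt = lag_autocorrelation(alternating)     # strongly negative

# Rough guide only: values beyond about 2 / sqrt(n) suggest time dependence.
rough_bound = 2.0 / math.sqrt(len(drifting))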
Detection of autocorrelation does not prohibit the use of the T2 control procedure. One must simply adjust the data for the presence of the dependency [2]. The issues involved in detecting data problems are summarized in Figure 5.
With the investigation and resolution of all data problems, the user is ready to purge the preliminary data set of statistical outliers. An outlier is an observation that is far removed from the bulk of the data.
The T2 statistic is a measure of the (squared) statistical distance of the observation vector from the sample mean vector. This distance is computed relative to the variable relationships, or scatter of the points, as given by the covariance matrix S.
Like straight-line (Euclidean) distance, the T2 statistic is univariate: it reduces each multivariate observation to a single number. Observations with large T2 values are potential outliers because a large value implies that the observation is located at a great statistical distance from the data center.
To determine what is a large distance, use the probability function that describes the random behavior of the T2 statistic. The purging procedure consists of calculating the T2 value for each observation and comparing it to a critical distance value, labeled the upper control limit (UCL). Observations with a T2 > UCL are removed after investigation for cause; otherwise they are retained.
The process is continued until a homogeneous data set is obtained. This data set becomes the HDS and provides the estimates of the sample mean vector $\bar{X}$ and the covariance matrix S to be used to construct the T2 control statistic for monitoring future observations.
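The purging cycle can be sketched as a loop: compute the T2 values, compare them with the UCL, remove the observations that exceed it, re-estimate and repeat. The sketch assumes NumPy and is restricted to p = 2 variables so that the Phase I limit, ((n - 1)^2 / n) times the upper quantile of a beta distribution with parameters 1 and (n - 3)/2, can be written in closed form; for general p a beta quantile function from a statistics library would be needed. The function names, the significance level of 0.01 and the data are hypothetical.

```python
import numpy as np

def t2_values(data):
    """T2 of each row, using the mean vector and S estimated from the same rows."""
    diff = data - data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    return np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)

def phase1_ucl_bivariate(n, alpha=0.01):
    """Phase I UCL for p = 2: ((n-1)^2 / n) * upper-alpha quantile of Beta(1, (n-3)/2)."""
    b = (n - 3) / 2.0
    return (n - 1) ** 2 / n * (1.0 - alpha ** (1.0 / b))

def purge(data, alpha=0.01):
    """Repeat: flag rows with T2 > UCL, remove them, re-estimate. Returns (hds, removed)."""
    data = np.asarray(data, dtype=float)
    removed = []
    while True:
        if len(data) <= 3:
            raise ValueError("too few observations left to estimate the limit")
        keep = t2_values(data) <= phase1_ucl_bivariate(len(data), alpha)
        if keep.all():
            return data, removed
        # In practice each flagged row is investigated for cause before removal.
        removed.extend(data[~keep])
        data = data[keep]

# Hypothetical preliminary data: 39 correlated points plus one gross outlier.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=40)
y = 0.9 * x + rng.normal(0.0, 0.3, size=40)
preliminary = np.column_stack([x, y])
preliminary[-1] = [4.0, -4.0]   # runs against the positive correlation

hds, removed = purge(preliminary)
```

The planted point is not extreme on either variable by a wide margin, but it lies far from the data center in statistical distance because it contradicts the correlation pattern.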
Figure 6 summarizes the steps involved in constructing a historical data set. In general, the analytical part of implementation can be handled with software packages such as QualStat and SAS. When these steps are completed and the HDS is constructed, monitoring of new observations can begin. This is the start of the Phase II operation.
The corresponding T2 control procedure in this phase of operations is based on a different UCL than that used in the Phase I operation. The UCL value for a Phase I operation is based on a beta distribution whereas the UCL value for a Phase II operation is based on an F-distribution. Otherwise, the two procedures are very similar.
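The two limits are not written out above. For reference, the forms usually given for individual observations are shown below, where n is the number of observations, p is the number of variables, and $B_{\alpha}$ and $F_{\alpha}$ denote upper $\alpha$ quantiles of the beta and F distributions. The subscripts I and II are labels introduced here for the two phases.

```latex
% Phase I: purging the preliminary data; mean vector and S estimated from the same n observations
\mathrm{UCL}_{\mathrm{I}} = \frac{(n-1)^2}{n}\; B_{\alpha;\; p/2,\; (n-p-1)/2}

% Phase II: monitoring a new observation against estimates from an HDS of size n
\mathrm{UCL}_{\mathrm{II}} = \frac{p\,(n+1)(n-1)}{n\,(n-p)}\; F_{\alpha;\; p,\; n-p}
```

The difference arises because a Phase I observation is also used to compute the estimates it is compared against, while a Phase II observation is independent of them.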
After the T2 values for the new observations are computed, they are compared to this new UCL. A signal is declared for an observation when the value exceeds the UCL. Results are exhibited in a T2 chart such as the one presented in Figure 7.
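A Phase II check can be sketched as follows, again assuming NumPy and restricting attention to p = 2 variables so that the F-based limit has a closed form. The HDS estimates, the HDS size of 50, the significance level and the new observations are all hypothetical.

```python
import numpy as np

# Hypothetical HDS estimates for p = 2 variables from n = 50 observations.
n = 50
hds_mean = np.array([100.0, 50.0])
hds_cov = np.array([[4.0, 3.6],
                    [3.6, 4.0]])   # standard deviation 2 for each variable, correlation 0.9

def phase2_ucl_bivariate(n, alpha=0.01):
    """Phase II UCL for p = 2: [2(n+1)(n-1) / (n(n-2))] * F quantile, in closed form."""
    return (n + 1) * (n - 1) / n * (alpha ** (-2.0 / (n - 2)) - 1.0)

def t2_new(obs, mean, cov):
    """T2 of each new observation relative to the HDS mean vector and covariance matrix."""
    diff = np.atleast_2d(np.asarray(obs, dtype=float)) - mean
    return np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)

new_obs = np.array([
    [101.0, 51.0],   # both slightly high, consistent with the correlation
    [103.0, 53.0],   # both 1.5 standard deviations high, still consistent
    [103.0, 47.0],   # each only 1.5 standard deviations out, but in opposite directions
])
t2 = t2_new(new_obs, hds_mean, hds_cov)
ucl = phase2_ucl_bivariate(n)
signals = t2 > ucl   # only the third observation signals
```

The third observation would pass a pair of separate three-sigma Shewhart charts, yet its T2 value of 45 is far above the limit of roughly 10.6 because it breaks the relationship between the two variables.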
Because of the inherent complexity of multivariate data, implementing a multivariate control system based on a T2 statistic is more complicated than initiating a univariate control system. The benefits, however, far exceed the additional effort. Using the T2 statistic with MVSPC not only allows for the monitoring of individual variables, but also provides an excellent technique for determining when the relationships between variables are disrupted. This has led to a multitude of successful applications of this methodology in many different industries.
QualStat is a product of InControl Technologies Inc.
SAS and JMP are products of SAS Institute Inc.
1. Robert L. Mason and John C. Young, "Why Multivariate Statistical Process Control?" Quality Progress, December 1998, pp. 88-93.
2. Robert L. Mason and John C. Young, "Improving the Sensitivity of the T2 Statistic in Multivariate Process Control," Journal of Quality Technology, Vol. 31, No. 2, pp. 155-165.
ROBERT L. MASON is a staff analyst in the statistical analysis group at Southwest Research Institute in San Antonio. He earned a doctorate in statistics from Southern Methodist University in Dallas. Mason is an ASQ Fellow.
JOHN C. YOUNG is president of InControl Technologies Inc. in Houston and a statistics professor at McNeese State University in Lake Charles, LA. He earned a doctorate in statistics from Southern Methodist University in Dallas.