Multivariate Tools: Principal Component Analysis
by Robert L. Mason and John C. Young
One of the greatest benefits of multivariate thinking [1] and the application of multivariate methods is that they show how process variables are interconnected and interrelated. Here we explain how to apply and interpret one such tool, principal component analysis (PCA).
Most multivariate tools are not readily understood due to their mathematical complexity, so we’ll present what we consider a minimal mathematical approach in explaining what principal components (PCs) are and how they can be used to understand the interrelations between and among a group of process variables.
Principal Component Analysis
Consider a scatter plot of two standardized variables, x1 and x2, such as that presented in Figure 1. A boundary was constructed around the points, and a line was drawn through the widest part of the data. The equation of the line, in terms of x1 and x2, can be written as a11x1 + a12x2 = 0, where a11 and a12 are suitably chosen constants.
The linear combination on the left-hand side of this equation is the first principal component, PC1, and is given by PC1 = a11x1 + a12x2.
A line perpendicular to the first was then drawn through the second widest part of the data (see Figure 2). The linear combination associated with the second line is the second principal component, PC2, given by PC2 = a21x1 + a22x2.
The first PC is the linear combination of both variables that has the most variation among all such linear combinations. Similarly, the second PC is the linear combination of both variables that is uncorrelated with the first principal component and has the second largest amount of variation.
In geometric terms, the PCs are obtained by rotating the original coordinate axes (x1, x2) about the origin to a new axis system (PC1, PC2) that consists of the PCs. The exact form of the linear combinations needed to form these new axes is provided by the eigenvectors, a1 and a2, of the correlation matrix, R, associated with the data vector, x' = (x1, x2). A method for computing the eigenvalues and eigenvectors of R based on two variables is summarized in Figure 3. When this procedure is used, the resulting principal components are computed as PC1 = a1'x = a11x1 + a12x2 and PC2 = a2'x = a21x1 + a22x2.
A PC analysis is usually performed using the eigenvectors and eigenvalues of the correlation matrix of the involved variables because the correlation matrix is independent of scale. Working with the correlation matrix is equivalent to working with the standardized values of the variables, each of which has a mean of zero and a standard deviation of one. An alternative approach is to use the eigenvalues and eigenvectors of the covariance matrix of the involved variables, though this approach is scale dependent.
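The computation described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the procedure of Figure 3: the simulated data, the random seed and the function name principal_components are invented for the example.

```python
import numpy as np

def principal_components(data):
    """Eigenvalues, eigenvectors (as columns) and PC scores from the correlation matrix."""
    # Standardize each column to mean 0 and standard deviation 1.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    r = np.corrcoef(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(r)  # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]          # reorder so PC1 comes first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    scores = z @ eigenvectors                      # column i holds the values of PC(i+1)
    return eigenvalues, eigenvectors, scores

# Hypothetical data: two positively correlated process variables.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=500)
data = np.column_stack([x1, x2])

eigenvalues, eigenvectors, scores = principal_components(data)
score_cov = np.cov(scores, rowvar=False)
print("eigenvalues:", eigenvalues)
print("variances of the PCs:", np.diag(score_cov))
print("covariance between PC1 and PC2:", score_cov[0, 1])

# Changing the units of x1 leaves the correlation-based eigenvalues unchanged.
rescaled = data * np.array([1000.0, 1.0])
eigenvalues_rescaled, _, _ = principal_components(rescaled)
print("eigenvalues after rescaling x1:", eigenvalues_rescaled)
```

In this sketch the variances of the PC scores match the eigenvalues, the two PCs are uncorrelated, and rescaling x1 does not change the result. The same rescaling would change an analysis based on the covariance matrix.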
Principal Components and Statistical Control
Certain properties of PCs make them ideal statistics for process control. For example, when using two variables and their corresponding correlation matrix, the first PC is always a function of the sum of the two process variables, and the second PC is a function of the difference of the two process variables. (This result is reversed when there is a negative correlation between the two variables.)
An article in the Journal of Quality Technology describes an industrial application of this property in which two PCs are used as control statistics in comparing results from a main lab with the results of the corresponding unit lab [2]. This approach helps determine when the main lab and the unit lab are in agreement and how the process is to be monitored if the labs do agree.
If x1 and x2 have a correlation coefficient of r = 0.8428, the two principal components based on their corresponding correlation matrix, R, are defined as PC1 = 0.7071x1 + 0.7071x2 and PC2 = 0.7071x1 - 0.7071x2, where x1 and x2 are standardized variables.
In this setting, the variance of PC1 is defined by λ1 = (1 + r) = 1.8428, and the variance of PC2 is defined by λ2 = (1 - r) = 0.1572, where λ1 and λ2 are the eigenvalues of R.
If the correlation between x1 and x2
The eigenvalues of the matrix, R, play an important role in determining the percentage of variation explained by each PC. For the ith PC in the two-variable setting, in which i = 1 or 2, this percentage is given by [λi/(λ1 + λ2)] × 100%. Thus, the first PC explains [1.8428/2] × 100% = 92.14% of the total variation, and the second PC explains 7.86% of it.
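The two-variable example can be checked numerically. The sketch below builds the correlation matrix for r = 0.8428 and recovers the coefficients and the percentages of variation quoted above. The variable names are our own, and the sign convention (first coefficient of each eigenvector made positive) is an assumption, because the sign of an eigenvector is arbitrary.

```python
import numpy as np

r = 0.8428
R = np.array([[1.0, r],
              [r, 1.0]])

eigenvalues, eigenvectors = np.linalg.eigh(R)   # ascending order
order = np.argsort(eigenvalues)[::-1]           # largest eigenvalue first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Flip each eigenvector (column) so that its first coefficient is positive.
eigenvectors = eigenvectors * np.sign(eigenvectors[0, :])

percent = 100 * eigenvalues / eigenvalues.sum()

print("eigenvalues (1 + r, 1 - r):", np.round(eigenvalues, 4))
print("PC1 coefficients:", np.round(eigenvectors[:, 0], 4))
print("PC2 coefficients:", np.round(eigenvectors[:, 1], 4))
print("percent of variation:", np.round(percent, 2))  # 92.14 and 7.86
```

Because both diagonal elements of R equal one, the coefficients are 0.7071 in absolute value for any nonzero r. Only the eigenvalues, and therefore the percentages, depend on the size of the correlation.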
When there are more than two variables, say x1, x2, …, xp, many characteristics of the PCs can be extended. For example, the PCs are uncorrelated, are linear combinations of all p variables and explain the p dimensions of maximum variation down to minimum variation, from PC1 to PCp. Also, a number of existing multivariate control statistics can be expressed in PC form. For example, a Hotelling’s T² statistic can be decomposed in terms of the principal components as T² = PC1²/λ1 + PC2²/λ2 + … + PCp²/λp.
A T² statistic involving p variables and a control region is illustrated in Figure 4 in a space defined by the first three PCs. The points plotted outside the ellipsoidal control region are atypical of the data swarm and are prime candidates to be outliers.
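The decomposition of Hotelling's T² into principal components can be verified numerically. The following is a minimal sketch that assumes T² is computed for each observation from standardized values and the sample correlation matrix. The simulated three-variable data set and all variable names are hypothetical, and no control limit is computed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 200 observations on p = 3 correlated variables.
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.0, 1.0, 0.4],
                   [0.0, 0.0, 1.0]])
data = rng.normal(size=(200, 3)) @ mixing

# Standardize, then obtain the PCs from the correlation matrix.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
R = np.corrcoef(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)
scores = z @ eigenvectors                      # PC values for every observation

# T² as the sum of squared PCs, each divided by its eigenvalue.
t2_from_pcs = np.sum(scores**2 / eigenvalues, axis=1)

# T² computed directly as z' R^{-1} z for every observation.
t2_direct = np.einsum("ij,jk,ik->i", z, np.linalg.inv(R), z)

print("largest difference:", np.max(np.abs(t2_from_pcs - t2_direct)))
print("observation with the largest T²:", int(np.argmax(t2_from_pcs)))
```

The two computations agree to rounding error. Observations with the largest T² values are the ones that would plot farthest from the center of the data swarm.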
Singularities and Dimension Reduction
Another major use of a PC analysis is to locate singularities or near singularities, such as data redundancies (see Figure 5). The scatter plot contains two variables that are perfectly correlated, in which the correlation between x1 and x2 is 1.00, and the two variables form an exact linear relationship.
The red line in Figure 5 represents the first PC. However, there is no second dimension of variation in the data because the data do not extend in any other direction. This implies the amount of variation explained by the second PC or λ2 is zero. For each instance in which such an exact linear relationship occurs, one eigenvalue will be zero.
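A zero eigenvalue is easy to reproduce. In the sketch below, x2 is constructed as an exact linear function of x1, so the correlation matrix is singular. The data and the particular linear relationship are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 3.0 * x1 + 5.0                         # exact linear relationship with x1

R = np.corrcoef(x1, x2)                     # 2 x 2 correlation matrix
eigenvalues = np.linalg.eigvalsh(R)[::-1]   # largest eigenvalue first

print("correlation:", R[0, 1])
print("eigenvalues:", eigenvalues)          # second value is zero up to rounding error
```

In floating-point arithmetic the second eigenvalue may appear as a tiny nonzero number, so in practice it is compared with a small tolerance rather than with exactly zero.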
The exact singularity depicted in Figure 5 can be removed by eliminating either x1 or x2 from the study, which then reduces the dimension from two to one. Similarly, in the case of a near singularity in which an eigenvalue is close to but not exactly zero, it is often equally desirable to study the data in fewer dimensions. To illustrate this latter case, we examined a PC analysis for an industrial process containing six process variables, labeled X1, X2, X3, X4, X5 and X6. The eigenvalues and eigenvectors of the correlation matrix for a set of data from the process are given in Table 1.
Each row of the table contains the eigenvalue and coefficients of the standardized variables for the corresponding PC. The eigenvalues are the variances of the PCs. For example, the PC in row one accounts for 44.12% of the total variation. By inspecting the absolute magnitude of the coefficients of the standardized variables for each PC, we can determine which variables are contributing most to the variation. In this case, the largest element in absolute value for the first PC (0.5567) is associated with the variable X5, while the second largest element in absolute value (-0.5389) is associated with X6. Thus, these two variables contribute the most to the dimension with the maximum variation.
The first three PCs in Table 1 account for 78.49% of the total variation. If the process is studied in the space defined by these three principal components, the dimension of the problem can be reduced from six to three. Notice, however, the number of original variables is not affected since each of the three chosen PCs is a linear combination of all six variables. This type of dimension reduction is useful in studying processes with a large number of variables.
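The bookkeeping behind this kind of dimension reduction can be sketched as follows. The eigenvalues and the simulated six-variable data set below are made up for illustration and are not the values in Table 1. The helper names cumulative_percent and components_needed are our own.

```python
import numpy as np

def cumulative_percent(eigenvalues):
    """Cumulative percentage of total variation, largest eigenvalue first."""
    ordered = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return 100 * np.cumsum(ordered) / ordered.sum()

def components_needed(eigenvalues, target_percent):
    """Smallest number of leading PCs whose cumulative percentage reaches the target."""
    cumulative = cumulative_percent(eigenvalues)
    count = int(np.searchsorted(cumulative, target_percent)) + 1
    return min(count, len(cumulative))

# Made-up eigenvalues for six standardized variables (they sum to 6); NOT Table 1.
example_eigenvalues = [3.0, 1.5, 1.0, 0.3, 0.15, 0.05]
print(np.round(cumulative_percent(example_eigenvalues), 2))
k = components_needed(example_eigenvalues, 90.0)
print("PCs needed to explain 90% of the variation:", k)

# Project hypothetical standardized data onto the first k PCs.
rng = np.random.default_rng(3)
latent = rng.normal(size=(150, 3)) @ rng.normal(size=(3, 6))
data = latent + 0.5 * rng.normal(size=(150, 6))
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
values, vectors = np.linalg.eigh(np.corrcoef(data, rowvar=False))
vectors = vectors[:, np.argsort(values)[::-1]]   # largest eigenvalue first
reduced = z @ vectors[:, :k]                     # each column uses all six variables
print("shape of the reduced data:", reduced.shape)
```

The reduced data set has k columns instead of six, but all six original variables still have to be measured, because every retained PC is a linear combination of all of them.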
PCA is a data analysis technique used to describe the multivariate structure of the data [3]. PCs help identify meaningful underlying variables, reduce the dimensionality of a set of data and reveal linear relationships among the variables that can be harmful when other statistical procedures are used. PCA is a vital tool in process control and data mining activities.
References
1. R.L. Mason and J.C. Young, “Multivariate Thinking,” Quality Progress, April 2004, pp. 89-91.
2. N.D. Tracy, J.C. Young and R.L. Mason, “A Bivariate Control Chart for Paired Measurements,” Journal of Quality Technology, 1995, pp. 370-376.
3. J. Edward Jackson, A User’s Guide to Principal Components, John Wiley and Sons, 1991.
ROBERT L. MASON is an institute analyst at Southwest Research Institute in San Antonio, TX. He received a doctorate in statistics from Southern Methodist University and is a Fellow of both ASQ and the American Statistical Association.
JOHN C. YOUNG is president of InControl Technologies and a professor of statistics at McNeese State University in Lake Charles, LA. He received a doctorate in statistics from Southern Methodist University.