STATISTICS ROUNDTABLE | 2019

# Multivariate Tools: Principal Component Analysis

**by Robert L. Mason and John C. Young**

One of the greatest benefits of multivariate thinking^{1} and the application of multivariate methods is that they show how process variables are interconnected and interrelated. We would like to expound on the application and understanding of one such tool, known as principal component analysis (PCA).

Most multivariate tools are not readily understood due to their mathematical complexity, so we’ll present what we consider a minimal mathematical approach in explaining what principal components (PCs) are and how they can be used to understand the interrelations between and among a group of process variables.

### Principal Component Analysis

Consider a scatter plot of two standardized variables, x_{1} and x_{2}, such as that presented in Figure 1. A boundary was constructed around the points, and a line was drawn through the widest part of the data. The equation of the line, in terms of x_{1} and x_{2}, can be written as a_{11}x_{1} + a_{12}x_{2} = 0, where a_{11} and a_{12} are suitably chosen constants.

The linear combination on the left-hand
side of this equation is the first principal component, PC_{1}, and is given by PC_{1} = a_{11}x_{1} + a_{12}x_{2}.

A line perpendicular to the first was
then drawn through the second widest part of the data (see Figure 2). The linear
combination associated with the second line is the second principal component, PC_{2},
given by PC_{2} = a_{21}x_{1} + a_{22}x_{2}.

The first PC is the linear combination of both variables that has the most variation among all such linear combinations. Similarly, the second PC is the linear combination of both variables that is uncorrelated with the first principal component and has the second largest amount of variation.

In geometric terms, the PCs are obtained by rotating the original coordinate axes (x_{1}, x_{2}) about the origin to a new axis system (PC_{1}, PC_{2}) that consists of the PCs. The exact form of the linear combinations needed to form these new axes is provided by the eigenvectors, a_{1} and a_{2}, of the correlation matrix, R, associated with the data vector, x' = (x_{1}, x_{2}). A method for computing the eigenvalues and eigenvectors of R based on two variables is summarized in Figure 3. When this procedure is used, the resulting principal components are computed as PC_{1} = a_{11}x_{1} + a_{12}x_{2} and PC_{2} = a_{21}x_{1} + a_{22}x_{2}, where a_{1}' = (a_{11}, a_{12}) and a_{2}' = (a_{21}, a_{22}).

A PC analysis is usually performed using the eigenvectors and eigenvalues of the correlation matrix of the involved variables because the correlation matrix is independent of scale. Working with the correlation matrix is equivalent to working with the standardized values of the variables—a mean of zero and a standard deviation of one. An alternative approach is to use the eigenvalues and eigenvectors of the covariance matrix of the involved variables, though this approach is scale dependent.
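As a minimal sketch of this correlation-based approach (using numpy and a small synthetic data set, not any data from the article), the PCs can be obtained by rotating the standardized data onto the eigenvectors of R:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two correlated process variables (illustrative only)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
X = np.column_stack([x1, x2])

# Standardize each variable (mean 0, standard deviation 1); the
# covariance of Z is then the correlation matrix of X
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(X, rowvar=False)

# Eigenvalues/eigenvectors of R; eigh returns eigenvalues in
# ascending order, so reverse to put the largest (PC1) first
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The PCs are the standardized data projected onto the eigenvectors
PC = Z @ eigvecs

# For standardized variables, the eigenvalues sum to the number of variables
print(eigvals.sum())   # 2.0
```

Because the eigenvectors diagonalize R, the columns of `PC` are uncorrelated, with variances equal to the eigenvalues.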

### Principal Components and Statistical Control

Certain properties of PCs make them ideal statistics for process control. For example, when using two variables and their corresponding correlation matrix, the first PC is always a function of the sum of the two process variables, and the second PC is a function of the difference of the two process variables. (This result is reversed when there is a negative correlation between the two variables.)

An article in the *Journal of Quality Technology* describes an industrial application of this property in which two PCs are used as control statistics in comparing results from a main lab with the results of the corresponding unit lab.^{2} This approach helps determine when the main lab and the unit lab are in agreement and how the process is to be monitored if the labs do agree.

If x_{1} and x_{2} have a correlation coefficient
of r = 0.8428, the two principal components based on their corresponding correlation
matrix, R, are defined as PC_{1} = 0.7071x_{1} + 0.7071x_{2} and PC_{2} = 0.7071x_{1} - 0.7071x_{2},
where x_{1} and x_{2} are standardized variables.

In this setting, the variance of PC_{1} is defined by λ_{1} = (1 + r) = 1.8428, and the variance of PC_{2} is defined by λ_{2} = (1 - r) = 0.1572, where λ_{1} and λ_{2} are the eigenvalues of R. If the correlation between x_{1} and x_{2} were negative, the roles of the sum and the difference would be reversed, as noted earlier.

The eigenvalues of the matrix, R, play an important role in determining the percentage of variation explained by each PC. For the i^{th} PC in the two-variable setting, in which i = 1 or 2, this percentage is given by [λ_{i}/(λ_{1}+λ_{2})]100%. Thus, the first PC explains [1.8428/2]100% = 92.14% of the total variation, and the second PC explains 7.86% of it.
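These figures are easy to verify numerically. A quick check (using numpy) confirms that the eigenvalues of a 2 x 2 correlation matrix are 1 + r and 1 - r, and reproduces the explained-variation percentages above:

```python
import numpy as np

r = 0.8428
R = np.array([[1.0, r],
              [r, 1.0]])

# Eigenvalues of a 2x2 correlation matrix are exactly 1 + r and 1 - r
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
print(eigvals)       # [1.8428 0.1572]

# Percentage of the total variation explained by each PC
explained = 100 * eigvals / eigvals.sum()
print(explained)     # [92.14  7.86]
```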

When there are more than two variables, say x_{1}, x_{2}, …, x_{p}, many characteristics of the PCs can be extended. For example, the PCs are uncorrelated, are linear combinations of all p variables and explain the p dimensions of maximum variation down to minimum variation, from PC_{1} to PC_{p}. Also, a number of existing multivariate control statistics can be expressed in PC form. For example, a Hotelling's T^{2} statistic can be decomposed in terms of the principal components as T^{2} = PC_{1}^{2}/λ_{1} + PC_{2}^{2}/λ_{2} + … + PC_{p}^{2}/λ_{p}.
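This identity follows from the spectral decomposition of R. A short sketch (with an illustrative 3 x 3 correlation matrix of our own choosing) computes T^{2} both directly and through the PC decomposition and shows they agree:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative correlation matrix for p = 3 standardized variables
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])
eigvals, eigvecs = np.linalg.eigh(R)

# A standardized observation vector
x = rng.normal(size=3)

# T^2 computed directly: x' R^{-1} x
t2_direct = x @ np.linalg.inv(R) @ x

# T^2 computed from the PCs: sum of PC_i^2 / lambda_i
pcs = eigvecs.T @ x
t2_pcs = np.sum(pcs**2 / eigvals)

print(t2_direct, t2_pcs)   # identical up to rounding
```

Since R = VΛV', its inverse is VΛ^{-1}V', and x'R^{-1}x collapses to the sum of squared PC scores divided by their eigenvalues.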

A T^{2} statistic involving p variables and a control region is illustrated in Figure 4 in a space defined by the first three PCs. The points plotted outside the ellipsoidal control region are atypical of the data swarm and are prime candidates for outliers.

### Singularities and Dimension Reduction

Another major use of a PC analysis is to locate
singularities or near singularities, such as data redundancies (see Figure 5). The
scatter plot contains two variables that are perfectly correlated, in which the
correlation between x_{1} and x_{2} is 1.00, and the two variables form an exact linear
relationship.

The red line in Figure 5 represents the first PC. However, there is no second dimension of variation in the data because the data do not extend in any other direction. This implies the amount of variation explained by the second PC, λ_{2}, is zero. For each instance in which such an exact linear relationship occurs, one eigenvalue will be zero.
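A brief numerical check of this property (with synthetic data of our own construction): when one variable is an exact linear function of another, the smaller eigenvalue of their correlation matrix vanishes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two perfectly correlated variables: x2 is an exact linear function of x1
x1 = rng.normal(size=100)
x2 = 3.0 * x1 + 2.0
R = np.corrcoef(np.column_stack([x1, x2]), rowvar=False)

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
print(eigvals)   # approximately [2, 0]: the second eigenvalue vanishes
```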

The exact singularity depicted in Figure
5 can be removed by eliminating either x_{1} or x_{2} from the study, which then reduces
the dimension from two to one. Similarly, in the case of a near singularity in which
an eigenvalue is close to but not exactly zero, it is often equally desirable to
study the data in fewer dimensions. To illustrate this latter case, we examined
a PC analysis for an industrial process containing six process variables, labeled
X1, X2, X3, X4, X5 and X6. The eigenvalues and eigenvectors of the correlation matrix
for a set of data from the process are given in Table 1.

Each row of the table contains the eigenvalue and coefficients of the standardized variables for the corresponding PC. The eigenvalues are the variances of the PCs. For example, the PC in row one accounts for 44.12% of the total variation. By inspecting the absolute magnitude of the coefficients of the standardized variables for each PC, we can determine which variables are contributing most to the variation. In this case, the largest element in absolute value for the first PC (0.5567) is associated with the variable X5, while the second largest element in absolute value (-0.5389) is associated with X6. Thus, these two variables contribute the most to the dimension with the maximum variation.

The first three PCs in Table 1 account for 78.49% of the total variation. If the process is studied in the space defined by these three principal components, the dimension of the problem can be reduced from six to three. Notice, however, the number of original variables is not affected since each of the three chosen PCs is a linear combination of all six variables. This type of dimension reduction is useful in studying processes with a large number of variables.
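The mechanics of this reduction can be sketched as follows (the six-variable data here are synthetic, generated from three latent factors; they are not the Table 1 data):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: six process variables driven by three latent factors,
# so most of the variation lives in roughly three dimensions
n = 500
factors = rng.normal(size=(n, 3))
loadings = rng.normal(size=(3, 6))
X = factors @ loadings + 0.3 * rng.normal(size=(n, 6))

# Standardize and take the eigendecomposition of the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the first three PCs: the dimension drops from six to three,
# but each retained PC is still a linear combination of all six variables
scores = Z @ eigvecs[:, :3]

cum_explained = 100 * np.cumsum(eigvals) / eigvals.sum()
print(scores.shape)       # (500, 3)
print(cum_explained[2])   # percentage captured by the first three PCs
```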

PCA is a data analysis technique used to describe the multivariate structure of the data.^{3} PCs are helpful in identifying meaningful underlying variables and can be used to reduce the dimensionality of a data set and to detect linear relationships between the variables that can be harmful when other statistical procedures are used. PCA is a vital tool in process control and data mining activities.

### REFERENCES

1. R.L. Mason and J.C. Young, "Multivariate Thinking," *Quality Progress*, April 2004, pp. 89-91.
2. N.D. Tracy, J.C. Young and R.L. Mason, "A Bivariate Control Chart for Paired Measurements," *Journal of Quality Technology*, 1995, pp. 370-376.
3. J. Edward Jackson, *A User's Guide to Principal Components*, John Wiley and Sons, 1991.

**ROBERT L. MASON** is an institute analyst at Southwest Research Institute in San Antonio, TX. He received a doctorate in statistics from Southern Methodist University and is a Fellow of both ASQ and the American Statistical Association.

**JOHN C. YOUNG** is
president of InControl Technologies and a professor of statistics at McNeese State
University in Lake Charles, LA. He received a doctorate in statistics from Southern
Methodist University.
