## 2020

STATISTICS ROUNDTABLE

# Transforming Data

## Use care when transforming variables in multivariate analyses

By Robert L. Mason and John C. Young

What separates a multivariate analysis from a univariate analysis in process control? As simple as this question might appear, the answer can sometimes be difficult to understand.

A univariate analysis is commonly defined as the analysis of a data set containing only one variable. In this case, it is generally assumed the variable of interest is not influenced by other variables. When other variables do influence the variable, good experimental practice requires that controls be placed on the influential variables to limit their effect.

For example, suppose it is known that the performance of a piece of lab equipment varies with the percentage of humidity in the air. Before collecting data using this piece of equipment, it would be necessary to implement some type of humidity control to restrict its influence. Without controlling humidity in the proximity of this sensitive lab equipment, you might obtain biased data.

The most common definition of a multivariate analysis is that it involves the analysis of a group of variables that form a correlated set. The variables are considered simultaneously as a group, and they interact with one another and move together. A change in one variable can produce a change in one or more of the other variables.

For example, a change in temperature will produce a corresponding change in atmospheric pressure. In a multivariate analysis, the relationship between two variables is expressed using a pairwise correlation coefficient, and the relationship between an individual variable and a group of variables is expressed using a multiple correlation coefficient.

The separation between univariate and multivariate procedures appears to be straightforward, but the distinction is not always clear-cut. For example, the *p* correlated variables in a multivariate data set can often be transformed by an appropriate mathematical transformation into *p* independent variables, each of which can be monitored separately using univariate procedures.

You start with a complex multivariate process with *p* variables and mathematically reduce it to a set of *p* independent transformed variables. The transformed variables will not be the same as the original variables, but instead might consist of linear combinations of the original variables, that is, constants multiplied by sums and differences of the original variables.
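The article does not name a specific method for building such a transformation. One common construction, offered here only as an assumption, uses the eigenvectors of the covariance matrix (the principal components). The symbols $\Sigma$ (covariance matrix), $\boldsymbol{\mu}$ (mean vector), $P$ (matrix of eigenvectors), and $\Lambda$ (diagonal matrix of eigenvalues) are introduced for illustration and do not appear in the original text.

```latex
\Sigma = P \Lambda P^{\top}, \qquad
\mathbf{z} = P^{\top}(\mathbf{x} - \boldsymbol{\mu}), \qquad
\operatorname{Cov}(\mathbf{z}) = P^{\top} \Sigma P = \Lambda
  = \operatorname{diag}(\lambda_1, \dots, \lambda_p)
```

Because $\Lambda$ is diagonal, the components of $\mathbf{z}$ are uncorrelated. They are also independent if the original variables follow a multivariate normal distribution.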

Consider a set of data consisting of two variables, such as height and weight. The relationship between the two variables is depicted in the scatterplot in Figure 1. The correlation coefficient between the two variables for this data set is r = 0.7004. The plot and the positive correlation indicate that as height increases, weight also increases. Together, these two variables form a bivariate system.

Using the data of Figure 1, the following linear combinations of height and weight (Equation 1) will produce two independent variables, $z_1$ and $z_2$, in which $z_1$ and $z_2$ are given by:

$$z_1 = 0.7071\,(\text{height}) + 0.7071\,(\text{weight}) = 0.7071\,(\text{height} + \text{weight})$$

$$z_2 = 0.7071\,(\text{height}) - 0.7071\,(\text{weight}) = 0.7071\,(\text{height} - \text{weight})$$
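The decorrelating effect of Equation 1 can be checked with a short calculation. The sketch below assumes that height and weight have been standardized to unit variance, which the article does not state. With unequal variances, the covariance of the two sums and differences is half the difference of the variances, so these particular coefficients would not generally remove the correlation. The helper names `matmul` and `transpose` are illustrative only.

```python
import math

R = 0.7004             # height-weight correlation reported in the text
C = 1 / math.sqrt(2)   # 0.7071, the coefficient in Equation 1

# Assumption: both variables are standardized, so the covariance
# matrix equals the correlation matrix.
sigma = [[1.0, R],
         [R, 1.0]]

# Rows hold the coefficients of z1 and z2 from Equation 1.
A = [[C, C],
     [C, -C]]

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    """Return the transpose of a matrix stored as nested lists."""
    return [list(row) for row in zip(*m)]

# Covariance matrix of (z1, z2) is A * sigma * A^T.
cov_z = matmul(matmul(A, sigma), transpose(A))

# Dot product of the two coefficient rows; zero means perpendicular.
row_dot = sum(a * b for a, b in zip(A[0], A[1]))

print("cov(z1, z2):", round(cov_z[0][1], 6))
print("var(z1), var(z2):", round(cov_z[0][0], 4), round(cov_z[1][1], 4))
print("row dot product:", round(row_dot, 6))
```

Under this assumption the off-diagonal term vanishes, and the variances of the two transformed variables are 1 + r and 1 − r.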

These linear combinations are considered an orthogonal transformation because the two resulting vectors are perpendicular to each other in the geometric sense. Because of this transformation, $z_1$ and $z_2$ can be monitored separately as independent variables rather than monitoring the correlated height and weight variables together as a single multivariate system.

### Correlated to independent

Because orthogonal transformations are readily available and univariate methods are easier to use and understand, you might ask, "Why not monitor every multivariate process by transforming the correlated variables to independent variables?" Such an approach would definitely avoid the complexities involved in performing a multivariate analysis.

A main objection to using transformations is that the transformed variables might be more difficult to interpret. When a signal is detected through a statistical control chart of the transformed variables, it might not be easy to determine the source of the signal in terms of the individual process variables or a subgroup of the process variables.

For example, suppose process temperature plays an important role in a number of the transformed variables. If the temperature drifts too high, the statistical control procedure might indicate signals for several of the transformed variables. This would make it extremely difficult to determine whether the temperature increase was the problem.
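A small numerical sketch shows how a drift in one original variable spreads across the transformed variables. It reuses the rotation of Equation 1 on a hypothetical pair of standardized readings, temperature and a second variable called pressure here. The pairing, the drift size, and the function name `transform` are assumptions made for illustration.

```python
import math

C = 1 / math.sqrt(2)  # 0.7071, the coefficient from Equation 1

def transform(x1, x2):
    """Apply the Equation 1 rotation to a pair of original variables."""
    return C * (x1 + x2), C * (x1 - x2)

# Hypothetical standardized readings, both sitting at their targets.
baseline = transform(0.0, 0.0)

# Temperature alone drifts upward by 2 units; pressure stays on target.
shifted = transform(2.0, 0.0)

shift_z1 = shifted[0] - baseline[0]
shift_z2 = shifted[1] - baseline[1]
print(round(shift_z1, 4), round(shift_z2, 4))
```

Both transformed variables move by the same amount even though only temperature changed, so a chart on either one cannot by itself point back to temperature as the source.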

As a counterargument, situations exist in which the transformed variables might have more meaning than the original variables. For example, suppose company A supplies gas on a regular basis to company B, and both companies record the amount of gas supplied.

Which record is used to determine payment? Denote the amount monitored by company A as $y_1$ and the corresponding amount monitored by company B as $y_2$. The two companies seldom, if ever, observe an equal value between $y_1$ and $y_2$. The positive correlation between the two readings, however, is very high, with the value of $y_1$ being consistently higher than $y_2$.

Suppose the orthogonal transformation given in Equation 1 is made on the two variables, $y_1$ and $y_2$. The two new transformed variables are denoted as $z_1$ and $z_2$. As in Equation 1, the variable $z_1$ will be a function of the sum of $y_1$ and $y_2$, that is, $z_1 = 0.7071\,(y_1 + y_2)$. The second transformed variable $z_2$ will be a function of the difference of $y_1$ and $y_2$, that is, $z_2 = 0.7071\,(y_1 - y_2)$.

Starting with $z_2$, let's examine the interpretation of the two transformed variables. A value of zero for $z_2$ indicates perfect agreement between the two companies in the measured amount of gas, that is, $y_1 = y_2$. A univariate control procedure on this variable would indicate when the two companies disagreed. In contrast, the variable $z_1$ is a function of the sum of the two original variables, that is, $y_1 + y_2$. Because it is proportional to the average of the two readings, it is the best guess as to the true amount of gas delivered and received when there is agreement between the two systems.

### Monitoring variables

This leads to another important question: Can you monitor an independent set of variables using multivariate procedures? In the above problem, why not monitor the transformed variables $z_1$ and $z_2$ as a multivariate system, that is, as a vector observation ($z_1$, $z_2$)?

A main reason for monitoring *p* variables as a multivariate system is to take advantage of the intercorrelations among the variables. For an independent set of variables, there are no intercorrelations to analyze. Yet most multivariate control procedures, by their construction, will place restrictions on these independent variables. This is illustrated by an example comparing the two separate and independent univariate Shewhart control regions on $z_1$ and $z_2$ to the single multivariate control region on ($z_1$, $z_2$) as determined using Hotelling's $T^2$ statistic.$^1$

The box in Figure 2 represents the joint Shewhart control regions for $z_1$ and $z_2$ obtained by treating the two variables independently. The control region for $z_1$ is along the horizontal axis and the control region for $z_2$ is along the vertical axis. Assuming an error rate of 0.0027 for each chart, the upper control limits (UCL) and lower control limits (LCL) are located at ±3 for $z_1$ and $z_2$. Thus, the probability that both Shewhart variables are inside the box formed by these limits is approximately [1 − 2(0.0027)] = 0.9946.

The circle in Figure 3 represents the multivariate control region for the observation vector ($z_1$, $z_2$) as determined using the $T^2$ statistic. To make the error rate of this circular region agree with the error rate of the box in Figure 2, we use the same error rate of 2(0.0027) = 0.0054 for the circle. The corresponding multivariate critical value, based on a chi-square distribution with two degrees of freedom and an error rate of 0.0054, is 10.4427. The square root of this value is 3.23. This is the point at which the circle intercepts the two axes in Figure 3. With this error rate, the probability of being inside the circle in Figure 3 is 0.9946, which is identical to the probability of being inside the box in Figure 2.

Observe the discrepancy between the two procedures. For example, suppose that $z_2$ is at its mean value of zero and $z_1$ is at an extreme value of 3.10. The Shewhart procedure produces the correct result: $z_1$ is out of control, and $z_2$ is in control. However, the multivariate $T^2$ procedure, because its control region is confined to a circle, gives an incorrect result. Because the point (3.10, 0) lies inside the circular control region of radius 3.23, you would conclude the observation is in control.

The reason for this difference is evident. In the circle, the value of one variable is conditional on the value of the other variable. This restriction does not exist for the boxed region formed by the two independent variables. The conclusion is that you should not use multivariate procedures to monitor a group consisting of only independent variables because such an approach might not always be accurate.
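The comparison of the box and the circle can be reproduced with a few lines of arithmetic. The sketch below assumes that the two transformed variables are independent and standardized (mean 0, variance 1) with known parameters, so that the sum of their squares follows a chi-square distribution with two degrees of freedom. The function names are illustrative and do not come from the article.

```python
import math

ALPHA_PER_CHART = 0.0027   # error rate of each Shewhart chart (from the text)
SHEWHART_LIMIT = 3.0       # +/-3 limits on each standardized variable

# Overall error rate the text assigns to both the box and the circle.
alpha_joint = 2 * ALPHA_PER_CHART

# Exact probability that two independent charts are both inside their limits.
box_probability = (1 - ALPHA_PER_CHART) ** 2

# For a chi-square variable X with 2 degrees of freedom,
# P(X > c) = exp(-c / 2), so the critical value is c = -2 * ln(alpha).
t2_critical = -2.0 * math.log(alpha_joint)
circle_radius = math.sqrt(t2_critical)

def shewhart_signals(z1, z2, limit=SHEWHART_LIMIT):
    """Return (z1 out of control, z2 out of control) for separate charts."""
    return abs(z1) > limit, abs(z2) > limit

def t2_signals(z1, z2, critical=t2_critical):
    """Return True if the point falls outside the circular T^2 region."""
    return z1 ** 2 + z2 ** 2 > critical

print(round(box_probability, 4), round(t2_critical, 4), round(circle_radius, 2))
print(shewhart_signals(3.10, 0.0))
print(t2_signals(3.10, 0.0))
```

The computed critical value and radius agree with the 10.4427 and 3.23 quoted in the text. For the point (3.10, 0), the separate charts flag the first variable while the circular region does not signal, which is the discrepancy described above.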

### Careful consideration

We have shown that the *p* variables of a multivariate process can be mathematically transformed to *p* new independent variables. We recommend making such transformations only when the answer to the control problem makes more sense in the transformed space than in the original variable space.

### Reference

1. Robert L. Mason and John C. Young, *Multivariate Statistical Process Control With Industrial Applications*, ASA-SIAM, 2002.

**Robert L. Mason** is an institute analyst at Southwest Research Institute in San Antonio. He has a doctorate in statistics from Southern Methodist University and is a fellow of ASQ and the American Statistical Association.

**John C. Young** is a retired professor of statistics at McNeese State University in Lake Charles, LA. He received a doctorate in statistics from Southern Methodist University.
