## 2020

STATISTICS ROUNDTABLE

# Detection Decisions

## Select the right statistical methods to examine data, find outliers

by Robert L. Mason and John C. Young

A potential outlier is an observation located a considerable distance from the main data swarm. The inclusion of such outlying observations in a data analysis can produce erroneous estimates of means, variances and the correlations between variables.

In general, this distortion increases with the distance the point is located from the main data swarm. With a single variable, an outlier will be separate from and stand out on either end of the data set. This is usually readily apparent in data plots.

For example, consider a set of 81 observations of bottom sulfur readings from a chemical reactor. Figure 1 is a frequency histogram of these readings. Notice that the last two intervals—composed of the two largest observations—are somewhat removed from the cluster of the remaining intervals containing the other 79 observations. The inclusion of these two large observations in the data set will inflate the sample variance and increase the size of the sample mean. A Shewhart control procedure would designate these two observations as potential outliers.
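The effect on the sample mean and variance can be sketched in a few lines of Python. The readings below are made-up stand-ins for the 81 sulfur observations (the actual data are not reproduced here), and `shewhart_flags` is a hypothetical helper that applies a 3-sigma rule with the overall sample standard deviation. That is a simplifying assumption, not necessarily the exact Shewhart procedure behind Figure 1.

```python
import statistics

def shewhart_flags(values, k=3.0):
    """Return indices of values outside mean +/- k sample standard deviations."""
    center = statistics.mean(values)
    spread = statistics.stdev(values)
    lower, upper = center - k * spread, center + k * spread
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical readings: 79 values near 10 plus two unusually large ones.
bulk = [10 + 0.1 * ((i % 7) - 3) for i in range(79)]
readings = bulk + [13.0, 14.0]

# Both the mean and the variance grow when the two large readings are included.
print("mean without / with:", statistics.mean(bulk), statistics.mean(readings))
print("variance without / with:", statistics.variance(bulk), statistics.variance(readings))
print("flagged indices:", shewhart_flags(readings))
```

With a single variable, the two large readings are the only ones that fall outside the limits, even though they inflate the limits themselves.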

Outliers in an industrial process become more and more difficult to detect as the dimensionality of the data increases. Although an outlier may not stick out on the end of the data distribution for multiple variables, it will stick out somewhere.

For example, consider a set of bivariate data in which the outlying observation does stick out. Figure 2 is a scatterplot of the waist size and chest size of a random sample of 147 college students. Notice there are only 80 visible points in the plot because there are multiple observations at some points.

The two circled observations in Figure 2 are potential outliers. The first observation (27, 28) is marginally different from the others, but the second observation (47, 43) is definitely different from the others. Why does this latter observation stick out? Because it is the only observation in the data set in which the chest size of the college student is smaller than the waist size. It is interesting to note that a statistical control procedure using the *T*^{2} statistic to locate outliers designates the second observation as an outlier but does not designate the first observation as an outlier.
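To make the *T*^{2} calculation concrete, here is a minimal sketch for two variables, where the statistic for an observation is its squared distance from the sample mean vector, scaled by the inverse of the sample covariance matrix. The (waist, chest) pairs are invented for illustration and are not the 147 measurements plotted in Figure 2; `t2_values` is a hypothetical helper name.

```python
def t2_values(points):
    """Hotelling T^2 of each (x, y) pair, using the sample mean and covariance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    result = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # Quadratic form d' S^{-1} d written out for the 2 x 2 case.
        result.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return result

# Hypothetical (waist, chest) pairs in which chest exceeds waist by 6 to 10 units.
students = [(28 + i % 10, 28 + i % 10 + 6 + (3 * i) % 5) for i in range(40)]
# One added pair with chest smaller than waist, like the second circled point.
students.append((47, 43))

scores = t2_values(students)
largest = max(range(len(scores)), key=scores.__getitem__)
print("largest T^2 belongs to", students[largest])
```

Because the statistic accounts for the relationship between the two variables, the pair that breaks the usual waist-to-chest pattern receives the largest value.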

### Different ways to go

The search for observations that stick out in a multidimensional data set has led to the development of many different statistical procedures for outlier detection.

One example is the procedure based on
examining the data set in a subspace of the principal component space.^{1}
Principal components are linear combinations of the original variables that are
orthogonal to one another and are derived using either the correlation or
covariance matrix of the data.
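For two variables, the principal components can be written in closed form, which makes the definition easy to check. The sketch below assumes the covariance matrix (not the correlation matrix) and uses invented data; `principal_components_2d` is a hypothetical helper, and real analyses with more variables would rely on a numerical eigenvalue routine instead.

```python
import math

def principal_components_2d(points):
    """Principal components of (x, y) pairs from the sample covariance matrix."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

    # For a symmetric 2 x 2 matrix the leading eigenvector lies at this angle.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    first = (math.cos(angle), math.sin(angle))
    second = (-math.sin(angle), math.cos(angle))

    # The eigenvalues are the variances of the two components.
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    variances = (half_trace + radius, half_trace - radius)

    # Scores are the centered data projected onto each direction.
    scores = [
        ((x - mean_x) * first[0] + (y - mean_y) * first[1],
         (x - mean_x) * second[0] + (y - mean_y) * second[1])
        for x, y in points
    ]
    return variances, (first, second), scores

# Hypothetical correlated pairs, used only to exercise the function.
pairs = [(28 + i % 10, 34 + i % 10 + (3 * i) % 5) for i in range(40)]
variances, directions, scores = principal_components_2d(pairs)
share = variances[0] / sum(variances)
print("directions:", directions)
print("share of variation in first component:", round(share, 3))
```

The two directions are orthogonal, and the resulting scores are uncorrelated, which is what allows the components to be examined separately.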

The first few principal components of such data are sensitive to changes in variation and covariation of the variables, while the last few principal components are sensitive to strong collinearities in the data. Reducing the dimensionality of the data by using only the first two or three principal components often allows an analyst to visually locate, in principal component plots, the outliers that contribute to variation problems.

Another popular outlier detection procedure is based on using a control chart of the *T*^{2} statistic and designating points with *T*^{2} values outside the control limits as outliers.^{2}

When the sample data contain clusters of outliers, however, this statistic is subject to masking and swamping problems. Clustering outlier observations on the fringe of a data swarm is the main cause of problems called swamping and masking.

Swamping occurs when the cluster pulls the data swarm toward it. In doing so, non-outlying observations on the fringe, opposite the cluster, will appear farther from the data swarm and be designated as potential outliers. Masking occurs when the cluster of outliers pulls the data swarm toward it and inflates the estimates of the mean and covariance parameters in the directions of the cluster so individual observations within the cluster do not show up as outliers.
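Masking can be reproduced on a small scale with two variables. The sketch below uses invented data and two hypothetical helpers: `t2_values`, which computes *T*^{2} from the common mean and covariance estimates, and `phase1_limit_2d`, which assumes the usual phase I result for multivariate normal data that *T*^{2} scaled by n/(n - 1)^2 follows a beta distribution with parameters p/2 and (n - p - 1)/2. For p = 2 that quantile has a closed form, so no statistics library is needed.

```python
def t2_values(points):
    """Hotelling T^2 of each (x, y) pair, using the sample mean and covariance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    return [
        (syy * (x - mean_x) ** 2
         - 2 * sxy * (x - mean_x) * (y - mean_y)
         + sxx * (y - mean_y) ** 2) / det
        for x, y in points
    ]

def phase1_limit_2d(n, alpha=0.01):
    """Assumed phase I upper limit for T^2 with p = 2 variables.

    Uses the closed-form upper quantile of a Beta(1, (n - 3) / 2) distribution.
    """
    b = (n - 3) / 2
    return (n - 1) ** 2 / n * (1 - alpha ** (1 / b))

# Hypothetical in-control data: 50 observations on two variables.
base = [(i % 10, (3 * i) % 5) for i in range(50)]

# Case 1: a single outlier. Case 2: five outliers clustered around the same spot.
single = base + [(16, 2)]
offsets = [(0, 0), (0.5, 0.5), (-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5)]
cluster = base + [(16 + dx, 2 + dy) for dx, dy in offsets]

for label, data in (("one outlier", single), ("cluster of five", cluster)):
    limit = phase1_limit_2d(len(data))
    added = t2_values(data)[50:]
    print(label, [round(t, 2) for t in added], "limit", round(limit, 2))
```

In this made-up case the lone outlier exceeds its limit, while the same location occupied by five points stays below the limit because the cluster shifts the mean and inflates the covariance estimate in its own direction.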

Some of the more recent procedures for detecting multivariate outliers include those based on the use of robust estimators. Such outlier detection schemes are not subject to the masking and swamping problems that can plague methods based on common estimators.

And outliers do not have the same ill effects on the robust estimators as they do on the common estimators. Robust estimates of the variances, means and correlations of the related variables, however, may be far removed from the true values of these statistics. In addition, many of these procedures are of limited practical use because they can be computationally intensive.
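The contrast between common and robust estimators is easiest to see with one variable. The sketch below swaps in the median and the scaled median absolute deviation (MAD) as a simple stand-in for the multivariate robust estimators discussed above; it is not one of those procedures. The data, the helper names and the 3-unit cutoff are assumptions for illustration, and 1.4826 is the usual factor that makes the MAD comparable to a standard deviation for normal data.

```python
import statistics

def classical_flags(values, k=3.0):
    """Indices more than k sample standard deviations from the mean."""
    center = statistics.mean(values)
    spread = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - center) > k * spread]

def robust_flags(values, k=3.0):
    """Indices more than k scaled MADs from the median."""
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    spread = 1.4826 * mad
    return [i for i, v in enumerate(values) if abs(v - center) > k * spread]

# Hypothetical data: 50 values near 10 followed by a cluster of eight near 12.
values = [10 + 0.1 * ((i % 7) - 3) for i in range(50)]
values += [12.0, 12.1, 12.2, 12.0, 12.1, 12.2, 12.0, 12.1]

print("classical:", classical_flags(values))
print("robust:   ", robust_flags(values))
```

Here the cluster inflates the mean and standard deviation enough to hide itself from the classical rule, while the median and MAD are barely moved by it.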

### Better detection schemes

A preliminary step
that would improve these outlier detection schemes is to follow the procedure
recommended when creating a multivariate statistical control procedure for an
industrial process.^{3}

In a preliminary data analysis (that is, a phase I analysis) of such a procedure, a set of data is obtained under good operational conditions as judged by the process engineer. The data set then is subjected to a detailed data analysis from numerous perspectives. Charts, graphs and plots are used to locate unusual patterns and clusters in the data set. When these occur, irregularities are investigated for cause. This type of detailed data analysis will remove many of the outliers and data abnormalities that could be difficult for classical statistical procedures to detect.

To illustrate this approach, consider a preliminary data set consisting of 55 observations on four variables. Suppose no detailed data analysis has been performed to search for data abnormalities. With α = 0.01, we use the *T*^{2} statistic based on the common estimates of the mean vector and covariance matrix to scan for potential outliers. Figure 3 shows the *T*^{2} control chart.

Because
none of the *T*^{2} values signal in the chart in Figure 3, you might conclude that no outliers are present in the data. If we had plotted the data using the first three principal components of the correlation matrix and scanned the resulting plot, however, a completely different conclusion would result. The first three principal components explain more than 98% of the total variation present in this data set. A principal component plot for these three components is shown in Figure 4. The data swarm is enclosed in a 99% ellipsoid that corresponds to α = 0.01 and is equivalent to the *T*^{2} chart in Figure 3.

Although no observation is outside the *T*^{2}
ellipsoid in Figure 4, two different data clusters are clearly evident in the plot. The larger cluster (the blue points) consists of the first 50 observations plotted in the *T*^{2} chart in Figure 3. The smaller cluster (the red points) contains the last five observations located on the *T*^{2} chart. The outlying observations in the smaller cluster were added to an original data set of 50 observations (with no outliers) to illustrate how a cluster of outliers can mask the performance of the *T*^{2} statistic. A detailed data analysis based on use of the principal component plot in Figure 4 would have helped the analyst quickly spot this cluster of potential outliers.

These five observations would also have
been detected as potential outliers if a time-sequence plot of each of the four variables had been examined. These plots are shown in Figure 5. Notice the last five observations for variables *x*_{1}, *x*_{3} and *x*_{4} all have much lower values than the rest of the observations on each variable. These three variables are the ones that dominate the first three principal components. In contrast, the last five observations on variable *x*_{2} do not show any such change from the rest of the observations. This variable loads the heaviest on the fourth principal component, which has little influence on the results.

If you examine a plot of the data of the last three principal components, there is no evidence of the separation of the small cluster of outliers. This is shown in Figure 6. Notice the 50 observations (the blue points) in the large cluster now overlap with the five observations (the red points) in the small cluster.

Recall that plots of the first few principal components are most sensitive to changes in the variation and covariation among the variables. The data separation seen in the principal component plot in Figure 4 is caused by the influence of the five outliers associated with variables *x*_{1}, *x*_{3} and *x*_{4} on the overall variation based on the first principal component. As shown in Figure 6, these outliers do not have such an effect on the overall variation based on the last three principal components.

### References

1. Robert L. Mason and John C. Young, "Multivariate Tools: Principal Component Analysis," *Quality Progress*, February 2005, pp. 83-85.
2. Robert L. Mason and John C. Young, "Another Data Mining Tool," *Quality Progress*, February 2003, pp. 76-79.
3. Robert L. Mason and John C. Young, *Multivariate Statistical Process Control With Industrial Applications*, ASA-SIAM, 2002.

**Robert L. Mason** is an institute analyst
at Southwest Research Institute
in San Antonio, TX. He has a doctorate
in statistics from Southern Methodist
University in Dallas and is a fellow
of ASQ and the American Statistical
Association.

**John C. Young** is a retired professor
of statistics from McNeese State
University in Lake Charles, LA. He
received a doctorate in statistics from
Southern Methodist University.
