Select the right statistical methods to examine data, find outliers
by Robert L. Mason and John C. Young
A potential outlier is an observation located a considerable distance from the main data swarm. The inclusion of such outlying observations in a data analysis can produce erroneous estimates of means, variances and the correlations between variables.
In general, this distortion increases with the distance the point is located from the main data swarm. With a single variable, an outlier will be separate from and stand out on either end of the data set. This is usually readily apparent in data plots.
For example, consider a set of 81 observations of bottom sulfur readings from a chemical reactor. Figure 1 is a frequency histogram of these readings. Notice that the last two intervals—composed of the two largest observations—are somewhat removed from the cluster of the remaining intervals containing the other 79 observations. The inclusion of these two large observations in the data set will inflate the sample variance and increase the size of the sample mean. A Shewhart control procedure would designate these two observations as potential outliers.
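As a rough illustration of how a few large readings distort the summary statistics, the sketch below uses made-up readings (not the 81 sulfur observations in Figure 1) and simple three-sigma limits computed from the sample mean and standard deviation. A Shewhart individuals chart would normally estimate sigma from moving ranges, so this is a simplification. The function name three_sigma_limits and the data are assumptions chosen for illustration only.

```python
import statistics

def three_sigma_limits(values, k=3.0):
    """Return (lower, upper) limits at the mean -/+ k sample standard deviations."""
    center = statistics.mean(values)
    spread = statistics.stdev(values)
    return center - k * spread, center + k * spread

# Hypothetical readings: 30 typical values plus one unusually large value.
typical = [9, 10, 11] * 10
readings = typical + [25]

# The large reading raises the sample mean and inflates the sample variance.
mean_without = statistics.mean(typical)
mean_with = statistics.mean(readings)
var_without = statistics.variance(typical)
var_with = statistics.variance(readings)

# Limits are computed from all readings, including the suspect one.
lower, upper = three_sigma_limits(readings)
flagged = [x for x in readings if x < lower or x > upper]

print("mean without / with:", mean_without, mean_with)
print("variance without / with:", var_without, var_with)
print("flagged as potential outliers:", flagged)
```

Because the limits are computed from data that include the suspect reading, the inflated standard deviation widens them. This is one reason a flagged point is treated as a potential outlier to investigate rather than a confirmed one.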
Outliers in an industrial process become more difficult to detect as the dimensionality of the data increases. Although an outlier may not stick out on the end of the data distribution for multiple variables, it will stick out somewhere.
For example, consider a set of bivariate data in which the outlying observation does stick out. Figure 2 is a scatterplot of the waist size and chest size of a random sample of 147 college students. Notice there are only 80 visible points in the plot because there are multiple observations at some points.
The two circled observations in Figure 2 are potential outliers. The first observation (27, 28) is marginally different from the others, but the second observation (47, 43) is definitely different from the others. Why does this latter observation stick out? Because it is the only observation in the data set in which the chest size of the college student is smaller than the waist size. It is interesting to note that a statistical control procedure using the T2 statistic to locate outliers designates the second observation as an outlier but does not designate the first observation as an outlier.
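The idea that a point can be unremarkable on each variable yet stick out jointly can be sketched with a small bivariate T2 calculation. The data below are invented (waist, chest) pairs, not the 147 observations in Figure 2, and t2_bivariate is a hypothetical helper that uses the ordinary sample mean and covariance. In this made-up set, only the last pair has a chest size smaller than the waist size.

```python
import statistics

def t2_bivariate(pairs):
    """T2 value for each (x, y) pair, based on the sample mean and covariance."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(pairs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = statistics.variance(xs)
    syy = statistics.variance(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    values = []
    for x, y in pairs:
        dx, dy = x - mx, y - my
        # Quadratic form d' S^-1 d written out for the 2x2 case.
        values.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return values

# Hypothetical (waist, chest) pairs; only the last has chest < waist.
sizes = [(28, 34), (30, 37), (32, 38), (34, 41), (36, 42), (38, 45),
         (29, 35), (31, 36), (33, 40), (35, 40), (40, 37)]

t2 = t2_bivariate(sizes)
largest = max(range(len(sizes)), key=lambda i: t2[i])
print("largest T2 belongs to", sizes[largest])
```

In this sketch, the waist of 40 and the chest of 37 are each within two standard deviations of their own means, but the pair breaks the positive relationship between the two variables. That is what the T2 value picks up.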
Different ways to go
The search for ways to identify the observations that stick out in a multidimensional data set has led to the development of many different statistical procedures for outlier detection.
One example is the procedure based on examining the data set in a subspace of the principal component space.1 Principal components are linear combinations of the original variables that are orthogonal to one another and are derived using either the correlation or covariance matrix of the data.
The first few principal components of such data are sensitive to changes in variation and covariation of the variables, while the last few principal components are sensitive to strong collinearities in the data. Reducing the dimensionality of the data by using only the first two or three principal components often allows an analyst to visually locate, in principal component plots, the outliers that contribute to variation problems.
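A minimal sketch of how principal components can be derived from the correlation or covariance matrix is given here, assuming NumPy is available. The function name principal_components and the simulated data are assumptions for illustration and are not taken from the referenced article. The third variable is built as a near-exact sum of the first two, so the last component should have an eigenvalue close to zero, reflecting the collinearity.

```python
import numpy as np

def principal_components(data, use_correlation=True):
    """Return (eigenvalues, eigenvectors, scores), largest eigenvalue first."""
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)
    if use_correlation:
        # Standardizing first makes the covariance below a correlation matrix.
        centered = centered / x.std(axis=0, ddof=1)
    matrix = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)  # ascending order
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    scores = centered @ eigenvectors  # coordinates of each observation
    return eigenvalues, eigenvectors, scores

# Simulated data (an assumption): the third column nearly equals the sum
# of the first two, which creates a strong collinearity.
rng = np.random.default_rng(0)
base = rng.normal(size=(40, 2))
third = base[:, 0] + base[:, 1] + 0.05 * rng.normal(size=40)
sample = np.column_stack([base, third])

eigenvalues, eigenvectors, scores = principal_components(sample)
explained = eigenvalues / eigenvalues.sum()
print("proportion of variation explained:", np.round(explained, 4))
```

Plotting the first two or three columns of scores against one another gives the kind of principal component plot described above.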
Another popular outlier detection procedure uses a control chart of the T2 statistic and designates points with T2 values outside the control limits as outliers.2 When the sample data contain clusters of outliers, however, this statistic is subject to masking and swamping problems. Both problems arise mainly when outlying observations cluster on the fringe of a data swarm.
Swamping occurs when the cluster pulls the data swarm toward it. In doing so, non-outlying observations on the fringe, opposite the cluster, will appear farther from the data swarm and be designated as potential outliers. Masking occurs when the cluster of outliers pulls the data swarm toward it and inflates the estimates of the mean and covariance parameters in the direction of the cluster, so individual observations within the cluster do not show up as outliers.
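Masking can be sketched numerically by comparing the T2 value of one outlier added alone with the T2 values of five similar outliers added together. Everything here is an assumption for illustration: the data are simulated with NumPy, the shift of six units on three of four variables is arbitrary, and t2_values is a hypothetical helper based on the common estimates of the mean vector and covariance matrix.

```python
import numpy as np

def t2_values(data):
    """T2 value of each row, using the sample mean vector and covariance matrix."""
    x = np.asarray(data, dtype=float)
    diffs = x - x.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(x, rowvar=False))
    # Row-wise quadratic form d_i' S^-1 d_i.
    return np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)

rng = np.random.default_rng(1)
clean = rng.normal(size=(50, 4))             # simulated in-control data
shift = np.array([-6.0, 0.0, -6.0, -6.0])    # arbitrary outlier location

# Case 1: a single outlier added to the clean data.
single = np.vstack([clean, shift])
# Case 2: a tight cluster of five outliers near the same location.
cluster = np.vstack([clean, shift + 0.2 * rng.normal(size=(5, 4))])

t2_single = t2_values(single)[-1]
t2_cluster = t2_values(cluster)[-5:]
print("T2 of the single outlier:", round(float(t2_single), 2))
print("T2 of the clustered outliers:", np.round(t2_cluster, 2))
```

The clustered outliers sit at essentially the same location as the single outlier, yet their T2 values are much smaller because the cluster shifts the mean estimate and inflates the covariance estimate in its own direction.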
Some of the more recent procedures for detecting multivariate outliers include those based on the use of robust estimators. Such outlier detection schemes are not subject to the masking and swamping problems that can plague methods based on common estimators.
And outliers do not have the same ill effects on the robust estimators as they do on the common estimators. Robust estimates of the variances, means and correlations of the related variables, however, may be far removed from the true values of these parameters. In addition, many of these procedures are of limited practical use because they can be computationally intensive.
Better detection schemes
A preliminary step that would improve these outlier detection schemes is to follow the procedure recommended when creating a multivariate statistical control procedure for an industrial process.3
In a preliminary data analysis (that is, a phase I analysis) of such a procedure, a set of data is obtained under good operational conditions as judged by the process engineer. The data set then is subjected to a detailed data analysis from numerous perspectives. Charts, graphs and plots are used to locate unusual patterns and clusters in the data set. When these occur, irregularities are investigated for cause. This type of detailed data analysis will remove many of the outliers and data abnormalities that could be difficult for classical statistical procedures to detect.
To illustrate this approach, consider a preliminary data set consisting of 55 observations on four variables. Suppose no detailed data analysis has been performed to search for data abnormalities. With α = 0.01, we use the T2 statistic based on the common estimates of the mean vector and covariance matrix to scan for potential outliers. Figure 3 shows the T2 control chart.
Because none of the T2 values signal in the chart in Figure 3, you might conclude that no outliers are present in the data. If we had plotted the data using the first three principal components of the correlation matrix and scanned the resulting plot, however, a completely different conclusion would result. The first three principal components explain more than 98% of the total variation present in this data set. A principal component plot for these three components is shown in Figure 4. The data swarm is enclosed in a 99% ellipsoid that corresponds to α = 0.01 and is equivalent to the T2 chart in Figure 3.
Although no observation is outside the T2 ellipsoid in Figure 4, two different data clusters are clearly evident in the plot. The larger cluster (the blue points) consists of the first 50 observations plotted in the T2 chart in Figure 3. The smaller cluster (the red points) contains the last five observations located on the T2 chart. The outlying observations in the smaller cluster were added to an original data set of 50 observations (with no outliers) to illustrate how a cluster of outliers can mask the performance of the T2 statistic. A detailed data analysis based on use of the principal component plot in Figure 4 would have helped the analyst quickly spot this cluster of potential outliers.
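The separation of a masked cluster along the leading principal component can be sketched with simulated data. The 50 clean observations, the five shifted observations and the size of the shift are all assumptions and are not the data behind Figures 3 and 4. The sketch standardizes the data, takes the eigenvectors of the correlation matrix and compares the scores of the two groups on the first principal component.

```python
import numpy as np

rng = np.random.default_rng(1)
clean = rng.normal(size=(50, 4))  # simulated in-control observations
# Five simulated outliers, shifted on variables 1, 3 and 4 only.
outliers = np.array([-6.0, 0.0, -6.0, -6.0]) + 0.2 * rng.normal(size=(5, 4))
data = np.vstack([clean, outliers])

# Principal components of the correlation matrix.
standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
eigenvalues, eigenvectors = np.linalg.eigh(np.corrcoef(data, rowvar=False))
first = eigenvectors[:, -1]  # eigh sorts ascending, so the last column is PC1
pc1 = standardized @ first

clean_scores, outlier_scores = pc1[:50], pc1[50:]
# The sign of an eigenvector is arbitrary, so check both directions.
separated = (outlier_scores.max() < clean_scores.min()
             or outlier_scores.min() > clean_scores.max())

print("PC1 range, first 50 observations:",
      round(float(clean_scores.min()), 2), round(float(clean_scores.max()), 2))
print("PC1 range, last 5 observations:",
      round(float(outlier_scores.min()), 2), round(float(outlier_scores.max()), 2))
print("groups separated on PC1:", separated)
```

A scatterplot of the first two or three score columns, with the last five observations highlighted, would show the same gap visually.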
These five observations would also have been detected as potential outliers if a time-sequence plot of each of the four variables had been examined. These plots are shown in Figure 5. Notice the last five observations for variables x1, x3 and x4 all have much lower values than the rest of the observations on each variable. These three variables are the ones that dominate the first three principal components. In contrast, the last five observations on variable x2 do not show any such change from the rest of the observations. This variable loads the heaviest on the fourth principal component, which has little influence on the results.
If you examine a plot of the data using the last three principal components, there is no evidence of separation of the small cluster of outliers. This is shown in Figure 6. Notice the 50 observations (the blue points) in the large cluster now overlap with the five observations (the red points) in the small cluster.
Recall that plots of the first few principal components are most sensitive to changes in the variation and covariation among the variables. The data separation seen in the principal component plot in Figure 4 is caused by the influence of the five outliers associated with variables x1, x3 and x4 on the overall variation based on the first principal component. As shown in Figure 6, these outliers do not have such an effect on the overall variation based on the last three principal components.
1. Robert L. Mason and John C. Young, "Multivariate Tools: Principal Component Analysis," Quality Progress, February 2005, pp. 83-85.
2. Robert L. Mason and John C. Young, "Another Data Mining Tool," Quality Progress, February 2003, pp. 76-79.
3. Robert L. Mason and John C. Young, Multivariate Statistical Process Control With Industrial Applications, ASA-SIAM, 2002.
Robert L. Mason is an institute analyst at Southwest Research Institute in San Antonio, TX. He has a doctorate in statistics from Southern Methodist University in Dallas and is a fellow of ASQ and the American Statistical Association.
John C. Young is a retired professor of statistics from McNeese State University in Lake Charles, LA. He received a doctorate in statistics from Southern Methodist University.