Influence and Effect
Know how to measure the effect a data point has on a statistic
by Robert L. Mason and John C. Young
Being able to determine the effect a data point has on summary statistics provides useful insight into the construction of better parameter estimators.
Consider, for example, a person with an annual income of more than $1 million in a room with nine others with annual incomes in the $25,000 to $50,000 range. The average income for this group of 10 is greater than $100,000, but this is not a useful summary statistic of the typical income of most members of the group.
From this example, you observe that the inclusion of an observation far removed from the bulk of the observations in a sample can have a great effect on the estimated overall mean. If included, the outlying observation actually pulls the average value of the group toward it.
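The arithmetic of the income example is easy to verify. A minimal sketch (the nine incomes below are hypothetical values within the stated $25,000 to $50,000 range):

```python
# Nine hypothetical incomes in the $25,000-$50,000 range, plus one outlier.
incomes = [25_000, 30_000, 32_000, 35_000, 38_000,
           40_000, 42_000, 45_000, 50_000]
outlier = 1_000_000

# Mean of the nine typical incomes vs. the mean with the outlier included.
mean_without = sum(incomes) / len(incomes)
mean_with = (sum(incomes) + outlier) / (len(incomes) + 1)

print(f"mean without outlier: {mean_without:,.0f}")  # 37,444
print(f"mean with outlier:    {mean_with:,.0f}")     # 133,700
```

A single outlying income pulls the group average from about $37,000 to well above $100,000, exactly as described.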
In another example, consider the group of bivariate observations contained in the circle in Figure 1. Such a circular region of data points indicates that the correlation between the two variables, x1 and x2, is close to zero; that is, no linear relationship exists between the two variables.
Observe the labeled point in the upper right-hand corner of the plot but outside of the circle. As the distance between this point and the mean of the group of points in the circle increases along the drawn 45° line, the correlation between the two variables will increase and approach its maximum value of one. Thus, this single outlying observation can distort the estimated value of the true correlation.
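This distortion of the correlation is simple to reproduce by simulation. The sketch below (the sample size, seed and outlier location are illustrative choices, not from the article) builds a circular cloud of independent points and then adds one point far out along the 45° line:

```python
import numpy as np

rng = np.random.default_rng(0)

# A roughly circular cloud: two independent standard normal variables,
# so the true correlation between them is zero.
x1 = rng.standard_normal(200)
x2 = rng.standard_normal(200)
r_without = np.corrcoef(x1, x2)[0, 1]

# Add a single outlying point far out along the 45-degree line.
x1_out = np.append(x1, 10.0)
x2_out = np.append(x2, 10.0)
r_with = np.corrcoef(x1_out, x2_out)[0, 1]

# The lone outlier pulls the estimated correlation well away from zero.
print(round(r_without, 2), round(r_with, 2))
```

Moving the added point farther out along the diagonal drives the sample correlation closer and closer to one.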
The variance, σ², of a variable x is defined as the average of the squared deviation of that variable from its population mean, µ. Consequently, the square of the distance that an outlying observation is from its mean, that is, (x – µ)², can have a great impact on the estimated value of the variance parameter.
For example, including the outlying point in Figure 1 with the circular group of points in the plot will increase the variances of x1 and x2. This occurs because the outlying point causes the data to be spread wider in both dimensions.
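The same simulation idea shows the effect on the variances. In this sketch (again with illustrative sample size, seed and outlier coordinates), one far-out point inflates the variance estimates in both dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Circular cloud of bivariate points (as in Figure 1), plus one far-out point.
cloud = rng.standard_normal((200, 2))
outlier = np.array([[6.0, 6.0]])

# Sample variances of x1 and x2, without and with the outlying point.
var_without = cloud.var(axis=0, ddof=1)
var_with = np.vstack([cloud, outlier]).var(axis=0, ddof=1)

print(var_without, var_with)
```

The large squared deviation of the outlier from each mean is what drives both variance estimates upward.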
In two dimensions, scatter plots can be constructed to show how one or more data points can change the estimates of the means, the variances and the correlation coefficient between the variables.
For example, Figure 2 contains four observations, labeled A, B, C and D, which are removed from the bulk of the data enclosed in the ellipse. The inclusion of points A or C will not affect the correlation coefficient because both support the linear trend in the data.
Including point A, however, will increase the variances and decrease the means of x1 and x2, while including point C will increase the variances and the means of x1 and x2. The inclusion of points B or D will affect the correlation coefficient between the two variables because both lie in directions opposite to the linear trend of the data. In addition, including point B will decrease the mean but increase the variance of x1, while including point D will increase the mean and variance of x1.
The influence function
Several mathematical procedures exist for determining the effect an observation has on these particular estimates. One popular procedure1 used in developing robust statistical estimators is based on developing an influence curve or influence function for use as a measure of the effect that an observation has on the parameter being estimated.
When applied to the mean, the influence function is exactly what you would expect: a measure of the difference between an observation and the mean, x – µ. Likewise, when applied to the variance, the influence function is the squared distance between an observation and the mean, corrected for the variance itself: (x – µ)² – σ².
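These two influence functions translate directly into code. A minimal sketch:

```python
def influence_mean(x, mu):
    """Influence of observation x on the mean: x - mu."""
    return x - mu

def influence_variance(x, mu, sigma2):
    """Influence of observation x on the variance: (x - mu)^2 - sigma^2."""
    return (x - mu) ** 2 - sigma2

# An observation far from the mean has a large influence on both estimates,
# and the influence on the variance grows with the square of the distance.
print(influence_mean(10.0, 0.0))           # 10.0
print(influence_variance(10.0, 0.0, 1.0))  # 99.0
```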
When applied to the sample correlation coefficient, r, between two variables, x1 and x2, the contours of the influence function are a set of hyperbolae given by the formula

y1y2 – (r/2)(y1² + y2²) = c,     (1)

in which y1 and y2 are the studentized values of x1 and x2, and c is a chosen constant value. The selection of the value of c for drawing these contours is arbitrary (it is typically chosen so the contours exclude the bulk of the points), but it nevertheless serves to identify observations removed from the data swarm.
Superimposing these hyperbolic contours over the corresponding scatter plot for y1 and y2 allows you to determine which observations have the greatest effect on the estimate of the correlation coefficient. Points inside the hyperbolae will have influence function values greater than +c or less than –c. Points outside the hyperbolae will have influence function values between –c and +c. Figure 3 shows an example of these contours for a case in which c = ±2.7 and r = 0.81.
Detailed procedures exist for interpreting the data points in relation to the contour plots.2,3 Those points located on the side of the data swarm but inside the hyperbolae, such as point A in Figure 3, will decrease the value of the correlation coefficient. Those points located within the hyperbola on the ends of the data swarm, such as points B and C in Figure 3, will increase the correlation coefficient.
Influence functions also can be used for detecting outliers in a bivariate sample.4 For example, any point located within the hyperbolae, such as points A, B and C in Figure 3, is subject to removal. In addition, the influence function value in (1) can be computed for any other observation in the sample.
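This screening rule is straightforward to implement. A minimal sketch using the influence function for r, y1y2 – (r/2)(y1² + y2²), evaluated at studentized coordinates:

```python
def influence_corr(y1, y2, r):
    """Influence function of the correlation coefficient at studentized (y1, y2)."""
    return y1 * y2 - (r / 2.0) * (y1 ** 2 + y2 ** 2)

def inside_hyperbolae(y1, y2, r, c=2.7):
    """True if the point lies inside the contours, i.e., |influence| exceeds c."""
    return abs(influence_corr(y1, y2, r)) > c

# A point off to the side of a positively correlated swarm has a large
# negative influence (it decreases r), so it falls inside the hyperbolae.
print(influence_corr(-2.0, 2.0, 0.81))   # -7.24
print(inside_hyperbolae(-2.0, 2.0, 0.81))

# A point in the middle of the swarm has a small influence value.
print(inside_hyperbolae(1.0, 1.0, 0.81))
```

Points on the ends of the swarm yield large positive values (they increase r); points off to the side yield large negative values (they decrease r), matching the interpretation of Figure 3.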
Influence function example
Figure 4 contains the scatter plot of 212 observations selected at random from a bivariate normal distribution in which the variables are standardized with a correlation coefficient of 0.812. In this form, the correlation is the same as the covariance between the two variables. These observations are represented in Figure 4 by the points within the ellipse, excluding the one labeled point 2.
For illustrative purposes, two additional observations, point 1 with coordinates (-2, 2) and point 2 with coordinates (1, 1), have been added to the plot in Figure 4. Point 1 is outside the data swarm and inside the hyperbolae, indicating it could be an outlier and could possibly have influence on the computation of sample statistics.
This is verified in Table 1 by comparing the sample correlation coefficient values obtained with and without this point (while ignoring point 2). Including point 1 decreases the pairwise correlation between the two variables from 0.812 to 0.776. In addition, Table 1 includes the effect on the sample means and variances of y1 and y2. For y1, the absolute value of the mean increases and the standard deviation slightly increases when point 1 is included. For y2, the absolute value of the mean decreases, but the standard deviation slightly increases when point 1 is included.
Including point 1 in the original sample changes the correlation coefficient to r = 0.776. The point's coordinates, (-2, 2), and this correlation coefficient are all that is needed to compute the value of its influence function using equation (1). The computed value is -7.1, which is less than the chosen value of –c = -2.7. This result independently confirms what is shown in Figure 4, namely that point 1 is inside the hyperbolae and has a decreasing effect on the correlation coefficient.
In contrast, point 2, with coordinates (1, 1), is contained within the data swarm (elliptical region) and outside of the hyperbolae in Figure 4. Thus, this point should have minimal effect on the sample estimates. This is confirmed when examining the results in Table 2.
When the summary statistics and the correlations are computed with and without this point (while excluding point 1), small differences are noted in all of the statistics. These results also are confirmed by the small value of the influence function for this point. The computed value using the earlier equation is 0.2, which is between –c = -2.7 and +c = +2.7, and is therefore outside the hyperbolae in Figure 4.
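Both reported influence values can be reproduced directly from the influence function for r:

```python
def influence_corr(y1, y2, r):
    # Influence function for the correlation coefficient at studentized (y1, y2).
    return y1 * y2 - (r / 2.0) * (y1 ** 2 + y2 ** 2)

# Point 1 at (-2, 2), with r = 0.776 when the point is included:
print(round(influence_corr(-2.0, 2.0, 0.776), 1))  # -7.1

# Point 2 at (1, 1), with r = 0.812:
print(round(influence_corr(1.0, 1.0, 0.812), 1))   # 0.2
```

The first value falls well below –c = -2.7, flagging point 1 as influential, while the second lies comfortably between –c and +c.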
As can be seen from these examples, the influence function of a statistic is a key component in robust estimation because it helps you assess the influence that an observation has on the estimation of a statistic. It also is important in detecting outliers in the data.
References

1. Frank R. Hampel, "The Influence Curve and its Role in Robust Estimation," Journal of the American Statistical Association, 1974, pp. 383–393.
2. Susan J. Devlin, Ramanathan Gnanadesikan and John R. Kettenring, "Robust Estimation and Outlier Detection With Correlation Coefficients," Biometrika, 1975, pp. 531–545.
3. Michael R. Chernick, "The Influence Function and its Application to Data Validation," American Journal of Mathematical and Management Sciences, 1982, pp. 263–288.
4. Chernick, "The Influence Function and its Application to Data Validation," see reference 3.
Robert L. Mason is an institute analyst at Southwest Research Institute in San Antonio. He has a doctorate in statistics from Southern Methodist University in Dallas and is a fellow of ASQ and the American Statistical Association.
John C. Young is a retired statistics professor at McNeese State University in Lake Charles, LA. He received a doctorate in statistics from Southern Methodist University.