## 2019

STATISTICS ROUNDTABLE

# Influence and Effect

## Know how to measure the effect a data point has on a statistic

by Robert L. Mason and John C. Young

Being able to determine the effect a data point has on summary statistics provides useful insight into the construction of better parameter estimators.

Consider, for example, a person with an annual income of more than $1 million in a room with nine others with annual incomes in the $25,000 to $50,000 range. The average income for this group of 10 is greater than $100,000, but this is not a useful summary statistic of the typical income of most members of the group.

From this example, you observe that the inclusion of an observation far removed from the bulk of the observations in a sample can have a great effect on the estimated overall mean. If included, the outlying observation actually pulls the average value of the group toward it.
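
The arithmetic is easy to sketch (the income figures below are hypothetical, chosen only to match the ranges in the example):

```python
# Hypothetical incomes: nine people in the $25,000-$50,000 range
# and one person earning well over $1 million.
incomes = [25_000, 28_000, 32_000, 35_000, 38_000, 41_000, 44_000, 47_000, 50_000]
outlier = 1_200_000

mean_without = sum(incomes) / len(incomes)
mean_with = (sum(incomes) + outlier) / (len(incomes) + 1)

print(f"without the outlier: ${mean_without:,.0f}")  # under $40,000
print(f"with the outlier:    ${mean_with:,.0f}")     # over $100,000
```

One extreme value is enough to move the group average far above every typical member's income.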

In another example, consider the group of bivariate observations contained in the circle in Figure 1. Such a circular region of data points indicates that the correlation between the two variables, *x _{1}* and *x _{2}*, is close to zero; that is, no linear relationship exists between the two variables.

Observe the labeled point in the upper right-hand corner of the plot but outside of the circle. As the distance between this point and the mean of the group of points in the circle increases along the drawn 45° line, the correlation between the two variables will increase and approach its maximum value of one. Thus, this single outlying observation can distort the estimated value of the true correlation.
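
A short simulation (with synthetic data, not the article's) illustrates this effect: a circular cloud of independent observations has correlation near zero, and a single point pushed out along the 45° line inflates the estimate toward one.

```python
import numpy as np

rng = np.random.default_rng(0)
# A roughly circular cloud: two independent standard normal variables,
# so the true correlation is zero.
x1 = rng.standard_normal(100)
x2 = rng.standard_normal(100)
r_base = np.corrcoef(x1, x2)[0, 1]

# Move a single extra point farther out along the 45-degree line.
r_with_outlier = []
for d in (5.0, 10.0, 20.0):
    r = np.corrcoef(np.append(x1, d), np.append(x2, d))[0, 1]
    r_with_outlier.append(r)

print(round(r_base, 3), [round(r, 3) for r in r_with_outlier])
```

The farther out the single point sits, the closer the sample correlation climbs toward one.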

The variance, σ^{2},
of a variable *x* is defined as the average of the squared
deviation of that variable from its population mean, µ. Consequently, the
square of the distance that an outlying observation is from its mean, that is,
(*x*
– µ)^{2}, can have a great impact
on the estimated value of the variance parameter.
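
A quick sketch with hypothetical measurements shows how a single squared deviation, (*x* – µ)^{2}, can dominate the variance estimate:

```python
import numpy as np

data = np.array([4.8, 5.1, 4.9, 5.0, 5.2, 5.1, 4.9, 5.0])  # hypothetical measurements
outlier = 9.0

var_without = np.var(data)                 # divide-by-n (population-style) variance
augmented = np.append(data, outlier)
var_with = np.var(augmented)

# The outlier's single squared deviation dwarfs the original variance.
sq_dev = (outlier - augmented.mean()) ** 2
print(round(float(var_without), 3), round(float(var_with), 3), round(float(sq_dev), 2))
```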

For example, including the outlying point
in Figure 1 with the circular group of points in the plot will increase the
variances of *x _{1}* and *x _{2}*. This occurs because the outlying point causes the data to be spread wider in both dimensions.

In two dimensions, scatter plots can be constructed to show how one or more data points can change the estimates of the means, the variances and the correlation coefficient between the variables.

For example, Figure 2 contains four observations, labeled A, B, C and D, which are removed from the bulk of the data enclosed in the ellipse. The inclusion of points A or C will not affect the correlation coefficient because both support the linear trend in the data.

Including point A, however, will increase
the variances and decrease the means of *x _{1}* and *x _{2}*, while including point C will increase the variances and the means of *x _{1}* and *x _{2}*. The inclusion of points B or D will affect the correlation coefficient between the two variables because both lie in directions opposite of the linear trend of the data. In addition, including point B will decrease the mean but increase the variance of *x _{1}*, while including point D will increase the mean and variance of *x _{2}*.

### The influence function

Several
mathematical procedures exist for determining the effect an observation has on
these particular estimates. One popular procedure^{1} used in
developing robust statistical estimators is based on developing an influence
curve or influence function for use as a measure of the effect that an
observation has on the parameter being estimated.

When applied to the mean, the influence
function is exactly what you would expect: a measure of the difference between
an observation and the mean, *x* – µ. Likewise, when applied to the variance, you obtain the expected
answer that the influence function is based on the squared distance between an
observation and the mean: (*x* – µ)^{2} – σ^{2}.
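
These two influence functions can be checked numerically. For a large sample, adding one observation *x* shifts the estimate by roughly its influence value divided by the sample size, so multiplying the shift by the sample size should recover the influence function. A sketch, using simulated data with µ = 0 and σ^{2} = 1:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
sample = rng.normal(0.0, 1.0, size=n)  # population: mu = 0, sigma^2 = 1

x = 3.0                        # the added observation
if_mean = x - 0.0              # influence function of the mean: x - mu
if_var = (x - 0.0) ** 2 - 1.0  # influence function of the variance: (x - mu)^2 - sigma^2

# Adding one observation to a large sample shifts the estimate by about IF(x)/n.
shift_mean = np.append(sample, x).mean() - sample.mean()
shift_var = np.var(np.append(sample, x)) - np.var(sample)

print(round(shift_mean * (n + 1), 2), if_mean)  # both close to 3
print(round(shift_var * (n + 1), 2), if_var)    # both close to 8
```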

When applied to the sample correlation
coefficient, *r*, between two variables, *x _{1}* and *x _{2}*, the contours of the influence function are a set of hyperbolae given by the formula

*y _{1}* *y _{2}* – (*r* / 2)(*y _{1}*^{2} + *y _{2}*^{2}) = ±*c*,   (1)

in which *y _{1}* and *y _{2}* are the studentized values of *x _{1}* and *x _{2}*, and *c* is a chosen constant value. The selection of the value of *c* for drawing these contours is arbitrary (and chosen to include the bulk of the points), but nevertheless serves to identify observations removed from the data swarm.

Superimposing these hyperbolic contours
over the corresponding scatter plot for *y _{1}* and *y _{2}* allows you to determine which observations are having the greatest effect on the estimate of the correlation coefficient. Points inside the hyperbolae will have influence function values greater than +*c* or less than –*c*. Points outside the hyperbolae will have influence function values between –*c* and +*c*. Figure 3 shows an example of these contours for a case in which *c* = 2.7 and *r* = 0.81.

Detailed procedures exist for interpreting
the data points in relation to the contour plots.^{2,3}
Those points located on the side of the data swarm but inside the hyperbolae,
such as point A in Figure 3, will decrease the value of the correlation
coefficient. Those points located within the hyperbola on the ends of the data
swarm, such as points B and C in Figure 3, will increase the correlation
coefficient.

Influence functions also can be used for
detecting outliers in a bivariate sample.^{4}
For example, any point located within the hyperbolae, such as points A, B and C
in Figure 3, is subject to removal. In addition, the influence function value
in (1) can be computed for any other observation in the sample.
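
A minimal screening sketch follows, using synthetic data. The influence-function form *y _{1}* *y _{2}* – (*r* / 2)(*y _{1}*^{2} + *y _{2}*^{2}) for studentized variables is assumed here; it is consistent with the influence values reported in the example that follows.

```python
import numpy as np

def corr_influence(y1, y2, r):
    # Assumed influence function for the correlation coefficient of
    # studentized variables; its +/- c contours are the hyperbolae above.
    return y1 * y2 - (r / 2.0) * (y1 ** 2 + y2 ** 2)

# Synthetic correlated data plus one off-trend point (like B or D in Figure 2).
rng = np.random.default_rng(2)
z = rng.standard_normal(200)
x1 = np.append(z + 0.4 * rng.standard_normal(200), -2.5)
x2 = np.append(z + 0.4 * rng.standard_normal(200), 2.5)

# Studentize, then flag points whose influence on r exceeds the cutoff c = 2.7.
y1 = (x1 - x1.mean()) / x1.std()
y2 = (x2 - x2.mean()) / x2.std()
r = np.corrcoef(y1, y2)[0, 1]
flags = np.abs(corr_influence(y1, y2, r)) > 2.7

print(round(float(r), 3), int(flags.sum()), bool(flags[-1]))  # off-trend point is flagged
```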

### Influence function example

Figure 4 contains the scatter plot of 212 observations selected at random from a bivariate normal distribution in which the variables are standardized with a correlation coefficient of 0.812. In this form, the correlation is the same as the covariance between the two variables. These observations are represented in Figure 4 by the points within the ellipse, excluding the one labeled point 2.

For illustrative purposes, two additional observations, point 1 with coordinates (-2, 2) and point 2 with coordinates (1, 1), have been added to the plot in Figure 4. Point 1 is outside the data swarm and inside the hyperbolae, indicating it could be an outlier and could possibly have influence on the computation of sample statistics.

This is verified in Table 1 by comparing
the sample correlation coefficient values obtained with and without this point
(while ignoring point 2). Including point 1 decreases the pairwise correlation between the two variables from 0.812 to 0.776. In addition, Table 1
includes the effect on the sample means and variances of *y _{1}* and *y _{2}*. For *y _{1}*, the absolute value of the mean increases and the standard deviation slightly increases when point 1 is included. For *y _{2}*, the absolute value of the mean decreases, but the standard deviation slightly increases when point 1 is included.

The coordinates of point 1 are (-2, 2).
Including this point in the original sample, the value of the correlation
coefficient is *r* = 0.776. These coordinates and the
correlation coefficient are needed to compute the value of the influence
function for point 1 using the earlier equation. The computed value is -7.1,
which is less than the chosen value of *c* = -2.7. This result
independently confirms what we see in Figure 4, namely that point 1 is inside
the hyperbolae and has a decreasing effect on the correlation coefficient.

In contrast, point 2, with coordinates (1, 1), is contained within the data swarm (elliptical region) and outside of the hyperbolae in Figure 4. Thus, this point should have minimal effect on the sample estimates. This is confirmed when examining the results in Table 2.

When the summary statistics and the
correlations are computed with and without this point (while excluding point
1), small differences are noted in all of the statistics. These results also
are confirmed by the small value of the influence function for this point. The
computed value using the earlier equation is 0.2, which is between –*c*
= -2.7 and +*c* = +2.7, and is therefore outside the
hyperbolae in Figure 4.
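
Both influence values can be reproduced from the coordinates and correlations reported above. The functional form used below is an assumption, chosen because it is consistent with those reported values.

```python
def corr_influence(y1, y2, r):
    # Assumed influence-function form for the correlation coefficient of
    # studentized variables; it reproduces both values reported in the text.
    return y1 * y2 - (r / 2.0) * (y1 ** 2 + y2 ** 2)

# Point 1 at (-2, 2), with r = 0.776 from the sample that includes it:
print(round(corr_influence(-2.0, 2.0, 0.776), 1))  # -7.1, well below -c = -2.7

# Point 2 at (1, 1), with r = 0.812 from the sample that excludes point 1:
print(round(corr_influence(1.0, 1.0, 0.812), 1))   # 0.2, between -2.7 and +2.7
```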

As can be seen from these examples, the influence function of a statistic is a key component in robust estimation because it helps you assess the influence that an observation has on the estimation of a statistic. It also is important in detecting outliers in the data.

### References

1. Frank R. Hampel, "The Influence Curve and its Role in Robust Estimation," *Journal of the American Statistical Association*, 1974, pp. 383–393.
2. Susan J. Devlin, Ramanathan Gnanadesikan and John R. Kettenring, "Robust Estimation and Outlier Detection With Correlation Coefficients," *Biometrika*, 1975, pp. 531–545.
3. Michael R. Chernick, "The Influence Function and its Application to Data Validation," *American Journal of Mathematical and Management Sciences*, 1982, pp. 263–288.
4. Chernick, "The Influence Function and its Application to Data Validation," see reference 3.

**Robert L. Mason** is an institute analyst at Southwest
Research Institute in San Antonio. He has a doctorate in statistics from
Southern Methodist University in Dallas and is a fellow of ASQ and the American
Statistical Association.

**John C. Young** is a retired statistics professor at McNeese State University in Lake Charles, LA. He received a
doctorate in statistics from Southern Methodist University.
