More is Not Always Better
Quality and lower dependence trump quantity
By Christine M. Anderson-Cook
Having more data is better, statisticians often say. But, as with much in life, the devil is in the details when it comes to interpreting such a statement.
All things being equal, if offered a choice between small or large sample sizes, the larger sample size is preferred. Or is it?
What do we mean by "all things being equal"? In this case, the larger sample is still a representative sample of the same population we are trying to characterize. Clearly, having a larger sample that was taken from a different population or consisting of data from several different populations would be a poor alternative to the smaller sample taken from the "right" population.
For example, if I were trying to understand the voting preferences of New Mexico voters, and my choices were a sample of 100 New Mexico voters or 1,000 nationwide voters, the answer is simple. Having the right data is more important than having a big sample size.
Not always quantity
There are other ways more data can be undesirable. Consider a situation I encountered in the last few months:
An engineer asked me to help with a multiple linear regression problem. She was interested in constructing a model to predict the value of a response—strength of a material—as a function of age and usage.
The data were from an observational study collected from a range of ages. Obtaining usage values involved time-consuming extraction of information from a database. Initial examination of the data showed relatively high correlation among all variables (strength, age and usage) as shown in Figure 1.
High correlation between the response and each of the explanatory variables (-0.845 for strength with age and -0.838 for strength with usage) indicates that we could hope to predict the response well with knowledge of the explanatory variable values. On the other hand, the correlation between the two explanatory variables was found to be 0.988, which we will see can potentially cause problems.
Recall that correlation is a measure of the strength of the linear relationship between two variables in which a value of -1 or +1 indicates a deterministic relationship, and zero indicates no linear relationship at all. The sign of the correlation tells you whether the variables change in the same direction (as age increases, usage also tends to increase) or in opposite directions (as age increases, strength tends to decrease).
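These pairwise correlations are easy to check numerically. Below is a small Python/NumPy sketch; because the article's data set is not reproduced here, the data are synthetic stand-ins chosen only to mimic the correlation structure described above (usage tracking age closely, strength declining with both):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Synthetic stand-ins for the article's data: usage tracks age almost
# deterministically, and strength declines with age plus noise.
age = rng.uniform(1, 7, size=n)
usage = 3.5 * age + rng.normal(0, 0.5, size=n)
strength = 30.0 - 2.0 * age + rng.normal(0, 1.0, size=n)

# Pairwise correlation matrix; columns are (strength, age, usage).
corr = np.corrcoef(np.column_stack([strength, age, usage]), rowvar=False)
print(corr.round(3))
```

The off-diagonal entries play the roles of the -0.845, -0.838 and 0.988 values discussed above.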
The primary goal of the study was to predict strength for different combinations of age and usage. A natural model, given the relationships observed in the pairwise scatterplots, might be to model strength as a first-order model with age and usage as explanatory variables. The form of this standard multiple linear regression model would be:
Strength = β0 + β1Age + β2Usage + ε
in which β0, β1 and β2 are the model parameters to be estimated using least squares, and ε is the error term.
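As a sketch of how such a fit could be carried out (again on synthetic stand-in data, since the article's data are not available), ordinary least squares reduces to a single linear solve:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Synthetic stand-ins: usage is nearly a linear function of age, and the
# true strength model here involves age only.
age = rng.uniform(1, 7, size=n)
usage = 3.5 * age + rng.normal(0, 0.5, size=n)
strength = 30.0 - 2.0 * age + rng.normal(0, 1.0, size=n)

# Design matrix [1, Age, Usage] for Strength = b0 + b1*Age + b2*Usage + error.
X = np.column_stack([np.ones(n), age, usage])
beta, *_ = np.linalg.lstsq(X, strength, rcond=None)

# With near-collinear predictors, b1 and b2 individually are unstable, but
# the combined slope along the collinear direction (b1 + 3.5*b2, since
# usage ≈ 3.5*age in this synthetic data) is estimated precisely.
print(beta.round(3))
```

Rerunning with a different seed can move b1 and b2 substantially in opposite directions while barely changing the fitted values over the observed data — a preview of the estimation problems discussed next.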
First, we consider the impact of the correlation on the estimation of model parameters. In a perfect world, the most precise estimation of β1 and β2 would be obtained if age and usage were independent of each other. Independence implies the correlation between age and usage is zero, and that knowing the observed value of age does not provide information about the corresponding value of usage.
Just as with a designed experiment, such as a factorial experiment, we want to choose combinations of factors that are orthogonal to each other so the factor effects can be estimated independently of each other.1
In this example, we have a very specific idea about what the value of usage would be given a particular age. If we observed an age value of four, it would be quite surprising if the usage value was outside the range (12, 18). Similarly, if we observed a usage value of 15, we would expect age to be in the range (3.5, 4.5).
Table 1 shows a summary of the regression model, including age and usage as explanatory variables. Table 2 shows the corresponding summary for the model with age as the only covariate. We also could have produced a similar table with usage as the only covariate, but because this did not predict strength as well as the age-only model, I have omitted it.
Note that R² (75.22% vs. 75.09%) and adjusted R² (74.35% vs. 74.66%) are essentially the same for both models, indicating the age-only model explains similar amounts of variation in the response2 as the model with age and usage. One of the first differences we note is that the test of hypothesis for age (and also usage) in Table 1 indicates there is not enough evidence to reject the null hypothesis that the coefficient is equal to zero. In the model with age only, however, the coefficient is highly significant.
It is important to note the tests are fundamentally different:
- In the age-only model, the test of hypothesis is evaluating whether age is beneficial for explaining the variation of the response.
- In the model with both explanatory variables, we are testing if age is important, conditional on usage already being included in the model.
These are two very different tests, especially because we already have seen that knowing usage gives us specific information about likely values of age. Another important note is that the standard errors of the coefficients for age are different by an order of magnitude: 1.1538 for the model with age and usage, and 0.1505 for the age-only model.
The precision by which we can estimate the model parameters is a function of how correlated the explanatory variables are to one another. The variance inflation factor (VIF)3 measures the severity of multicollinearity, or how much the variance of coefficients is increased because of correlation between the explanatory variables.
In this case, the VIF for age is 41.8. That means the variance of the coefficient β1 in the model with age and usage is nearly 42 times as large as what we would have observed if age and usage had been independent. Clearly, we are paying a very large price with the precision of our estimation because of the high correlation between explanatory variables.
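Because there are only two explanatory variables here, the VIF can be checked by hand directly from the reported correlation:

```python
# With exactly two explanatory variables, the VIF has a closed form:
# VIF = 1 / (1 - r^2), where r is the correlation between the predictors.
r = 0.988                   # age-usage correlation reported above
vif = 1.0 / (1.0 - r ** 2)
print(round(vif, 1))        # → 41.9 (41.8 in the article, from the unrounded r)
```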
Recall, however, that the main goal of the study was to be able to predict the value of the response at various combinations of age and usage. Will this be affected by multicollinearity as well? Consider prediction of the response near the middle of age and usage—for example, age = 4 and usage = 13.
For these values, the model with age and usage predicts a response of 22.28 (95% confidence interval for mean of [21.16, 23.41]), while the age-only model predicts a response of 22.20 (with 95% confidence interval of [21.30, 23.11]). Note the second prediction ignores the available value of usage.
While the point estimates are similar, the key difference between the two estimates is the width of the confidence intervals. The age-only model has an advantage, with a narrower width of 1.81 vs. 2.25 for the model with age and usage. Further note the age-only interval is contained completely within the interval based on age and usage.
This difference in widths becomes more pronounced as we move away from the center of the data. Consider predicting the response for (age, usage) = (1, 5), with the point estimates of 27.72 and 27.96 for the models with both covariates and age only, respectively. The 95% confidence intervals are (25.30, 30.14) vs. (26.45, 29.47), with widths of 4.84 and 3.02, respectively.
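This comparison of interval widths can be sketched in a few lines. The code below uses synthetic stand-in data (not the article's) and a normal quantile in place of the exact t quantile, so the numbers will not match the article's; it does, however, show the age-only interval coming out narrower at a point away from the center of the data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50

# Synthetic stand-ins: usage is nearly collinear with age, and strength
# truly depends on age only.
age = rng.uniform(1, 7, size=n)
usage = 3.5 * age + rng.normal(0, 0.5, size=n)
strength = 30.0 - 2.0 * age + rng.normal(0, 1.0, size=n)

def mean_ci_halfwidth(X, y, x0, z=1.96):
    """Approximate 95% half-width for the mean response at x0
    (normal quantile used in place of the exact t quantile)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])   # residual variance estimate
    return z * np.sqrt(s2 * x0 @ np.linalg.inv(X.T @ X) @ x0)

X_full = np.column_stack([np.ones(n), age, usage])
X_age = np.column_stack([np.ones(n), age])

# Predict away from the center, echoing the article's (age, usage) = (1, 5)
# case; a usage value off the collinear "line" is where inflation is worst.
hw_full = mean_ci_halfwidth(X_full, strength, np.array([1.0, 1.0, 5.0]))
hw_age = mean_ci_halfwidth(X_age, strength, np.array([1.0, 1.0]))
print(round(hw_full, 2), round(hw_age, 2))
```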
The precision of estimation of model parameters and prediction are negatively affected by the inclusion of usage in the model. When presented with these results, my engineering colleague remained hesitant to exclude usage in the final model, saying, "But I worked so hard to collect that data!"
A few takeaways
There are several potential lessons to be learned from this example:
When we have choices about what explanatory variables to collect, we should consider the amount of new information additional variables provide.
In this example, if the engineer had been able to anticipate the high correlation between the two measures (age and usage), the additional expense of collecting the usage data might have been avoided. Additional explanatory variables should contribute new information and not be surrogates for the variables already in the model.
Another lesson: Even after our data are collected, using some model selection criteria (such as the Akaike information criterion,4 Mallows' Cp5 or stepwise regression6) to trim our model of unnecessary terms can improve the estimation of model parameters and prediction.
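As a sketch of that idea (again on synthetic stand-in data), the Akaike information criterion for competing least squares fits can be compared directly. When an extra variable carries little new information, its reduction in residual sum of squares is usually too small to offset the penalty of two per parameter:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50

# Synthetic stand-ins: strength truly depends on age only, while usage is
# a near-duplicate of age.
age = rng.uniform(1, 7, size=n)
usage = 3.5 * age + rng.normal(0, 0.5, size=n)
strength = 30.0 - 2.0 * age + rng.normal(0, 1.0, size=n)

def aic(X, y):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2*(#params)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return len(y) * np.log(rss / len(y)) + 2 * X.shape[1]

aic_full = aic(np.column_stack([np.ones(n), age, usage]), strength)
aic_age = aic(np.column_stack([np.ones(n), age]), strength)
print(round(aic_full, 2), round(aic_age, 2))  # lower is better
```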
STACK of statistics roundtables
This monthly QP column tackles a variety of topics related to the collection, analysis, interpretation, explanation and presentation of data. Pass along ideas for future columns by e-mailing firstname.lastname@example.org.
- George E.P. Box, J. Stuart Hunter and William G. Hunter, Statistics for Experimenters: Design, Innovation, and Discovery, Wiley, 2005, p. 173.
- Douglas C. Montgomery, Elizabeth A. Peck and Geoffrey G. Vining, Introduction to Linear Regression Analysis, Wiley, pp. 90-91.
- Lyman Ott and Michael T. Longnecker, A First Course in Statistical Methods, Thomson Brooks/Cole, 2004, p. 615.
- K.P. Burnham and D.R. Anderson, Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Springer-Verlag, 2002.
- Norman R. Draper and Harry Smith, Applied Regression Analysis, Wiley, 1998, pp. 332-334.
- Montgomery, Peck and Vining, Introduction to Linear Regression Analysis, see reference 2.
Christine Anderson-Cook is a research scientist at Los Alamos National Laboratory in Los Alamos, NM. She earned a doctorate in statistics from the University of Waterloo in Ontario, Canada. Anderson-Cook is a senior member of ASQ and a fellow of the American Statistical Association.