Find the best way to handle missing data in surveys
by I. Elaine Allen and Julia E. Seaman
The United States recently completed its 23rd federal population census. The first census was mandated by the U.S. Constitution and carried out under Thomas Jefferson in 1790. Until 1950, the census was conducted in person or by telephone, so the risk of missing data was minimal.
Since the census changed to a mail-response format, nonresponse rates for overall response, as well as unanswered questions, have increased. This problem is not isolated to the mail-in census, however, and affects most surveys—especially large-scale surveys—regardless of the format.
For example, statisticians and organizers of one large annual survey on entrepreneurship in the United States [1] encountered two issues related to nonresponse that affected the quality of the research:
- The overall refusal rate and the refusal rate for specific questions in random digit dialing (RDD) surveys continue to increase. In the 2008 survey, more than 25,000 calls were made to get 4,000 responses.
- The inability to reach cell phone-only users in the United States, where RDD sampling of cell phones is not permitted by law, produces a demographically biased sample. In addition, there were too few respondents in the 18 to 35 age group compared to the U.S. age distribution, requiring an oversampling of that group.
Thankfully, there are techniques to address missing values in survey data. To deal with missing or underreported groups in the overall survey results, weights can be used to produce a representative sample for a given population. More intricate imputation methods also have been developed to fill in missing data for specific questions, though using such techniques comes with implications that can affect statistical analyses.
Types of missing values
Missing values in a survey can be classified by the degree of randomness of the missing information. The easiest assumption that can be made, and the strongest, is that the data are "missing completely at random." This means no other information in the survey can help the researcher fill in the missing data: statistically, the respondent's completed answers carry no conditional information that would improve the imputation.
In this case, a random value from another respondent’s results can be used to fill in the holes. This assumption is unlikely to be completely satisfied, and a better imputation of the value can be obtained by using some of the respondent’s data.
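Under that assumption, filling a hole amounts to copying a random observed value from another respondent. A minimal sketch of this random-donor fill (the function name and sample ages are illustrative, not from the survey):

```python
import random

def impute_mcar(values, rng=random.Random(0)):
    """Fill each None with a random observed (donor) value.

    Valid only under the missing-completely-at-random assumption,
    since no other variable is consulted."""
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]

ages = [34, None, 52, 41, None, 28]
print(impute_mcar(ages))  # the two Nones become random observed ages
```

The seeded generator only makes the sketch reproducible; in practice each missing value gets an independent random donor.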
Another strong assumption is "data are missing at random." This assumption requires variables that can conditionally help fill in missing data and offer a range of values that provide a better model of the missing information.
For example, consider imputing missing ages of respondents based on the marked level of completed education: Ages 20-23 are equally likely to be used for college graduates, while ages 17-20 are common for high school graduates. Values in these ranges are chosen for the missing data based on the variable of highest completed education.
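The education-conditioned draw described above can be sketched as follows; the age ranges are the illustrative ones from the example, not values derived from any survey:

```python
import random

# Conditional ranges, keyed on highest completed education
# (assumed for illustration, matching the example in the text).
AGE_RANGE = {"high school": (17, 20), "college": (20, 23)}

def impute_age(education, rng=random.Random(2)):
    """Draw a missing age uniformly from the range conditioned on
    the respondent's completed education (missing-at-random model)."""
    lo, hi = AGE_RANGE[education]
    return rng.randint(lo, hi)

print(impute_age("college"))  # an integer between 20 and 23
```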
"Not missing at random" is the most likely type of data to be imputed. Knowing the other available data of the respondent, the researcher can impute the missing value—such as imputing an area code based on the respondent’s zip code—with a high degree of probability.
With any imputation procedure, bias in the analyses should be minimized while maximizing the use of available information for the researcher and giving reasonable estimates of variability and error.
The following techniques use values from other responders or intelligent guessing to fill in missing data:
Deletion of the respondent (listwise deletion) or pairwise deletion: These are the simplest ways to deal with missing data, but they can eliminate usable data along with the missing data and can produce biased results.
During analysis, there are three options: deleting the complete case, deleting the variable for all cases, or pairwise deletion, in which all available data contribute to each statistical summary but sample sizes may differ between analyses. Pairwise deletion, while not discarding entire respondents, may produce bias if the respondents with partial data differ markedly from those with complete data.
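The contrast between the two deletion strategies can be shown in a few lines (the two-variable records are invented for illustration):

```python
# Each row is one respondent: (age, income); None marks a missing answer.
rows = [(25, 40_000), (31, None), (None, 52_000), (45, 61_000)]

# Listwise deletion: drop any respondent with any missing value.
complete = [r for r in rows if None not in r]

# Pairwise deletion: each statistic uses every row where ITS variable
# is present, so different analyses end up with different sample sizes.
ages = [a for a, _ in rows if a is not None]
incomes = [i for _, i in rows if i is not None]

print(len(complete), len(ages), len(incomes))  # 2 3 3
```

Listwise deletion keeps only two of the four respondents, while pairwise deletion retains three observations for each variable, which is exactly why sample sizes stop agreeing across analyses.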
Hot-deck procedures: This technique uses the actual responses provided by other respondents in a study as the basis for assigning answers for missing information from a particular respondent. The easiest way to implement this overall imputation is to take a random respondent and enter their value for the missing data. A better way is to use a hot-deck procedure within characteristics that are known for the respondent with missing data.
For example, if gender, ethnicity and years of schooling have been completed but age is missing, a random respondent with the same gender, ethnicity and years of schooling is chosen from the respondents who match, and that respondent’s age is entered for the missing data.
Variants of this include hierarchical procedures in which matching variables are ranked so gender and years of schooling are more important than ethnicity when imputing age. The matches in which ethnicity is different but the important variables match exactly can be used to fill in the missing data.
The U.S. Census Bureau has used this technique for imputing missing values. In addition, John Stiller and Donald R. Dalzell have published a macro for implementing these techniques in SAS software [2].
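The within-cell hot-deck described above can be sketched as follows; the field names and the small donor pool are illustrative assumptions, not the Census Bureau's or Stiller and Dalzell's implementation:

```python
import random

def hot_deck(records, target, keys, rng=random.Random(1)):
    """Impute `target` for each record missing it by copying the value
    from a randomly chosen donor that matches on all `keys` fields."""
    donors = [r for r in records if r.get(target) is not None]
    out = []
    for r in records:
        if r.get(target) is None:
            pool = [d for d in donors
                    if all(d[k] == r[k] for k in keys)]
            if pool:  # leave the value missing if no donor matches
                r = {**r, target: rng.choice(pool)[target]}
        out.append(r)
    return out

people = [
    {"gender": "F", "school_yrs": 16, "age": 24},
    {"gender": "F", "school_yrs": 16, "age": 26},
    {"gender": "M", "school_yrs": 12, "age": 40},
    {"gender": "F", "school_yrs": 16, "age": None},  # to impute
]
filled = hot_deck(people, "age", ["gender", "school_yrs"])
print(filled[-1]["age"])  # 24 or 26, drawn from the matching donors
```

A hierarchical variant would retry with a reduced `keys` list (dropping the least important variable first) whenever the exact-match pool is empty.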
A related imputation technique, the cold-deck procedure, is similar but uses statistical summaries. We’ll discuss this later in the column.
Interpolation and extrapolation: This technique estimates the missing data through algebraic interpolation and, if data are assumed to take on a certain shape or distribution, using a function to impute the missing values.
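For ordered data, the simplest version is straight-line interpolation between the nearest observed neighbors. A minimal sketch (the function and sample series are illustrative):

```python
def linear_interpolate(series):
    """Fill interior None runs by straight-line interpolation between
    the nearest observed neighbors. No extrapolation is attempted, so
    leading or trailing Nones are left as-is."""
    out = list(series)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out

print(linear_interpolate([10, None, None, 16]))  # [10, 12.0, 14.0, 16]
```

If the data are assumed to follow a known distribution or functional form, the straight line would be replaced by that fitted function.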
Deductive imputation: This may be a qualitative or quantitative technique. Qualitatively, and useful for small surveys, the researcher may be able to read the results of the respondent and, with a high degree of confidence, impute the missing value.
For example, given a respondent’s address, the researcher may be able to impute ethnicity or home ownership based on the researcher’s knowledge of the area. Time consuming and nonprobabilistic, this method cannot be justified statistically.
The following techniques are designed to minimize bias, variance or both:
Substituting the mean or cold-deck procedures: These are easy and justifiable imputation techniques. Simple mean substitution will fill in any missing data for a particular variable with the mean of that variable from the entire population. Complex mean substitution will fill in missing data for a respondent with the mean of the variable conditionally related to the missing data, similar to the hot-deck technique.
For missing age values, for example, the overall mean age is imputed as a simple mean substitution. Using the mean age for all Asian female, high school graduate respondents for missing age data among that demographic is a more complex imputation procedure. In some cases, a level of randomness or stochasticity is achieved by adding a random value based on the age distribution.
Problems with this technique arise in calculating degrees of freedom and standard errors, because the imputed values are treated as if they were the respondent's actual data when they are, in fact, statistical estimates. By inflating the degrees of freedom and shrinking the standard error, this technique makes statistically significant results more likely. Many statistical software packages allow easy mean substitution for missing data, and some allow subgroup mean substitution derived from important conditional variables.
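The three flavors of mean substitution, simple, conditional, and stochastic, can be sketched together; the grouped ages are invented for illustration:

```python
import random
from statistics import mean, stdev

# Ages grouped by a conditioning variable (here, gender); None is missing.
ages = {"F": [22, 24, 25, None], "M": [40, 44, None]}

# Simple mean substitution: the overall mean fills every missing value.
observed = [a for grp in ages.values() for a in grp if a is not None]
overall = mean(observed)  # 31.0 for this toy data

rng = random.Random(0)

def conditional_impute(group, stochastic=False):
    """Fill missing values with the group (conditional) mean; with
    stochastic=True, add noise scaled to the group's spread."""
    seen = [a for a in group if a is not None]
    m, s = mean(seen), stdev(seen)
    fill = lambda: m + rng.gauss(0, s) if stochastic else m
    return [a if a is not None else fill() for a in group]

print(overall)
print(conditional_impute(ages["F"]))  # missing F age becomes about 23.7
```

The deterministic fills reproduce exactly the degrees-of-freedom problem noted above: every imputed value sits at the mean, so variability is understated unless the stochastic term is added.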
Regression and stochastic regression techniques: These implement a linear (or, theoretically, a nonlinear) model to predict the missing data. For these methods, a model is fit to the data predicting the variable with the missing values using all nonmissing data, and the predicted missing values are imputed.
An appealing result of this technique is that regression methods will result in not only a predicted value, but also a confidence bound for this value. The researcher can then substitute the mean and the extremes into the missing data to examine their effects on other analyses.
It is also easier than identifying the important variables related to the variable with missing data and calculating the related means, which may come from an extremely small group. Like the mean substitution method, however, this method increases the degrees of freedom in analyses, making any resulting statistical tests more likely to reach significance.
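A minimal sketch of deterministic regression imputation with one predictor, using an ordinary least-squares fit written out by hand (the schooling/income pairs are invented, and a stochastic variant would add a residual-sized random term to each prediction):

```python
from statistics import mean

def ols_impute(pairs):
    """Fit y = a + b*x on the complete (x, y) pairs, then impute each
    missing y from its observed x via the fitted line."""
    complete = [(x, y) for x, y in pairs if y is not None]
    xs, ys = zip(*complete)
    xbar, ybar = mean(xs), mean(ys)
    b = (sum((x - xbar) * (y - ybar) for x, y in complete)
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return [(x, y if y is not None else a + b * x) for x, y in pairs]

# Predict missing income (y) from years of schooling (x).
data = [(12, 30), (14, 38), (16, 46), (18, 54), (15, None)]
print(ols_impute(data)[-1])  # (15, 42.0)
```

A full statistical package would also return the confidence bound around each prediction, which is what lets the researcher substitute the extremes as well as the mean.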
Decision trees: This method, used for missing value substitution [3] in supervised machine learning and data-mining applications, is based on probabilities calculated from categorical variables (or continuous variables binned into categories). It is statistical but relies on machine-learning algorithms instead of researcher-created models.
While it may be a statistical technique, this method is designed for large data sets in which statistical testing is not appropriate. Clearly, if statistical methods were applied, it would suffer from the same increased likelihood of statistical significance as the methods mentioned earlier.
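As a toy stand-in for the decision-tree idea, the sketch below partitions records by one categorical predictor (a one-split "tree") and fills each missing target with the most probable class in that partition; a real tree learner would choose the splits itself and use several predictors:

```python
from collections import Counter

def tree_leaf_impute(records, target, predictor):
    """One-split stand-in for decision-tree imputation: group records
    by `predictor`, then fill each missing `target` with the most
    probable observed class inside the record's group (leaf)."""
    leaves = {}
    for r in records:
        if r[target] is not None:
            leaves.setdefault(r[predictor], Counter())[r[target]] += 1
    return [r if r[target] is not None
            else {**r, target: leaves[r[predictor]].most_common(1)[0][0]}
            for r in records]

rows = [
    {"edu": "college", "employed": "yes"},
    {"edu": "college", "employed": "yes"},
    {"edu": "college", "employed": "no"},
    {"edu": "hs", "employed": "no"},
    {"edu": "college", "employed": None},  # to impute
]
print(tree_leaf_impute(rows, "employed", "edu")[-1]["employed"])  # yes
```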
Table 1 illustrates all these techniques using the 2008 U.S. Global Entrepreneurship Monitor (GEM) Survey [4]. The actual age of the respondent, 25, was removed to test the different analytical methods against the known value. The imputed values for her age vary from 22 to 48 years, with most methods within three years of the true value.
The results show:
- These data are not missing at random.
- Statistical and nonstatistical techniques can be equally accurate.
The actual mean age in the GEM Survey was 48 (range of 18 to 99), the female mean age was 43 (range of 18 to 78), and the mean for a female college graduate with two years of work experience was 24 (range of 19 to 25).
These techniques provide simple to complex methods for imputing missing data, ranging from completely researcher-intensive to fully machine-learning driven.
All of these methods can be extended from one missing value to a multiple imputation technique for multiple missing values per respondent. But beware: the greater the percentage of imputed data in the sample, the larger the error bars that must be drawn around any inferences made on analysis results.
Remember, the methods used and the percentage of missing data imputed must be disclosed as part of the assumptions in any results reported. When used wisely, however, techniques correcting for missing data can broaden analyses and strengthen results.
References
1. Abdul Ali, I. Elaine Allen, Candida Bygrave Brush and William D. DeCastro, et al., "What Entrepreneurs Are Up To: The 2008 National Entrepreneurial Assessment for the United States of America," The Global Entrepreneurship Monitor, 2009, www.gemconsortium.org.
2. John Stiller and Donald R. Dalzell, "Hot-Deck Imputation With SAS Arrays and Macros for Large Surveys," Proceedings of the 23rd Northeast SAS Users Group Meeting, 1997, www.nesug.org/proceedings/nesug97/posters/stiller.pdf.
3. Bhekisipho Twala, "An Empirical Comparison of Techniques for Handling Incomplete Data Using Decision Trees," Applied Artificial Intelligence, 2009, Vol. 23, pp. 373-408.
4. Ali, Allen, Brush, DeCastro, et al., "What Entrepreneurs Are Up To: The 2008 National Entrepreneurial Assessment for the United States of America," see reference 1.

Bibliography
- Barbara Lepidus Carlson, Brenda G. Cox and Linda S. Bandeh, "SAS Macros Useful in Imputing Missing Survey Data," Proceedings of the 20th Annual SAS Users Group International Conference, 1995, Cary, NC, pp. 1089-1094.
- Shona Fielding, Peter M. Fayers, Alison McDonald, Gladys McPherson and Marion K. Campbell, "Simple Imputation Methods Were Inadequate for Missing Not at Random (MNAR) Quality of Life Data," Health and Quality of Life Outcomes, 2008, Vol. 6, No. 57.
- Ting Hsiang Lin, "Missing Data Imputation in Quality-of-Life Assessment: Imputation for WHOQOL-BREF," Pharmacoeconomics, 2006, Vol. 24, No. 9, pp. 917-925.
- Innis G. Sande, "Imputation in Surveys: Coping with Reality," The American Statistician, 1982, Vol. 36, No. 3, pp. 145-152.
- Joseph L. Schafer and John W. Graham, "Missing Data: Our View of the State of the Art," Psychological Methods, 2002, Vol. 7, No. 2, pp. 147-177.
I. Elaine Allen is research director of the Arthur M. Blank Center for Entrepreneurship, director of the Babson Survey Research Group and professor of statistics and entrepreneurship at Babson College in Wellesley, MA. She earned a doctorate in statistics from Cornell University in Ithaca, NY. Allen is a member of ASQ.
Julia E. Seaman is a doctoral student in pharmacogenomics at the University of California, San Francisco, and a statistical consultant for the Babson Survey Research Group at Babson College. She earned a bachelor’s degree in chemistry and mathematics from Pomona College in Claremont, CA.