Article 33: Find the best way to handle missing data in surveys
by I. Elaine Allen and Julia E. Seaman
The United States recently completed its 23rd federal population census. The first census was mandated by the U.S. Constitution and carried out under Thomas Jefferson in 1790. Until 1950, the census was conducted in person or by telephone, so the risk of missing data was minimal.
Since the census changed to a mail-response format, both overall nonresponse and the number of unanswered questions have increased. This problem is not isolated to the mail-in census, however. It affects most surveys, especially large-scale ones, regardless of format.
For example, statisticians and organizers of one large annual survey on entrepreneurship in the United States [1] encountered two issues related to nonresponse that affected the quality of the research:
In addition, there were too few respondents in the 18 to 35 age groups when compared to the U.S. age distribution, thus requiring an oversampling of the 18 to 35 age groups.
Thankfully, there are techniques to address missing values in survey data. To deal with missing or underreported groups in the overall survey results, weights can be used to produce a representative sample for a given population. More complex imputation methods have also been developed to fill in missing data for specific questions, and using them has implications that can affect statistical analyses.
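The weighting idea can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the survey cited here: the age groups, the sample counts and the population shares are all hypothetical, and the weight for each group is simply its population share divided by its sample share.

```python
from collections import Counter

def poststratification_weights(sample_groups, population_share):
    """Weight for each group = population share / sample share."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {g: population_share[g] / (counts[g] / n) for g in counts}

# Hypothetical sample in which the 18-35 group is underrepresented.
sample = ["18-35"] * 2 + ["36+"] * 8
# Hypothetical population shares (not taken from the article).
population_share = {"18-35": 0.3, "36+": 0.7}

weights = poststratification_weights(sample, population_share)
for group, weight in weights.items():
    print(group, round(weight, 3))
```

Respondents in the underrepresented group receive a weight above 1 and the others a weight below 1, so the weighted sample matches the assumed population shares while the weights still sum to the sample size.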
Types of missing values
Missing values in a survey can be classified by the degree of randomness of the missing information. The easiest, and strongest, assumption that can be made is that the data are "missing completely at random." This means no other information in the survey can help the researcher fill in the missing data. Statistically, there is not enough information in the respondent’s completed data to build a conditional probability that would improve the imputation of the missing data.
In this case, a random value from another respondent’s results can be used to fill in the holes. This assumption is unlikely to be completely satisfied, and a better imputation of the value can be obtained by using some of the respondent’s data.
Another strong assumption is "data are missing at random." This assumption requires variables that can conditionally help fill in missing data and offer a range of values that provide a better model of the missing information.
For example, consider imputing missing ages of respondents based on the marked level of completed education: Ages 20-23 are equally likely to be used for college graduates, while ages 17-20 are common for high school graduates. Values in these ranges are chosen for the missing data based on the variable of highest completed education.
"Not missing at random" is the most likely type of data to be imputed. Knowing the other available data of the respondent, the researcher can impute the missing value with a high degree of probability, for example by imputing an area code based on the respondent’s zip code.
With any imputation procedure, bias in the analyses should be minimized while maximizing the use of available information for the researcher and giving reasonable estimates of variability and error.
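The education-based example from the missing-at-random discussion above can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the `age_imputed` flag and the function names are hypothetical, and ages are drawn uniformly from the two ranges quoted in the text.

```python
import random

# Age ranges per education level, taken from the example in the text.
AGE_RANGE_BY_EDUCATION = {
    "college": (20, 23),
    "high_school": (17, 20),
}

def impute_age(education, rng):
    """Draw an age uniformly from the range linked to the education level."""
    low, high = AGE_RANGE_BY_EDUCATION[education]
    return rng.randint(low, high)  # both ends are inclusive

def fill_missing_ages(respondents, seed=0):
    """Return copies of the records with missing ages (None) imputed."""
    rng = random.Random(seed)
    filled = []
    for person in respondents:
        person = dict(person)  # leave the original record untouched
        if person["age"] is None:
            person["age"] = impute_age(person["education"], rng)
            person["age_imputed"] = True  # flag the value for later disclosure
        filled.append(person)
    return filled

# Hypothetical respondents; None marks a missing age.
respondents = [
    {"education": "college", "age": 34},
    {"education": "college", "age": None},
    {"education": "high_school", "age": None},
]

for person in fill_missing_ages(respondents):
    print(person)
```

Flagging each imputed value makes it straightforward to report later what share of the data was imputed.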
Nonstatistical imputation
The following techniques use values from other responders or intelligent guessing to fill in missing data:
Deletion of the respondent or pairwise deletion: These are the simplest ways to deal with missing data. But they can eliminate usable data along with the missing data and can produce biased results.
During analysis, there is the option of deleting the complete case, deleting the variable for all cases, or pairwise deletion, in which all available data are considered for estimation and contribute to statistical summaries but may create different sample sizes between different analyses. Pairwise deletion, while not deleting entire respondents, may produce bias if the respondents with partial data are markedly different from those with complete data.
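The difference between deleting complete cases and pairwise deletion can be seen in a small sketch. The survey records and variable names below are hypothetical; the point is only that the two approaches use different subsets of respondents and therefore different sample sizes.

```python
from statistics import mean

# Hypothetical survey records; None marks an unanswered question.
survey = [
    {"age": 25, "income": 40, "hours": 38},
    {"age": None, "income": 55, "hours": 45},
    {"age": 41, "income": None, "hours": 50},
    {"age": 33, "income": 62, "hours": None},
    {"age": 58, "income": 71, "hours": 40},
]

def listwise(records):
    """Complete-case deletion: keep only respondents with no missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def pairwise_mean(records, variable):
    """Pairwise deletion: use every available value of one variable."""
    values = [r[variable] for r in records if r[variable] is not None]
    return mean(values), len(values)

complete = listwise(survey)
print("complete cases:", len(complete))
print("mean age, complete cases:", mean(r["age"] for r in complete))
print("mean age, pairwise (mean, n):", pairwise_mean(survey, "age"))
```

Here complete-case deletion keeps two of five respondents, while each pairwise summary uses four, so summaries for different variables rest on different respondents.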
Hot-deck procedures: This technique uses the actual responses provided by other respondents in a study as the basis for assigning answers for missing information from a particular respondent. The easiest way to implement this overall imputation is to take a random respondent and enter their value for the missing data. A better way is to use a hot-deck procedure within groups defined by characteristics that are known for the respondent with missing data.
For example, if gender, ethnicity and years of schooling have been completed but age is missing, a random respondent with the same gender, ethnicity and years of schooling is chosen from the respondents who match, and that respondent’s age is entered for the missing data.
Variants of this include hierarchical procedures in which matching variables are ranked so gender and years of schooling are more important than ethnicity when imputing age. The matches in which ethnicity is different but the important variables match exactly can be used to fill in the missing data.
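A hierarchical hot-deck procedure of the kind described above might look like the sketch below. It is not the Census Bureau's implementation or the SAS macro; the donor records, the field names and the `MATCH_LEVELS` ranking are hypothetical. The function tries the strictest set of matching variables first and falls back to the more important ones only.

```python
import random

# Ranked matching rules, strictest first (hypothetical ranking from the text:
# gender and schooling matter more than ethnicity when imputing age).
MATCH_LEVELS = [
    ("gender", "schooling", "ethnicity"),
    ("gender", "schooling"),
]

def hot_deck_age(target, donors, match_levels, rng):
    """Return (age, keys used) from a random donor matching the target."""
    for keys in match_levels:
        pool = [
            d for d in donors
            if d["age"] is not None and all(d[k] == target[k] for k in keys)
        ]
        if pool:
            return rng.choice(pool)["age"], keys
    return None, ()  # no donor matched at any level

# Hypothetical donor pool.
donors = [
    {"gender": "F", "ethnicity": "Asian", "schooling": 16, "age": 24},
    {"gender": "F", "ethnicity": "White", "schooling": 16, "age": 29},
    {"gender": "M", "ethnicity": "Asian", "schooling": 16, "age": 41},
    {"gender": "F", "ethnicity": "Asian", "schooling": 12, "age": None},
]

rng = random.Random(0)
target = {"gender": "F", "ethnicity": "Hispanic", "schooling": 16, "age": None}
print(hot_deck_age(target, donors, MATCH_LEVELS, rng))
```

For this target no donor matches on all three variables, so the age comes from one of the two donors who match on gender and schooling.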
The U.S. Census Bureau has used this technique for imputing missing values. In addition, John Stiller and Donald R. Dalzell have published a macro for implementing these techniques in SAS software [2].
A related imputation technique, the cold-deck procedure, is similar but uses statistical summaries. We’ll discuss this later in the column.
Interpolation and extrapolation: This technique estimates the missing data through algebraic interpolation or, if the data are assumed to take on a certain shape or distribution, by using a function to impute the missing values.
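As a minimal sketch of the interpolation case, assume the values are ordered, for instance one respondent's answers across equally spaced survey waves (an assumption that is not part of the text). Interior gaps are filled on a straight line between the nearest observed neighbors; leading and trailing gaps are left alone because filling them would require extrapolation.

```python
def interpolate_missing(values):
    """Fill interior None entries by linear interpolation."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over consecutive pairs of observed positions.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

# Hypothetical series with two interior gaps and one trailing gap.
series = [10, None, None, 16, None]
print(interpolate_missing(series))  # [10, 12.0, 14.0, 16, None]
```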
Deductive imputation: This may be a qualitative or quantitative technique. Qualitatively, and usefully for small surveys, the researcher may be able to read the results of the respondent and, with a high degree of confidence, impute the missing value.
For example, given a respondent’s address, the researcher may be able to impute ethnicity or home ownership based on the researcher’s knowledge of the area. Time consuming and nonprobabilistic, this method cannot be justified statistically.
Statistical imputation
The following techniques are designed to minimize bias, variance or both:
Substituting the mean or cold-deck procedures: These are easy and justifiable imputation techniques. Simple mean substitution will fill in any missing data for a particular variable with the mean of that variable from the entire population. Complex mean substitution will fill in missing data for a respondent with the mean of the variable conditionally related to the missing data, similar to the hot-deck technique.
For missing age values, for example, the overall mean age is imputed as a simple mean substitution. Using the mean age for all Asian female, high school graduate respondents for missing age data among that demographic is a more complex imputation procedure. In some cases, a level of randomness or stochasticity is achieved by adding a random value based on the age distribution.
Problems with this technique arise in the calculation of the number of degrees of freedom or standard errors of any analyses because the imputed values are included as the respondent’s data. They are, in fact, statistical measures rather than observed responses.
By increasing the number of degrees of freedom or decreasing the standard error, this technique is more likely to produce statistically significant results. Many statistical software packages allow easy mean substitution for missing data. Some allow for subgroup mean substitution derived from important conditional variables.
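The effect on the standard error can be checked with a short sketch. The ages, the education groups and the function names below are hypothetical. The block shows simple mean substitution, subgroup (conditional) mean substitution, and a comparison of the standard error of the mean before and after imputation.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

def simple_mean_fill(values):
    """Replace every None with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def subgroup_mean_fill(records, group_key, variable):
    """Replace None with the mean of the respondent's own subgroup."""
    groups = defaultdict(list)
    for r in records:
        if r[variable] is not None:
            groups[r[group_key]].append(r[variable])
    filled = []
    for r in records:
        r = dict(r)
        if r[variable] is None and groups[r[group_key]]:
            r[variable] = mean(groups[r[group_key]])
        filled.append(r)
    return filled

def standard_error(values):
    """Standard error of the mean, treating every value as observed."""
    return stdev(values) / sqrt(len(values))

# Hypothetical ages; three of ten respondents did not answer.
ages = [22, 25, None, 31, None, 40, 47, None, 52, 63]
observed = [a for a in ages if a is not None]
imputed = simple_mean_fill(ages)

print("SE, observed only:", round(standard_error(observed), 2))
print("SE, after mean fill:", round(standard_error(imputed), 2))
```

The mean is unchanged by the substitution, but the standard error shrinks because the sample size grows while the imputed values add no spread, which is the source of the inflated significance described above.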
Regression and stochastic regression techniques: These implement a linear (or, theoretically, a nonlinear) model to predict the missing data. For these methods, a model is fit to the data predicting the variable with the missing values using all nonmissing data, and the predicted missing values are imputed.
An appealing result of this technique is that regression methods will result in not only a predicted value, but also a confidence bound for this value. The researcher can then substitute the mean and the extremes into the missing data to examine their effects on other analyses.
It is also an easier method than identifying the important variables related to the variable with missing data and calculating the related means, which may come from an extremely small group. Similar to the mean substitution method, however, this method will increase the degrees of freedom in analyses, and any resulting statistical tests will more likely be statistically significant.
Decision trees: This method, used for missing value substitution [3] for supervised machine learning techniques in data-mining applications, is based on probabilities calculated using categorical variables (or variables binned to become categorical). It is statistical but relies on machine-learning algorithms instead of researcher-created models.
While it may be a statistical technique, this method is designed for large data sets in which statistical testing is not appropriate. Clearly, if statistical methods were applied, it would suffer from the same increased likelihood of statistical significance as the methods mentioned earlier.
Table 1 illustrates all these techniques using the 2008 U.S. Global Entrepreneurship Monitor (GEM) Survey [4]. The actual age of the respondent, 25, was removed to test the different analytical methods against the known value. The overall imputed values for her age vary from 22 to 48 years, with most methods falling within three years of the actual value.
Table 1
The results show:
• These data are not missing at random.
• Statistical and nonstatistical techniques can be equally accurate.
The actual mean age in the GEM Survey was 48 (range of 18 to 99), the female mean age was 43 (range of 18 to 78), and the mean for a female college graduate with two years of work experience was 24 (range of 19 to 25).
Full disclosure
These techniques provide simple to complex methods for imputing missing data. They range from completely researcher-intensive techniques to techniques driven entirely by machine learning.
All of these methods can be extended from one missing value to a multiple imputation technique for multiple missing values per respondent. But beware of using multiple imputation: The greater the percentage of imputed data in the sample, the larger the error bars that must be drawn around any inferences made on analysis results.
Remember, the methods used and the percentage of missing data imputed must be disclosed as part of the assumptions in any results reported. When used wisely, however, techniques correcting for missing data can broaden analyses and strengthen results.
I. Elaine Allen is research director of the Arthur M. Blank Center for Entrepreneurship, director of the Babson Survey Research Group and professor of statistics and entrepreneurship at Babson College in Wellesley, MA. She earned a doctorate in statistics from Cornell University in Ithaca, NY. Allen is a member of ASQ.
Julia E. Seaman is a doctoral student in pharmacogenomics at the University of California, San Francisco, and a statistical consultant for the Babson Survey Research Group at Babson College. She earned a bachelor’s degree in chemistry and mathematics from Pomona College in Claremont, CA.
