
Article 33: Find the best way to handle missing data in surveys

Last edited by 小编D on 2012-1-31 13:34

This article was translated by haochenxuehappy and proofread by zzelva.


Find the best way to handle missing data in surveys

Imputation Explanation

by I. Elaine Allen and Julia E. Seaman
The United States recently completed its 23rd federal population census. The first census was mandated by the U.S. Constitution and carried out under Thomas Jefferson in 1790. Until 1950, the census was conducted in person or by telephone, so the risk of missing data was minimal.

Since the census changed to a mail-response format, nonresponse rates, both for the survey as a whole and for individual questions, have increased. This problem is not isolated to the mail-in census, however, and affects most surveys, especially large-scale surveys, regardless of the format.

For example, statisticians and organizers of one large annual survey on entrepreneurship in the United States¹ encountered two issues related to nonresponse that affected the quality of the research:

  1. The overall refusal rate and the refusal rate for specific questions in random digit dialing (RDD) surveys continue to increase. In the 2008 survey, more than 25,000 calls were made to get 4,000 responses.
  2. The inability to reach cell phone-only users in the United States, where RDD sampling of cell phones is not permitted by law, produces a biased demographic. In addition, there were too few respondents in the 18 to 35 age groups when compared to the U.S. age distribution, thus requiring an oversampling of the 18 to 35 age groups.

Thankfully, there are techniques to address missing values in survey data. To deal with missing or underreported groups in the overall survey results, weights can be used to produce a representative sample for a given population. More complex imputation methods also have been developed to fill in missing data for specific questions. These methods can prove to be more intricate. In addition, using such techniques comes with implications that can affect statistical analyses.
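The weighting idea can be sketched in a few lines of Python. This is only an illustration of simple post-stratification, not the procedure used in the survey discussed here; the function name `poststratification_weights`, the two age groups and the population shares are all hypothetical.

```python
from collections import Counter

def poststratification_weights(sample_groups, population_shares):
    """Return one weight per respondent so that the weighted group shares
    in the sample match the (assumed known) population shares."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    weights = []
    for group in sample_groups:
        sample_share = counts[group] / n
        # Under-represented groups get weights above 1, over-represented below 1.
        weights.append(population_shares[group] / sample_share)
    return weights

# Hypothetical sample: the 18-35 group is 20% of respondents
# but is assumed to be 30% of the population.
sample = ["18-35"] * 2 + ["36+"] * 8
shares = {"18-35": 0.30, "36+": 0.70}
weights = poststratification_weights(sample, shares)
```

Each respondent in the under-sampled group counts for more than one person in weighted summaries, while the weights still sum to the sample size.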
Types of missing values
Missing values in a survey can be classified by the degree of randomness of the missing information. The easiest (and strongest) assumption that can be made is that the data are "missing completely at random." This means no other information in the survey can help the researcher fill in the missing data. Statistically, there is not enough information in the respondent’s completed data to create a conditional probability to improve the missing data.

In this case, a random value from another respondent’s results can be used to fill in the holes. This assumption is unlikely to be completely satisfied, and a better imputation of the value can be obtained by using some of the respondent’s data.

Another strong assumption is "data are missing at random." This assumption requires variables that can conditionally help fill in missing data and offer a range of values that provide a better model of the missing information.

For example, consider imputing missing ages of respondents based on the marked level of completed education: Ages 20-23 are equally likely to be used for college graduates, while ages 17-20 are common for high school graduates. Values in these ranges are chosen for the missing data based on the variable of highest completed education.

"Not missing at random" is the most likely type of data to be imputed. Knowing the other available data of the respondent, the researcher can impute the missing value with a high degree of probability, such as imputing an area code based on the respondent’s zip code.

With any imputation procedure, bias in the analyses should be minimized while maximizing the use of available information for the researcher and giving reasonable estimates of variability and error.

Nonstatistical imputation
The following techniques use values from other respondents or intelligent guessing to fill in missing data:

Deletion of the respondent or pairwise deletion: These are the simplest ways to deal with missing data. But they can eliminate usable data along with the missing data and can produce biased results.

During analysis, there is the option of deleting the complete case, deleting the variable for all cases, or pairwise deletion, in which all available data are considered for estimation and contribute to statistical summaries but may create different sample sizes between different analyses. Pairwise deletion, while not deleting entire respondents, may produce bias if the respondents with partial data are markedly different from those with complete data.
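The difference between deleting complete cases and pairwise deletion can be seen in a small Python sketch. The four respondents and the helper names `listwise` and `pairwise_mean` are hypothetical and serve only to show how the sample size changes between analyses.

```python
from statistics import mean

# Hypothetical respondents; None marks an unanswered question.
rows = [
    {"age": 25, "income": 40},
    {"age": None, "income": 55},
    {"age": 43, "income": None},
    {"age": 31, "income": 48},
]

def listwise(rows, variables):
    """Complete-case deletion: keep respondents who answered every variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def pairwise_mean(rows, variable):
    """Pairwise deletion: use every respondent who answered this variable."""
    values = [r[variable] for r in rows if r[variable] is not None]
    return mean(values), len(values)

complete = listwise(rows, ["age", "income"])
listwise_age = mean(r["age"] for r in complete)             # uses 2 respondents
pairwise_age, n_age = pairwise_mean(rows, "age")            # uses 3 respondents
pairwise_income, n_income = pairwise_mean(rows, "income")   # uses a different 3
```

Pairwise deletion keeps more data for each summary, but the age and income summaries are then based on different sets of respondents.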

Hot-deck procedures: This technique uses the actual responses provided by other respondents in a study as the basis for assigning answers for missing information from a particular respondent. The easiest way to implement this overall imputation is to take a random respondent and enter their value for the missing data. A better way is to use a hot-deck procedure within characteristics that are known for the respondent with missing data.

For example, if gender, ethnicity and years of schooling have been completed but age is missing, a random respondent with the same gender, ethnicity and years of schooling is chosen from the respondents who match, and that respondent’s age is entered for the missing data.

Variants of this include hierarchical procedures in which matching variables are ranked, so that gender and years of schooling are more important than ethnicity when imputing age. Matches in which ethnicity is different but the important variables match exactly can then be used to fill in the missing data.
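A hot-deck procedure with a hierarchical fallback might look roughly like the following Python sketch. It is not the Census Bureau's method or any published macro; the function `hot_deck_impute`, the donor records and the ordering of the matching variables are assumptions made for illustration.

```python
import random

def hot_deck_impute(target, donors, match_keys, field, rng=None):
    """Borrow `field` from a random donor who matches `target` on match_keys.

    match_keys is ordered from most to least important. If no donor matches
    on every key, the least important key is dropped and the search repeats.
    Returns the borrowed value and the keys that were actually matched.
    """
    rng = rng or random.Random()
    keys = list(match_keys)
    while True:
        pool = [d for d in donors
                if d.get(field) is not None
                and all(d[k] == target[k] for k in keys)]
        if pool:
            return rng.choice(pool)[field], keys
        if not keys:
            raise ValueError("no donor has a value for " + field)
        keys.pop()  # relax the least important matching variable

# Hypothetical donors and a respondent whose age is missing.
donors = [
    {"gender": "F", "schooling": 16, "ethnicity": "A", "age": 24},
    {"gender": "F", "schooling": 16, "ethnicity": "B", "age": 26},
    {"gender": "M", "schooling": 12, "ethnicity": "C", "age": 51},
]
target = {"gender": "F", "schooling": 16, "ethnicity": "C", "age": None}

age, matched_on = hot_deck_impute(
    target, donors, ["gender", "schooling", "ethnicity"], "age", random.Random(0)
)
```

Here no donor matches on all three variables, so ethnicity is relaxed and the age is drawn from the two donors who match on gender and schooling.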

The U.S. Census Bureau has used this technique for imputing missing values. In addition, John Stiller and Donald R. Dalzell have published a macro for implementing these techniques in SAS software.²

A related imputation technique, the cold-deck procedure, is similar but uses statistical summaries. We’ll discuss this later in the column.

Interpolation and extrapolation: This technique estimates the missing data through algebraic interpolation and, if data are assumed to take on a certain shape or distribution, by using a function to impute the missing values.
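As a minimal sketch of algebraic interpolation, the Python function below fills gaps in an ordered series by drawing a straight line between the nearest known values. It assumes equally spaced observations, and the name `interpolate_missing` and the sample series are hypothetical. Leading or trailing gaps are left alone, since filling them would be extrapolation and would need an assumed functional form.

```python
def interpolate_missing(values):
    """Fill interior None entries by linear interpolation between the
    nearest known neighbours; gaps at either end are left as None."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

# Hypothetical equally spaced measurements with gaps.
series = [10.0, None, None, 16.0, None, 20.0]
filled_series = interpolate_missing(series)
```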

Deductive imputation: This may be a qualitative or quantitative technique. Qualitatively, and useful for small surveys, the researcher may be able to read the results of the respondent and, with a high degree of confidence, impute the missing value.

For example, given a respondent’s address, the researcher may be able to impute ethnicity or home ownership based on the researcher’s knowledge of the area. Time consuming and nonprobabilistic, this method cannot be justified statistically.

Statistical imputation

The following techniques are designed to minimize bias, variance or both:

Substituting the mean or cold-deck procedures: These are easy and justifiable imputation techniques. Simple mean substitution will fill in any missing data for a particular variable with the mean of that variable from the entire population. Complex mean substitution will fill in missing data for a respondent with the mean of the variable conditionally related to the missing data, similar to the hot-deck technique.

For missing age values, for example, the overall mean age is imputed as a simple mean substitution. Using the mean age for all Asian female, high school graduate respondents for missing age data among that demographic is a more complex imputation procedure. In some cases, a level of randomness or stochasticity is achieved by adding a random value based on the age distribution.
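Simple and subgroup mean substitution can be sketched with one Python function. The function `mean_impute` and the records are hypothetical, and the optional random component uses Gaussian noise scaled by the subgroup's standard deviation, which is only one possible reading of "a random value based on the age distribution."

```python
import random
from statistics import mean, stdev

def mean_impute(rows, field, group_keys=(), rng=None):
    """Return a copy of rows with missing `field` filled by a (sub)group mean.

    With group_keys=() this is simple mean substitution. Otherwise the mean
    is taken over respondents sharing the same values for group_keys.
    If rng is given, Gaussian noise with the group's standard deviation is
    added (an assumed way of making the imputation stochastic).
    """
    def key(row):
        return tuple(row[k] for k in group_keys)

    observed_by_group = {}
    for row in rows:
        if row[field] is not None:
            observed_by_group.setdefault(key(row), []).append(row[field])

    result = []
    for row in rows:
        row = dict(row)  # do not modify the caller's data
        if row[field] is None:
            observed = observed_by_group[key(row)]  # KeyError if subgroup is empty
            value = mean(observed)
            if rng is not None and len(observed) > 1:
                value += rng.gauss(0, stdev(observed))
            row[field] = value
        result.append(row)
    return result

# Hypothetical respondents; the last one did not report an age.
rows = [
    {"gender": "F", "edu": "HS", "age": 19},
    {"gender": "F", "edu": "HS", "age": 21},
    {"gender": "M", "edu": "College", "age": 50},
    {"gender": "F", "edu": "HS", "age": None},
]
simple = mean_impute(rows, "age")[-1]["age"]                          # mean of 19, 21, 50
conditional = mean_impute(rows, "age", ("gender", "edu"))[-1]["age"]  # mean of 19, 21
```

The subgroup mean lands much closer to the ages of comparable respondents than the overall mean does, at the cost of relying on a smaller group.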

Problems with this technique arise in the calculation of the number of degrees of freedom or standard errors of any analyses, because the imputed data are counted as the respondent’s data when they are, in fact, statistical measures.

By increasing the number of degrees of freedom or decreasing the standard error, this technique is more likely to produce statistically significant results. Many statistical software packages allow easy mean substitution for missing data. Some allow for subgroup mean substitution derived from important conditional variables.
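The effect on the standard error is easy to reproduce. In the Python sketch below, with made-up ages and a made-up count of missing answers, filling the gaps with the mean leaves the estimate unchanged but makes it look more precise than the observed data justify.

```python
from math import sqrt
from statistics import mean, stdev

def standard_error(values):
    """Standard error of the mean, treating every value as a real observation."""
    return stdev(values) / sqrt(len(values))

observed = [22, 25, 31, 43, 48, 60]  # hypothetical answered ages
n_missing = 4                        # hypothetical number of unanswered cases

# Mean substitution: every missing age is replaced by the observed mean.
imputed = observed + [mean(observed)] * n_missing

se_observed = standard_error(observed)  # based on 6 real answers
se_imputed = standard_error(imputed)    # treats 10 values as if all were real
```

The imputed values add no spread but do add to the sample size, so `se_imputed` comes out smaller than `se_observed` even though no new information was collected.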

Regression and stochastic regression techniques: These implement a linear (or, theoretically, a nonlinear) model to predict the missing data. For these methods, a model is fit to the data predicting the variable with the missing values using all nonmissing data, and the predicted missing values are imputed.

An appealing result of this technique is that regression methods will result in not only a predicted value, but also a confidence bound for this value. The researcher can then substitute the mean and the extremes into the missing data to examine their effects on other analyses.
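A one-predictor version of regression imputation can be sketched in plain Python. The data, the helper names `fit_line` and `regression_impute`, and the band of plus or minus two residual standard deviations are assumptions for illustration; the band is a rough stand-in for a proper prediction interval, not an exact one.

```python
import random
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sqrt(sse / (n - 2))

def regression_impute(x_new, a, b, residual_sd, rng=None):
    """Return the imputed value and a rough band around the fitted value.

    Without rng the fitted value itself is imputed (regression imputation).
    With rng a random residual is added (stochastic regression imputation).
    """
    fitted = a + b * x_new
    value = fitted if rng is None else fitted + rng.gauss(0, residual_sd)
    return value, (fitted - 2 * residual_sd, fitted + 2 * residual_sd)

# Hypothetical complete cases: years of schooling and age.
schooling = [10, 12, 14, 16, 18]
age = [19, 20, 24, 25, 27]
a, b, sd = fit_line(schooling, age)

# Impute the age of a respondent with 16 years of schooling.
value, (low, high) = regression_impute(16, a, b, sd)
```

Substituting `low`, `value` and `high` in turn into the missing cell is one way to check how sensitive later analyses are to the imputed value.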

It is also an easier method than identifying the important variables related to the variable with missing data and calculating the related means, which may come from an extremely small group. Similar to the mean substitution method, however, this method will increase the degrees of freedom in analyses, and any resulting statistical tests will more likely be statistically significant.

Decision trees: This method, used for missing value substitution³ for supervised machine learning techniques in data-mining applications, is based on probabilities calculated using categorical variables (or variables binned to become categorical). It is statistical but relies on machine-learning algorithms instead of researcher-created models.

While it may be a statistical technique, this method is designed for large data sets in which statistical testing is not appropriate. Clearly, if statistical methods were applied, it would suffer from the same increased likelihood of statistical significance as the methods mentioned earlier.

Table 1 illustrates all these techniques using the 2008 U.S. Global Entrepreneurship Monitor (GEM) Survey.⁴ The actual age of the respondent, 25, was removed to test the different analytical methods against the known value. The overall imputed values for her age vary from 22 to 48 years, with most methods within three years of her actual age.

Table 1





The results show:
• These data are not missing at random.
• Statistical and nonstatistical techniques can be equally accurate.

The actual mean age in the GEM Survey was 48 (range of 18 to 99), the female mean age was 43 (range of 18 to 78), and the mean for a female college graduate with two years of work experience was 24 (range of 19 to 25).

Full disclosure

These techniques provide simple to complex methods for imputing missing data. The techniques vary from completely researcher-intensive techniques to totally machine-learning driven techniques.

All of these methods can be extended from one missing value to a multiple imputation technique for multiple missing values per respondent. But, beware of using multiple imputation: The greater the percentage of imputed data in the sample, the larger the error bars must be drawn around any inferences made on analysis results.

Remember, the methods used and the percentage of missing data imputed must be disclosed as part of the assumptions in any results reported. When used wisely, however, techniques correcting for missing data can broaden analyses and strengthen results.

I. Elaine Allen is research director of the Arthur M. Blank Center for Entrepreneurship, director of the Babson Survey Research Group and professor of statistics and entrepreneurship at Babson College in Wellesley, MA. She earned a doctorate in statistics from Cornell University in Ithaca, NY. Allen is a member of ASQ.

Julia E. Seaman is a doctoral student in pharmacogenomics at the University of California, San Francisco, and a statistical consultant for the Babson Survey Research Group at Babson College. She earned a bachelor’s degree in chemistry and mathematics from Pomona College in Claremont, CA.

小编D (Guangzhou, Guangdong):

Hello!
How is the translation of article 33 coming along? Has the proofreading been finished? Once it is done, please send it directly to my email: 1005@6sq.net
