
Part 29: Consider simple parametric tests to find an outlier’s significance

This post was last edited by 小编D on 2012-5-10 15:14.

This article was translated by http://www.6sq.net/space-uid-429121.html and proofread by liphking.




by Julia E. Seaman and I. Elaine Allen

Even in the most basic introductory statistics courses, we teach students that outliers in a data set can pose significant problems. We often teach that visually examining the data can help identify outliers. Beyond detection, however, few statistics textbooks devote much time to the subject of statistically assessing outliers and their effect on final analyses.

Students and researchers alike find outliers difficult to recognize. When outliers are identified, there is no clear set of statistical tools or tests available to find an outlier’s significance. There are several outlier tests for parametric data, and we’ve applied them to chemical assay data to showcase the method.

Keep or remove?

Upon discovering a suspected outlier, the initial temptation has always been to eliminate the points from the data, with appropriate rationale, and simplify the analyses to make the results easy to explain. This method can be subjective, however, and may miss intricacies of the data. When there is more than one outlier or more than two variables in the analysis, the problem becomes more complex. Removing an outlier also can have large effects on any analysis of the data.

A good example to illustrate how outliers can affect an analysis, and even go undetected in an analysis, is Francis J. Anscombe’s regression models, which are almost identical for four data sets that are markedly different.1 Table 1 shows all analyses of the model.


Table 1

Regression equation: Y1 = 3.00 + 0.500 X1

Predictor   Coef     SE Coef   T      P
Constant    3.000    1.125     2.67   0.026
X1          0.5001   0.1179    4.24   0.002

Syx = 1.23660    R-Sq = 66.7%
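To make the point about near-identical fits concrete, here is a minimal sketch that fits an ordinary least-squares line to each of Anscombe's four data sets. The data values are taken from Anscombe's published quartet, not from this article, and the helper name `fit_line` is made up for illustration.

```python
# Anscombe's four data sets (values from his published quartet;
# they are not listed in this article).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


fits = {name: fit_line(x, y) for name, (x, y) in datasets.items()}
for name, (a, b) in fits.items():
    print(f"Set {name}: Y = {a:.2f} + {b:.3f} X")
```

All four fits round to the regression equation reported in Table 1, even though the underlying point clouds look nothing alike when plotted.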




Plotting the individual data sets shows how different they are, even with the same regression numbers. In particular, the plots show how much one outlier point can influence the analysis. From the plots of two of the data sets in Figures 1 and 2, it is clear the same regression line should not fit the points equally well, as it does, and that an outlier is evident in each plot.


Figure 1

Figure 2




It should be noted that some statistical software programs (for example, Minitab) report outliers in linear regression by flagging observations with large standardized residuals as a default in their standard regression output. Some software programs also have an option to provide plots of residuals versus the dependent values and probability plots of standardized residuals. This can help further identify outliers, but it may not be enough to statistically justify removing the data and may still miss some outliers.
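As a rough illustration of flagging by standardized residuals, the sketch below computes them for a simple linear regression and marks any observation whose absolute value exceeds 2. The cutoff of 2 and the function name `standardized_residuals` are assumptions made for this example, and the data are Anscombe's third data set (from his published quartet, not from this article); this is not a reproduction of any particular package's output.

```python
import math

# Anscombe's third data set (published values; not listed in this article).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def standardized_residuals(x, y):
    """Standardized residuals e_i / (s * sqrt(1 - h_i)) for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom.
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))
    result = []
    for xi, e in zip(x, resid):
        leverage = 1 / n + (xi - mean_x) ** 2 / sxx
        result.append(e / (s * math.sqrt(1 - leverage)))
    return result


CUTOFF = 2.0  # assumed flagging threshold for this sketch
std_resid = standardized_residuals(x, y)
flagged = [i for i, r in enumerate(std_resid) if abs(r) > CUTOFF]
print("Flagged observation indices:", flagged)
```

A flag of this kind only says a point sits far from the fitted line relative to the residual scatter; as the text notes, it does not by itself justify removing the point.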

Simple outlier tests

The majority of parametric outlier tests look at some measure of the relative distance of a particular data point to the mean of all the data points and assess the probability that the particular piece of data occurred by chance. Most tests are designed to look at individual or specific points, but several can be generalized to examine multiple data points, usually pairs. In addition, pairs of points or _n_-tuples of points may represent combinations of variables and may be difficult to identify with a simple test.2

Most parametric tests are generalizations or extensions of the original work by F.E. Grubbs,3 who derived several simple parametric tests that are used most frequently when testing for outliers. Grubbs tests can be given as follows (in which _xi_ denotes an individual data point, _s_ is the sample standard deviation and _n_ is the sample size):


$G_1 = \dfrac{|x_i - \bar{x}|}{s}$, in which $\bar{x}$ is the sample mean, looks for outliers in single points of data,

$G_2 = \dfrac{x_{\max} - x_{\min}}{s}$, finds outliers at the minimum and maximum of a distribution, and

_G3_ finds pairs of outliers at either extreme.
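The single-point test can be sketched in a few lines. This assumes the usual form of the statistic, the absolute distance of a point from the sample mean divided by the sample standard deviation, and takes the critical value as an input looked up from a Grubbs table. The function names and the sample values are made up for illustration.

```python
import statistics


def grubbs_g1(values):
    """Return (G1, index) for the point farthest from the sample mean."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1)
    idx = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return abs(values[idx] - mean) / s, idx


def grubbs_reject(values, g_crit):
    """Compare G1 with a critical value taken from a Grubbs table.

    Returns (reject, g1, index); reject is True when G1 > g_crit.
    """
    g1, idx = grubbs_g1(values)
    return g1 > g_crit, g1, idx


# Made-up sample with one obviously extreme value.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
g1, idx = grubbs_g1(sample)
print(f"G1 = {g1:.3f} for the point at index {idx}")
```

Passing the critical value in, rather than computing it, keeps the sketch free of external dependencies; the appropriate value depends on the sample size and significance level.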

Dixon’s Q test is similar to _G2_ for a small number of observations (between 3 and 25), and Rosner’s test is a generalization of the Grubbs test to detect up to _k_ outliers when the sample size is 25 or more.4

Grubbs test (_G1_) example

We used the simplest form of a Grubbs test to remove outliers in infrared (IR) spectroscopy research data. IR spectra were taken from mixtures of three organic compounds in solution, and the outliers needed to be removed before using the results in later chemometric analysis.

The purpose was to create a statistical model based on the spectra that can be used to determine unknown concentrations of compounds from an IR spectrum. By systematically removing the outliers, we start with cleaner data that will give us a better model and, ultimately, better results.

The mixtures were run in triplicate and included 1,501 data points of the IR spectrum. The samples were scanned in 2 cm⁻¹ increments from 450 cm⁻¹ to 4,400 cm⁻¹. The analyzed spectral region for all samples was 600 cm⁻¹ to 3,500 cm⁻¹. Prior to making a chemometric model to predict unknown concentration values, the spectra sets were validated and examined for outliers. All the spectra in the updated data set were mean-centered before analysis.

Outliers were identified using the Grubbs test. As shown in _G1_ above, this was done by finding the standard deviation for each data point across the triplicate spectroscopy values and then calculating the overall average standard deviation and the overall standard deviation of the data points’ standard deviations for the triplicate. For each group of triplicates, these overall standard deviations were used in the Grubbs test.

When an overall triplicate standard deviation was rejected, the three runs within the triplicate were analyzed using a jackknife technique. A single run was removed from the triplicate of the outlier group if it significantly lowered the overall standard deviation of the group. The Grubbs test was repeated as needed. All statistical tests were done at the 95% confidence level.
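One way to read the leave-one-out step is sketched below: for a triplicate, drop each run in turn, recompute the average point-by-point standard deviation of the two remaining runs, and pick the removal that lowers it the most. The spectra here are tiny made-up lists, the function names are hypothetical, and the sketch omits the significance check the authors applied before removing a run.

```python
import statistics


def mean_pointwise_sd(runs):
    """Average, over spectral points, of the SD across the given runs."""
    return statistics.mean(statistics.stdev(point) for point in zip(*runs))


def most_deviant_run(runs):
    """Leave each run out in turn.

    Returns (index, reduced_sd) for the removal that gives the lowest
    average point-by-point SD among the remaining runs.
    """
    scores = []
    for i in range(len(runs)):
        remaining = runs[:i] + runs[i + 1:]
        scores.append((mean_pointwise_sd(remaining), i))
    best_sd, best_index = min(scores)
    return best_index, best_sd


# Made-up triplicate: the third run disagrees with the other two.
runs = [
    [0.10, 0.20, 0.30, 0.40],
    [0.11, 0.21, 0.29, 0.40],
    [0.30, 0.50, 0.10, 0.70],
]
before = mean_pointwise_sd(runs)
index, after = most_deviant_run(runs)
print(f"Dropping run {index} lowers the average SD from {before:.3f} to {after:.3f}")
```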

In our IR data, the overall average for one triplicate group was 2.653, with a standard deviation of 2.888, and we calculated a Grubbs test statistic, _G1_, of 5.22. With a _Gcrit_ of 1.91 from a Grubbs table, _G1_ is greater than _Gcrit_, so the null hypothesis is rejected and the sample is declared an outlier.
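As a worked check of the comparison, and assuming the reported 2.653 and 2.888 play the roles of the mean and standard deviation in the single-point statistic, the size of the deviation being tested can be backed out from the reported value of 5.22:

```latex
G_1 = \frac{|x_i - \bar{x}|}{s}
\quad\Longrightarrow\quad
|x_i - 2.653| = G_1 \, s = 5.22 \times 2.888 \approx 15.08,
\qquad
G_1 = 5.22 > 1.91 = G_{\mathrm{crit}} .
```

Under that assumption, the tested value lies about 15 units from the mean, more than five standard deviations away, which is why the test rejects it so clearly.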

We recalculated the overall standard deviation with one spectrum removed to find which remaining two reduced it the most. After finding and removing the most different spectrum of the triplicate, the overall standard deviation of the sample dropped to 0.04, confirming the outlier behavior of the eliminated spectrum.

Options for analyses

Examination and detection of outliers is a key part of any data analysis. Analyses that include data that are unusually large or small compared to the rest of the data set run the risk of estimating models that are not representative or that introduce variability. Analyses that exclude these values without testing their significance as outliers may seriously bias a model.

Parametric tests should be used when there are sufficient data available, sufficient precision in the data and no genuinely long tails on the distribution that would cause a Grubbs test to flag successive points as outliers. A Grubbs test is easy to use and apply and, along with the graphical display of the data, can identify whether extreme data should be examined separately.




Julia E. Seaman is a researcher at Genentech in South San Francisco, CA. She earned a bachelor’s degree in chemistry and mathematics from Pomona College in Claremont, CA.
I. Elaine Allen is director of the Babson Survey Research Group and professor of statistics and entrepreneurship at Babson College in Wellesley, MA. She earned a doctorate in statistics from Cornell University in Ithaca, NY. Allen is a member of ASQ.

liphking (reputation: 2) (Qingpu District, Shanghai) Telecommunications, Engineer

:D
