
Part 29: Consider simple parametric tests to find an outlier’s significance

This post was last edited by 小编D on 2012-5-10 15:14

This article was translated by http://www.6sq.net/space-uid-429121.html and proofread by liphking.


by Julia E. Seaman and I. Elaine Allen

Even in the most basic introductory statistics courses, we teach students that outliers in a data set can pose significant problems. We often teach that visually examining the data can help identify outliers. Beyond detection, however, few statistics textbooks devote much time to the subject of statistically assessing outliers and their effect on final analyses.

Students and researchers alike find outliers difficult to recognize. When outliers are identified, there is no clear set of statistical tools or tests available to find an outlier’s significance. Still, there are several outlier tests for parametric data, and we’ve applied them to chemical assay data to showcase the method.

Keep or remove?

Upon discovering a suspected outlier, the initial temptation has always been to eliminate the point from the data (with appropriate rationale) and simplify the analyses to make the results easy to explain. This method can be subjective, however, and may miss intricacies of the data. When there is more than one outlier or more than two variables in the analysis, the problem becomes more complex. Removing an outlier also can have large effects on any analysis of the data.

A good example to illustrate how outliers can affect an analysis, and even go undetected in an analysis, is Francis J. Anscombe’s regression models, which are almost identical for four data sets that are markedly different.1 Table 1 shows all analyses of the model.

Table 1


Plotting the individual data sets shows how different they are, even with the same regression results. In particular, the plots show how much one outlier point can influence the analysis. From the plots of two of the data sets in Figures 1 and 2, it is clear the same regression line should not fit the points equally well, as it does, and that an outlier is evident in each plot.

Figure 1

Figure 2
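The point about identical regressions can be checked numerically. The sketch below fits an ordinary least-squares line to each of the four data sets. The data values are Anscombe's quartet as it is commonly published, not numbers taken from this article's Table 1, and `fit_line` is a helper name chosen here for illustration.

```python
# Anscombe's quartet as commonly published (values are not from this article).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


fits = {name: fit_line(x, y) for name, (x, y) in QUARTET.items()}
for name, (intercept, slope) in fits.items():
    print(f"Set {name}: y = {intercept:.2f} + {slope:.2f} x")
```

To two decimal places every set gives the same line, y = 3.00 + 0.50 x, even though sets III and IV each contain a single point that dominates the fit.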

It should be noted that some statistical software programs (for example, Minitab) report outliers in linear regression by identifying high standardized residual values as a default in their standard output for regression. Some software programs also have an option to provide plots of residuals versus the dependent values and probability plots of standardized residuals. This can help further identify outliers, but it may not be enough to statistically justify removing the data and may still miss some outliers.
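As a rough illustration of that kind of residual screening, the sketch below computes standardized (internally studentized) residuals for a simple linear regression and flags points whose absolute value exceeds 2. The cutoff of 2 is an assumed convention, not a formal test and not necessarily what any particular package uses. The data are Anscombe's third data set as commonly published, not values from this article.

```python
import math

# Anscombe's third data set as commonly published (not taken from this article).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def standardized_residuals(x, y):
    """Internally studentized residuals of a one-predictor linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom.
    rse = math.sqrt(sum(r ** 2 for r in resid) / (n - 2))
    # Leverage of each point in a one-predictor model.
    leverage = [1 / n + (xi - mx) ** 2 / sxx for xi in x]
    return [r / (rse * math.sqrt(1 - h)) for r, h in zip(resid, leverage)]


THRESHOLD = 2.0  # assumed cutoff; a screening convention, not a significance test
std_res = standardized_residuals(x, y)
flagged = [i for i, r in enumerate(std_res) if abs(r) > THRESHOLD]
print(flagged)
```

Only the point at x = 13 (index 2) is flagged here. A screen like this singles out candidates, but on its own it does not say whether removing them is statistically justified.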

Simple outlier tests

The majority of parametric outlier tests look at some measure of the relative distance of a particular data point to the mean of all the data points and assess the probability that a particular piece of data occurred by chance. Most tests are designed to look at individual or specific points, but several can be generalized to examine multiple data points, usually pairs. In addition, pairs of points or _n_-tuples of points may represent combinations of variables and may be difficult to identify with a simple test.2

Most parametric tests are generalizations or extensions of the original work by F.E. Grubbs,3 who derived several simple parametric tests that are used most frequently when testing for outliers. Grubbs tests can be given as follows (in which _xi_ denotes an individual data point, _s_ is the sample standard deviation and _n_ is the sample size):

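$G_1 = \dfrac{|x_i - \bar{x}|}{s}$, looks for outliers in single points of data,

$G_2 = \dfrac{x_{\max} - x_{\min}}{s}$, finds outliers at the minimum and maximum of a distribution, and

$G_3 = 1 - \dfrac{(n-3)\,s_{n-2}^{2}}{(n-1)\,s^{2}}$, finds pairs of outliers at either extreme (here $\bar{x}$ is the sample mean and $s_{n-2}$ is the standard deviation of the sample with the two suspect points removed).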
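The three statistics can be sketched in a few lines of Python. This assumes the usual textbook forms of the Grubbs statistics: G1 = |x_i - mean| / s, G2 = (max - min) / s, and G3 = 1 - (n - 3) s²(n-2) / ((n - 1) s²), where s²(n-2) is the variance after dropping the two suspect points. The function names are hypothetical, and critical values still have to come from a Grubbs table.

```python
import statistics


def grubbs_g1(data, x):
    """G1: distance of one point x from the sample mean, in units of s."""
    return abs(x - statistics.mean(data)) / statistics.stdev(data)


def grubbs_g2(data):
    """G2: sample range divided by s."""
    return (max(data) - min(data)) / statistics.stdev(data)


def grubbs_g3(data, upper=True):
    """G3: variance ratio after dropping the two largest (or two smallest) points."""
    n = len(data)
    ordered = sorted(data)
    trimmed = ordered[:-2] if upper else ordered[2:]
    ratio = statistics.variance(trimmed) / statistics.variance(data)
    return 1 - (n - 3) * ratio / (n - 1)


sample = [1, 2, 3, 4, 5]
g1 = grubbs_g1(sample, 5)
g2 = grubbs_g2(sample)
g3 = grubbs_g3(sample)
print(round(g1, 3), round(g2, 3), round(g3, 3))
```

For this evenly spaced sample the statistics come out at about 1.265, 2.530 and 0.8. Each is then compared with the tabulated critical value for the sample size and significance level in use.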

Dixon’s Q test is similar to _G2_ for a small number of observations (between 3 and 25), and Rosner’s test is a generalization of Grubbs test to detect up to _k_ outliers when the sample size is 25 or more.4
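For comparison, here is a minimal sketch of the simplest form of Dixon's ratio: the gap between the suspect value and its nearest neighbor, divided by the full range. Other variants of the ratio exist for larger samples, and the critical value must be looked up in a Dixon table. The function name `dixon_q` is chosen here for illustration.

```python
def dixon_q(data, upper=True):
    """Basic Dixon Q ratio: gap to the nearest neighbor divided by the range."""
    ordered = sorted(data)
    spread = ordered[-1] - ordered[0]
    if spread == 0:
        raise ValueError("all values are identical")
    if upper:
        gap = ordered[-1] - ordered[-2]  # suspect point is the maximum
    else:
        gap = ordered[1] - ordered[0]  # suspect point is the minimum
    return gap / spread


q = dixon_q([1, 2, 3, 4, 10])
print(round(q, 3))
```

Here the largest value sits 6 units above its neighbor in a range of 9, so Q is about 0.667.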

Grubbs test (_G1_) example

We used the simplest form of a Grubbs test to remove outliers in infrared (IR) spectroscopy research data. IR spectroscopy was taken from mixtures of three organic compounds in solution, and the outliers needed to be removed before using the results in later chemometric analysis.

The purpose was to create a statistical model based on the spectra that can be used to determine unknown concentrations of compounds from an IR spectroscopy. By systematically removing the outliers, we start with cleaner data that will give us a better model and, ultimately, better results.

The mixtures were run in triplicate and included 1,501 data points of the IR spectrum. The samples were scanned in 2 cm⁻¹ increments from 450 cm⁻¹ to 4,400 cm⁻¹. The analyzed spectral region for all samples was 600 cm⁻¹ to 3,500 cm⁻¹. Prior to making a chemometric model to predict unknown concentration values, the spectra sets were validated and examined for outliers. All the spectra in the updated data set were mean-centered before analysis.

Outliers were identified using the Grubbs test. As shown in _G1_ above, this was done by finding the standard deviation for each data point between the triplicate spectroscopy values and then calculating the overall average standard deviation and the overall standard deviation of the data points’ standard deviations for the triplicate. For each group of triplicates, these overall standard deviations were used in the Grubbs test.

When an overall triplicate standard deviation was rejected, the three runs within the triplicate were analyzed using a jackknife technique. A single run was removed from the triplicate of the outlier group if removing it significantly lowered the overall standard deviation of the group. The Grubbs test was repeated as needed. All statistical tests were done at the 95% confidence level.

In our IR data, the overall average for one triplicate group was 2.653, with a standard deviation of 2.888, and we calculated a Grubbs test statistic, _G1_, of 5.22. Because _G1_ is greater than the _Gcrit_ of 1.91 taken from a Grubbs table, the null hypothesis is rejected and the sample is declared an outlier.
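One possible reading of this procedure is sketched below, on made-up numbers. It assumes that each triplicate group is summarized by the mean of its point-by-point standard deviations, and that G1 is then computed across those group summaries. The toy spectra, the helper names and that interpretation are all assumptions, not the authors' code. The critical value 1.91 is simply the table value quoted in the text; in practice it depends on the number of groups and the significance level.

```python
import statistics


def group_spread(runs):
    """Mean of the point-by-point standard deviations across replicate runs."""
    return statistics.mean(statistics.stdev(point) for point in zip(*runs))


def grubbs_flag(values, g_crit):
    """Return (index, G1, is_outlier) for the value farthest from the mean."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    idx = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    g1 = abs(values[idx] - mean) / s
    return idx, g1, g1 > g_crit


# Hypothetical toy data: 6 triplicate groups, 4 spectral points per run.
base = [0.10, 0.40, 0.80, 0.30]
groups = [
    [[v + 0.01 * r * (g + 1) for v in base] for r in range(3)]
    for g in range(6)
]
groups[4][2] = [v + 2.0 for v in base]  # one deliberately deviant run

spreads = [group_spread(runs) for runs in groups]
# 1.91 is the table value quoted in the text, reused here only for illustration.
idx, g1, is_outlier = grubbs_flag(spreads, g_crit=1.91)
print(idx, round(g1, 2), is_outlier)
```

With these toy numbers the group containing the deviant run (index 4) has by far the largest spread and is the one flagged.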

We recalculated the overall standard deviation with one spectrum removed to find which remaining two reduced it the most. After finding and removing the most different run in the triplicate, the overall standard deviation of the sample dropped to 0.04, confirming the outlier behavior of the eliminated spectrum.
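The leave-one-out step can be sketched as follows, again on hypothetical data. Each run is dropped in turn, the spread of the remaining two is recomputed, and the run whose removal lowers the spread the most is the candidate for elimination. The article does not spell out how "significantly lowered" was judged, so this sketch only ranks the candidates.

```python
import statistics


def spread(runs):
    """Mean point-by-point standard deviation across the given runs."""
    return statistics.mean(statistics.stdev(point) for point in zip(*runs))


def jackknife_runs(runs):
    """Spread of the group with each run left out in turn."""
    return [spread(runs[:i] + runs[i + 1:]) for i in range(len(runs))]


# Hypothetical triplicate: the third run is offset from the other two.
triplicate = [
    [0.10, 0.40, 0.80, 0.30],
    [0.15, 0.45, 0.85, 0.35],
    [2.10, 2.40, 2.80, 2.30],
]
left_out = jackknife_runs(triplicate)
# The suspect run is the one whose removal leaves the smallest spread.
suspect = min(range(len(left_out)), key=left_out.__getitem__)
print(suspect, [round(s, 3) for s in left_out])
```

Dropping the third run (index 2) leaves two nearly identical spectra, so the spread collapses, while dropping either of the other two leaves it large.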

Options for analyses

Examination and detection of outliers is a key part of any data analysis. Analyses that include data that are unusually large or small compared to the rest of the data set run the risk of estimating models that are not representative or that introduce variability. Analyses that exclude these values without testing their significance as outliers may seriously bias a model.

Parametric tests should be used when there are sufficient data available, sufficient precision in the data and no genuinely long tails on the distribution that would identify successive outliers when a Grubbs test is applied. A Grubbs test is easy to use and apply and, along with the graphical display of the data, can identify whether extreme data should be examined separately.

Julia E. Seaman is a researcher at Genentech in South San Francisco, CA. She earned a bachelor’s degree in chemistry and mathematics from Pomona College in Claremont, CA.
I. Elaine Allen is director of the Babson Survey Research Group and professor of statistics and entrepreneurship at Babson College in Wellesley, MA. She earned a doctorate in statistics from Cornell University in Ithaca, NY. Allen is a member of ASQ.

