
Article 24: Influence and Effect

This post was last edited by 小编D on 2012-1-31 13:32

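Hello, I am 小编H (Editor H). Group members interested in proofreading the article below, please leave your estimated completion time and send a private message to 小编H, so that the editor can record the translator information and handle rewards and penalties once the article is finalized.
Translated by: chengguo0740, xy_persist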

Know how to measure the effect a data point has on a statistic
by Robert L. Mason and John C. Young

Being able to determine the effect a data point has on summary statistics provides useful insight into the construction of better parameter estimators.

Consider, for example, a person with an annual income of more than $1 million in a room with nine others with annual incomes in the $25,000 to $50,000 range. The average income for this group of 10 is greater than $100,000, but this is not a useful summary statistic of the typical income of most members of the group.
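A quick numeric check of this example, as a minimal sketch. The ten incomes below are made-up values chosen only to match the ranges in the text; they are not data from the article.

```python
from statistics import mean

# Hypothetical incomes: nine in the $25,000-$50,000 range, one above $1 million.
typical = [25_000, 28_000, 31_000, 34_000, 37_000, 40_000, 43_000, 46_000, 50_000]
outlier = 1_200_000

mean_without = mean(typical)            # average of the nine typical incomes
mean_with = mean(typical + [outlier])   # average once the high earner is included

print(f"mean without the high earner: {mean_without:,.0f}")
print(f"mean with the high earner:    {mean_with:,.0f}")
```

With these assumed values the group mean lands well above $100,000 even though nine of the ten incomes are at or below $50,000.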


From this example, you observe that the inclusion of an observation far removed from the bulk of the observations in a sample can have a great effect on the estimated overall mean. If included, the outlying observation actually pulls the average value of the group toward it.


In another example, consider the group of bivariate observations contained in the circle in Figure 1. Such a circular region of data points indicates that the correlation between the two variables, x1 and x2, is close to zero; that is, no linear relationship exists between the two variables.


Figure 1. Increasing the correlation with a single observation


Observe the labeled point in the upper right-hand corner of the plot but outside of the circle. As the distance between this point and the mean of the group of points in the circle increases along the drawn 45° line, the correlation between the two variables will increase and approach its maximum value of one. Thus, this single outlying observation can distort the estimated value of the true correlation.
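The effect described for Figure 1 can be sketched with simulated data. The code below is only an illustration under stated assumptions: the circular swarm is replaced by 100 random points in a unit disk, and the helper `pearson` and the distances tried are choices made here, not taken from the article.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Simulated stand-in for the circular swarm: 100 points uniform in the unit disk.
rng = random.Random(0)
xs, ys = [], []
while len(xs) < 100:
    x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
    if x * x + y * y <= 1:
        xs.append(x)
        ys.append(y)

base_r = pearson(xs, ys)
distances = [2, 5, 10, 50]
# Correlation after adding a single point at (d, d) on the 45-degree line.
rs = [pearson(xs + [d], ys + [d]) for d in distances]

print(f"no outlier: r = {base_r:.3f}")
for d, r in zip(distances, rs):
    print(f"outlier at ({d}, {d}): r = {r:.3f}")
```

As the single added point moves farther out along the line, the printed correlation climbs toward one, even though the other 100 points show essentially no linear relationship.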


The variance, $\sigma^2$, of a variable x is defined as the average of the squared deviation of that variable from its population mean, $\mu$. Consequently, the square of the distance that an outlying observation is from its mean, that is, $(x - \mu)^2$, can have a great impact on the estimated value of the variance parameter.


For example, including the outlying point in Figure 1 with the circular group of points in the plot will increase the variances of x1 and x2. This occurs because the outlying point causes the data to be spread wider in both dimensions.
In two dimensions, scatter plots can be constructed to show how one or more data points can change the estimate of the means, the variances and the correlation coefficient between the variables.


For example, Figure 2 contains four observations, labeled A, B, C and D, which are removed from the bulk of the data enclosed in the ellipse. The inclusion of points A or C will not affect the correlation coefficient because both support the linear trend in the data.


Figure 2. Four observations that will change the means, variances and/or correlation




Including point A, however, will increase the variances and decrease the means of x1 and x2, while including point C will increase the variances and the means of x1 and x2. The inclusion of points B or D will affect the correlation coefficient between the two variables because both lie in directions opposite of the linear trend of the data. In addition, including point B will decrease the mean but increase the variance of x1, while including point D will increase the mean and variance of x1.
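The pattern described for Figure 2 can be reproduced on simulated data. This is a sketch under assumptions: the elliptical swarm is simulated with a correlation near 0.8, and the coordinates given to A, B, C and D are hypothetical placements (A and C along the trend, B and D across it), since the figure itself is not available here.

```python
import math
import random
from statistics import mean, variance

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Simulated elliptical swarm with a positive linear trend (rho is an assumed value).
rng = random.Random(1)
rho = 0.8
x1, x2 = [], []
for _ in range(200):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    x1.append(z1)
    x2.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)

# Hypothetical coordinates standing in for the labeled points of Figure 2.
extra_points = {"A": (-4, -4), "B": (-4, 4), "C": (4, 4), "D": (4, -4)}

def summary(xs, ys):
    return {"mean_x1": mean(xs), "var_x1": variance(xs),
            "mean_x2": mean(ys), "var_x2": variance(ys), "r": pearson(xs, ys)}

base = summary(x1, x2)
changes = {}
for label, (px, py) in extra_points.items():
    new = summary(x1 + [px], x2 + [py])
    # Change in each statistic caused by adding this one point.
    changes[label] = {key: new[key] - base[key] for key in base}
    print(label, {key: round(val, 4) for key, val in changes[label].items()})
```

In this sketch, points placed along the trend move the correlation only slightly, while points placed across the trend lower it noticeably; every added point increases the variances.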


The influence function
Several mathematical procedures exist for determining the effect an observation has on these particular estimates. One popular procedure¹ used in developing robust statistical estimators is based on developing an influence curve or influence function for use as a measure of the effect that an observation has on the parameter being estimated.


When applied to the mean, the influence function is exactly what you would expect: a measure of the difference between an observation and the mean, $x - \mu$. Likewise, when applied to the variance, you obtain the expected answer that the influence function is based on the squared distance between an observation and the mean: $(x - \mu)^2 - \sigma^2$.
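These two expressions can be checked numerically. The sketch below uses a finite-sample stand-in chosen here, not taken from the article: it adds one observation x to a simulated sample of size n and scales the resulting change in the statistic by n + 1. For the mean this scaled change equals x minus the sample mean exactly; for the variance (with divisor n) it approaches the squared deviation minus the variance as n grows.

```python
import random
from statistics import fmean, pvariance

# Simulated data (assumed values, not from the article).
rng = random.Random(2)
sample = [rng.gauss(10, 2) for _ in range(1000)]
n = len(sample)
mu, sigma2 = fmean(sample), pvariance(sample)

def scaled_change(stat, x):
    """(n + 1) times the change in `stat` when x is added to the sample."""
    return (n + 1) * (stat(sample + [x]) - stat(sample))

for x in (10.0, 14.0, 20.0):
    print(f"x = {x}")
    print(f"  mean:     scaled change = {scaled_change(fmean, x):8.3f}, "
          f"x - mu = {x - mu:8.3f}")
    print(f"  variance: scaled change = {scaled_change(pvariance, x):8.3f}, "
          f"(x - mu)^2 - sigma^2 = {(x - mu) ** 2 - sigma2:8.3f}")
```

Note that an observation close to the mean has a negative influence on the variance: adding it pulls the estimated variance down.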


When applied to the sample correlation coefficient, r, between two variables, x1 and x2, the contours of the influence function are a set of hyperbolae given by the formula

$$ y_1 y_2 - \frac{r}{2}\left(y_1^2 + y_2^2\right) = c \qquad (1) $$

in which y1 and y2 are the studentized values of x1 and x2, and c is a chosen constant value. The selection of the value of c for drawing these contours is arbitrary (and chosen to include the bulk of the points), but nevertheless serves to identify observations removed from the data swarm.




Superimposing these hyperbolic contours over the corresponding scatter plot for y1 and y2 allows you to determine which observations are having the greatest effect on the estimate of the correlation coefficient. Points inside the hyperbolae will have influence function values greater than +c or less than –c. Points outside the hyperbolae will have influence function values between –c and +c. Figure 3 shows an example of these contours for a case in which c = ±2.7 and r = 0.81.
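To draw such contours, one needs the y2 values on each hyperbola for a given y1. The sketch below assumes the influence function of the correlation has the form y1·y2 − (r/2)(y1² + y2²), which reproduces the values quoted later in the article; the function names are hypothetical, and r must be nonzero for the division to work.

```python
import math

def corr_influence(y1, y2, r):
    """Assumed influence of the studentized point (y1, y2) on the correlation r."""
    return y1 * y2 - 0.5 * r * (y1 * y1 + y2 * y2)

def contour_y2(y1, r, c):
    """y2 values at which the influence equals c for this y1 (empty list if none)."""
    # Setting corr_influence(y1, y2, r) = c gives a quadratic in y2.
    disc = y1 * y1 * (1 - r * r) - 2 * r * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return [(y1 - root) / r, (y1 + root) / r]

r, c = 0.81, 2.7   # values quoted for Figure 3
for y1 in (-4.0, -2.0, 0.0, 2.0, 4.0):
    lower = [round(v, 2) for v in contour_y2(y1, r, -c)]
    upper = [round(v, 2) for v in contour_y2(y1, r, +c)]
    print(f"y1 = {y1:5.1f}  contour at -c: {lower}  contour at +c: {upper}")
```

Under this assumed form, the −c contour exists for every y1 and runs along the sides of the swarm, while the +c contour only appears for large |y1|, at the ends of the swarm.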


Figure 3. Influence function plot for the correlation coefficient




Detailed procedures exist for interpreting the data points in relation to the contour plots.²,³ Those points located on the side of the data swarm but inside the hyperbolae, such as point A in Figure 3, will decrease the value of the correlation coefficient. Those points located within the hyperbolae on the ends of the data swarm, such as points B and C in Figure 3, will increase the correlation coefficient.


Influence functions also can be used for detecting outliers in a bivariate sample.⁴ For example, any point located within the hyperbolae, such as points A, B and C in Figure 3, is a candidate for removal. In addition, the influence function value in (1) can be computed for any other observation in the sample.


Influence function example
Figure 4 contains the scatter plot of 212 observations selected at random from a bivariate normal distribution in which the variables are standardized with a correlation coefficient of 0.812. In this form, the correlation is the same as the covariance between the two variables. These observations are represented in Figure 4 by the points within the ellipse, excluding the one labeled point 2.


Figure 4. Scatter plot of the data, ellipse and hyperbolae


For illustrative purposes, two additional observations, point 1 with coordinates (-2, 2) and point 2 with coordinates (1, 1), have been added to the plot in Figure 4. Point 1 is outside the data swarm and inside the hyperbolae, indicating it could be an outlier and could possibly have influence on the computation of sample statistics.
This is verified in Table 1 by comparing the sample correlation coefficient values obtained with and without this point (while ignoring point 2). Including point 1 decreases the pairwise correlation between the two variables from 0.812 to 0.776. In addition, Table 1 includes the effect on the sample means and variances of y1 and y2. For y1, the absolute value of the mean increases and the standard deviation slightly increases when point 1 is included. For y2, the absolute value of the mean decreases, but the standard deviation slightly increases when point 1 is included.

Table 1. Summary statistics for point 1

The coordinates of point 1 are (-2, 2). With this point included in the original sample, the value of the correlation coefficient is r = 0.776. These coordinates and the correlation coefficient are needed to compute the value of the influence function for point 1 using the earlier equation. The computed value is -7.1, which is less than the chosen value of –c = -2.7. This result independently confirms what we see in Figure 4, namely that point 1 is inside the hyperbolae and has a decreasing effect on the correlation coefficient.
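The arithmetic can be retraced as follows, assuming the earlier equation takes the form $y_1 y_2 - \tfrac{r}{2}(y_1^2 + y_2^2)$ and treating the coordinates of point 1 as already studentized (both are assumptions made for this check):

```latex
\begin{align*}
\mathrm{IF}(-2,\, 2) &= (-2)(2) - \frac{0.776}{2}\left((-2)^2 + 2^2\right) \\
                     &= -4 - 0.388 \times 8 \\
                     &= -7.104 \approx -7.1 < -2.7
\end{align*}
```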


In contrast, point 2, with coordinates (1, 1), is contained within the data swarm (elliptical region) and outside of the hyperbolae in Figure 4. Thus, this point should have minimal effect on the sample estimates. This is confirmed when examining the results in Table 2.


Table 2. Summary statistics for point 2




When the summary statistics and the correlations are computed with and without this point (while excluding point 1), small differences are noted in all of the statistics. These results also are confirmed by the small value of the influence function for this point. The computed value using the earlier equation is 0.2, which is between –c = -2.7 and +c = +2.7, and is therefore outside the hyperbolae in Figure 4.
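The with-and-without comparisons behind Tables 1 and 2 can be imitated on simulated data. This is a sketch only: the 212 observations below are freshly simulated from a standardized bivariate normal distribution with an assumed correlation of 0.812, so the printed numbers will not match the article's tables.

```python
import math
import random
from statistics import fmean, stdev

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Simulated replacement for the article's sample (not the authors' data).
rng = random.Random(4)
rho = 0.812
y1, y2 = [], []
for _ in range(212):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    y1.append(z1)
    y2.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)

def summarize(a, b):
    """Return (mean y1, sd y1, mean y2, sd y2, correlation)."""
    return (fmean(a), stdev(a), fmean(b), stdev(b), pearson(a, b))

base = summarize(y1, y2)
with_p1 = summarize(y1 + [-2.0], y2 + [2.0])   # point 1 from the text
with_p2 = summarize(y1 + [1.0], y2 + [1.0])    # point 2 from the text

print("case            mean y1   sd y1  mean y2   sd y2       r")
for name, row in (("original", base), ("with point 1", with_p1), ("with point 2", with_p2)):
    print(f"{name:<14}" + "".join(f"{value:8.3f}" for value in row))

shift_p1 = with_p1[4] - base[4]   # change in r caused by point 1
shift_p2 = with_p2[4] - base[4]   # change in r caused by point 2
```

In this simulation, adding point 1 lowers the correlation by a few hundredths, while adding point 2 changes it only in the third or fourth decimal place.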


As can be seen from these examples, the influence function of a statistic is a key component in robust estimation because it helps you assess the influence that an observation has on the estimation of a statistic. It also is important in detecting outliers in the data.


References
  1. Frank R. Hampel, "The Influence Curve and its Role in Robust Estimation," Journal of the American Statistical Association, 1974, pp. 383–393.
  2. Susan J. Devlin, Ramanathan Gnanadesikan and John R. Kettenring, "Robust Estimation and Outlier Detection With Correlation Coefficients," Biometrika, 1975, pp. 531–545.
  3. Michael R. Chernick, "The Influence Function and its Application to Data Validation," American Journal of Mathematical and Management Sciences, 1982, pp. 263–288.
  4. Chernick, "The Influence Function and its Application to Data Validation," American Journal of Mathematical and Management Sciences, see reference 3.

Robert L. Mason is an institute analyst at Southwest Research Institute in San Antonio. He has a doctorate in statistics from Southern Methodist University in Dallas and is a fellow of ASQ and the American Statistical Association.

John C. Young is a retired statistics professor at McNeese State University in Lake Charles, LA. He received a doctorate in statistics from Southern Methodist University.


小编D: Hello, are you proofreading Article 24?
