
[Translation] Article 41: Leverage supplemental information to enhance data collectio

This post was last edited by Editor D on 2011-12-15 17:41

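Hello, this is Editor H. Group members interested in translating the following article, please leave your estimated completion time and send Editor H a private message, so that the editor can record the translator's information and handle rewards and penalties once the article is finally completed. Thank you for supporting the translation group!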

**A Sample Plan
Leverage supplemental information to enhance data collection**

by Christine M. Anderson-Cook and Lu Lu

Often, there are situations in which you might want to draw a representative sample from a finite population to characterize some aspects of its distribution.

Recall that a representative sample means the sample is typical of the overall population and shares many of the same characteristics of the population as a whole. More technically, a representative sample can mean each element in the population has an equal probability of appearing in the chosen sample.

In a recent production setting, there was interest in characterizing an attribute of the final product—the density—without completely inspecting all items. The standard practice for obtaining a good estimate of this product attribute was to select a simple random sample from the final parts.

Each day, a sample was taken and measurements were obtained to summarize the day’s production. Measuring the density of the parts, however, is costly and time consuming, so there was interest in increasing the precision of the estimates without increasing the sample size of 16 by somehow making the sampling procedure more efficient.

First, a few details about the current sampling plan: To obtain a simple random sample (SRS), you could number all the parts in the population (in this example, a day’s production) and use a random number generator (available in almost any statistical software program) to select a sample of the desired size from the population.

This is a fair and easy way to ensure all items are equally likely to be selected, which, in turn, increases the likelihood of obtaining a sample that is a good characterization of the population with similar attributes.
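
The numbering-and-drawing procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name `simple_random_sample`, the seed and the population size of 300 (borrowed from the example day discussed later in the article) are assumptions, not part of the original plan.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Return sorted part numbers (1..population_size) drawn without replacement."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so every part is equally
    # likely to end up in the sample.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

# Hypothetical day: 300 numbered parts, sample of 16.
chosen = simple_random_sample(300, 16, seed=1)
print(chosen)
```

Fixing the seed only makes the illustration repeatable; in routine use you would leave it unset.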

Enhancing precision
Returning to the problem of interest: Is there a way of obtaining greater precision for the product densities without increasing the expense of sampling and testing? It turns out that early in the production process, a preliminary measurement was taken that was somewhat correlated with the final density.
Figure 1 shows a scatterplot of the final density, Y, against this measurement, X, for a particular day in which complete inspection was performed. Figure 2 shows a histogram of the distribution of X for that same day with 300 units.

Figure 1

Figure 2

Long-term data indicated the correlation between the two measurements was approximately 0.74. Obtaining the measurements, X, is cheap and is already done as part of an established process control program. In addition, tracking the parts through the process is straightforward.

A common approach to improve the quality of the sample in survey sampling is stratified sampling (STS). Using some supplementary information, the population is divided into strata, which are subpopulations that are known proportions of the population. The number of elements selected from each stratum is then chosen to maintain the constant probability of each element being included in the overall sample.

Under this sampling design, the sample units are self-weighting, and the sample mean is an unbiased estimate of the population mean. When the strata are formed as homogeneous groups whose means differ as much as possible across groups, the sample mean estimator tends to be more precise than it is under SRS [1-3]. We can adapt this approach to our production process to try to leverage some advantage.
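
The standard textbook expressions behind these statements can be written out as follows. The notation is an assumption introduced here, not the authors': there are H strata, stratum h holds N_h of the N units and contributes n_h of the n sampled units, S_h^2 is the variance of Y within stratum h, S^2 is the overall variance of Y, and finite population corrections are ignored.

```latex
\begin{align*}
\bar{y}_{st} &= \sum_{h=1}^{H} W_h \,\bar{y}_h, \qquad W_h = \frac{N_h}{N},\\
n_h = n W_h \;&\Longrightarrow\; \bar{y}_{st} = \frac{1}{n}\sum_{h=1}^{H}\sum_{i=1}^{n_h} y_{hi},\\
\operatorname{Var}(\bar{y}_{st}) &\approx \frac{1}{n}\sum_{h=1}^{H} W_h S_h^2,
\qquad
\operatorname{Var}(\bar{y}_{SRS}) \approx \frac{S^2}{n},\\
S^2 &\approx \sum_{h=1}^{H} W_h S_h^2 + \sum_{h=1}^{H} W_h\,(\bar{Y}_h - \bar{Y})^2 .
\end{align*}
```

The second line is the self-weighting property: with proportional allocation, the stratified estimator is just the ordinary sample mean. The last line shows why strata with well-separated means help: the between-strata term is part of the SRS variance but is absent from the stratified variance.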

To illustrate the approach, the data in Figures 1 and 2 show how you might implement a good stratification algorithm. Using the X measurements for the whole population (the histogram in Figure 2), partition the population into groups with equal numbers of parts per group.

For example, if you wanted to create four groups of 75, the first group would contain the 75 units with the smallest values of X. The 75 units with the next smallest values of X would comprise the second group, and so on. Therefore, each of the four groups represents one-fourth of the total population for that day.

Then, sample one-fourth of the total sample size from each group and combine them to create a sample the same size as the simple random sample. In this example, you want a sample size of 16, so four units are randomly sampled from each of the four subpopulations.
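
The grouping and drawing steps above might look like this in Python. It is a minimal sketch under stated assumptions: the function name `stratified_sample` is hypothetical, the X values are made-up random numbers rather than the plant's data, and the population and sample sizes are assumed to divide evenly by the number of strata.

```python
import random

def stratified_sample(x_values, n_strata, sample_size, seed=None):
    """Rank units on X, cut the ranking into equal-sized strata,
    and draw the same number of units at random from each stratum."""
    if len(x_values) % n_strata or sample_size % n_strata:
        raise ValueError("this sketch assumes sizes divide evenly by n_strata")
    rng = random.Random(seed)
    # Unit indices ordered from smallest to largest preliminary measurement X.
    order = sorted(range(len(x_values)), key=lambda i: x_values[i])
    stratum_size = len(x_values) // n_strata
    per_stratum = sample_size // n_strata
    chosen = []
    for h in range(n_strata):
        stratum = order[h * stratum_size:(h + 1) * stratum_size]
        chosen.extend(rng.sample(stratum, per_stratum))
    return chosen

# Made-up X measurements for 300 units (illustration only).
data_rng = random.Random(0)
x = [data_rng.gauss(50, 5) for _ in range(300)]
sample_idx = stratified_sample(x, n_strata=4, sample_size=16, seed=1)
print(sorted(sample_idx))
```

The returned indices identify which units to track through the rest of the process and measure for density.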

More elaborate process
Granted, you have made the sampling procedure more complicated by needing to know the distribution of the Xs and tracking the units through the remainder of the process. But have you improved the precision?

Table 1 shows the results from a simulation based on the particular day’s data (shown in Figures 1 and 2) to demonstrate the benefits of using this more elaborate sampling process to obtain the density estimate. Because complete inspection was performed on this day, we know the true values of the population characteristics.

Table 1

To test the different sampling strategies, we repeatedly drew samples of 16 from the same population and calculated the mean, median and 10th percentile—a quality metric of interest.

Table 1 reports the average value of these quantities of interest across a large number (10,000) of samples, as well as the standard deviation of the measures.
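
A repeated-sampling comparison of this kind can be sketched as below. It does not reproduce Table 1: the population is synthetic (300 standard-normal X values with Y built to have a theoretical correlation of 0.74), only 2,000 repetitions are used to keep the run short, and all function names are hypothetical.

```python
import random
import statistics

def make_population(size, rho, rng):
    """Synthetic (x, y) pairs whose theoretical correlation is rho."""
    noise_scale = (1 - rho ** 2) ** 0.5
    pairs = []
    for _ in range(size):
        x = rng.gauss(0, 1)
        pairs.append((x, rho * x + noise_scale * rng.gauss(0, 1)))
    return pairs

def draw_y(ranked, n_strata, sample_size, rng):
    """Y values sampled equally from equal-sized strata on X.
    With n_strata=1 this is a simple random sample."""
    size = len(ranked) // n_strata
    ys = []
    for h in range(n_strata):
        stratum = ranked[h * size:(h + 1) * size]
        ys.extend(y for _, y in rng.sample(stratum, sample_size // n_strata))
    return ys

def sd_of_estimates(ranked, n_strata, sample_size, reps, rng):
    """Standard deviation of each estimate across repeated samples."""
    means, medians, p10s = [], [], []
    for _ in range(reps):
        ys = draw_y(ranked, n_strata, sample_size, rng)
        means.append(statistics.fmean(ys))
        medians.append(statistics.median(ys))
        p10s.append(statistics.quantiles(ys, n=10)[0])  # 10th percentile
    return {"mean": statistics.stdev(means),
            "median": statistics.stdev(medians),
            "p10": statistics.stdev(p10s)}

rng = random.Random(2011)
ranked = sorted(make_population(300, 0.74, rng))  # tuples sort by x first
results = {k: sd_of_estimates(ranked, k, 16, 2000, rng) for k in (1, 2, 4)}
for k, sds in results.items():
    print(k, "strata:", {name: round(v, 3) for name, v in sds.items()})
```

The exact numbers depend on the synthetic population and the seed, so they should be read as an illustration of the pattern rather than as the article's results.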

Because all of the methods (simple random sample, two strata and four strata) produce representative samples, you would expect all of them to give unbiased results. This appears to be true, as the mean of each of the quantities of interest across the many samples is close to the true value from the population. The noticeable difference between the approaches is in the standard deviation of each quantity of interest.

In each case, as we move from one to two to four strata, the precision of our estimates improves. Notice that the same pattern of reduced standard deviation occurs for the mean, median and the 10th percentile, showing that you’re likely to see improvements regardless of which characteristic of the distribution is important for a given application.

So how did this happen? When you draw a simple random sample, all of the items are equally likely to be selected. But for any particular sample, you might have slightly more large values or slightly more small values.

By stratifying, you make it less likely to get a badly unbalanced sample with too many units from any one group. This helps make all of the samples more similar, which translates into greater consistency of the estimated quantities, and hence, more efficiency of the sampling strategy.

As you increase the number of strata, you increase the amount of control over where in the distribution the Y values come from. This restricts the size of your sample-to-sample variation. Clearly, this requires more information and is slightly more complicated to implement, but it can further improve the precision.

STS vs. SRS
Now, the X value you had to work with was only moderately correlated with the density, Y, with a correlation of 0.74. Table 2 shows the gain in efficiency from using STS rather than SRS, measured as the ratio of the standard deviation under STS to the standard deviation under SRS, for populations with different magnitudes of correlation between X and Y.

Table 2

For example, with a correlation of 0.91, the standard deviation of the estimate of the mean using STS is about half the size (0.54) of its counterpart using SRS.

You can see that as the correlation gets stronger (closer to 1 or -1), the reduction in the standard deviations grows, because grouping the units into strata by X more closely matches grouping them by the size of Y.
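
A rough back-of-the-envelope calculation shows how the ratio for the mean depends on the correlation. The sketch below rests on assumptions that are not in the article: X is normally distributed, Y is linear in X with independent noise, the strata are equal-probability slices of X, and finite population corrections are ignored. The function names are hypothetical, and nothing here says how Table 2 was actually produced.

```python
from statistics import NormalDist

def between_share(n_strata):
    """Share of Var(X) lying between equal-probability strata, for normal X."""
    z = NormalDist()
    cuts = [z.inv_cdf(h / n_strata) for h in range(1, n_strata)]
    pdf_at = [0.0] + [z.pdf(c) for c in cuts] + [0.0]
    # Mean of a standard normal truncated to (a, b) is
    # (pdf(a) - pdf(b)) / P(a < X < b); each stratum has probability 1/n_strata.
    means = [(pdf_at[h] - pdf_at[h + 1]) * n_strata for h in range(n_strata)]
    return sum(m * m for m in means) / n_strata

def sd_ratio(rho, n_strata):
    """Approximate SD(mean under STS) / SD(mean under SRS)."""
    # Stratifying on X removes the between-strata part of the variance
    # that Y inherits from X, which is rho**2 times the between share.
    return (1 - rho ** 2 * between_share(n_strata)) ** 0.5

for rho in (0.5, 0.74, 0.91):
    print(rho, round(sd_ratio(rho, 4), 2))
```

Under these assumptions, four strata and a correlation of 0.91 give a ratio of roughly 0.54, which is in line with the value quoted above for the mean. The approximation applies to the mean only and says nothing about the median or the percentiles.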

Also note that the different characteristics of the distribution of Y achieve different gains as the magnitude of the correlation increases, with the central characteristics (mean and median) improving more than the tails of the distribution. Therefore, the more information the explanatory variable provides about the final density, the greater the advantage of stratified sampling.

Hence, this application achieved the desired goal: the sample size stayed the same, but choosing the units with a more complicated sampling plan increased the precision of the estimates.

This is advantageous because the major expense in the sampling process was in the testing of the characteristic of interest. A more complicated sampling plan was cheap in time and effort compared with the cost of measuring more units.

The trade-off between a more difficult sampling plan (creating the groups and tracking the parts) and the gains of increased precision must be balanced differently in different applications.

But knowing that leveraging additional information can provide a useful advantage does allow you to consider more options.

Christine M. Anderson-Cook is a research scientist at Los Alamos National Laboratory in Los Alamos, NM. She earned a doctorate in statistics from the University of Waterloo in Ontario. Anderson-Cook is a fellow of the American Statistical Association and a senior member of ASQ.

Lu Lu is a postdoctoral research associate at Los Alamos National Laboratory. She earned a doctorate in statistics from Iowa State University in Ames, IA.