## 2020

STATISTICS ROUNDTABLE

# A Sample Plan

## Leverage supplemental information to enhance data collection

by Christine M. Anderson-Cook and Lu Lu

Often, there are situations in which you might want to draw a representative sample from a finite population to characterize some aspects of its distribution.

Recall that a representative sample means the sample is typical of the overall population and shares many of the same characteristics of the population as a whole. More technically, a representative sample can mean each element in the population has an equal probability of appearing in the chosen sample.

In a recent production setting, there was interest in characterizing an attribute of the final product—the density—without completely inspecting all items. The standard practice for obtaining a good estimate of this product attribute was to select a simple random sample from the final parts.

Each day, a sample was taken and measurements were obtained to summarize the day’s production. Measuring the density of the parts, however, is costly and time consuming, so there was interest in increasing the precision of the estimates without increasing the sample size of 16 by somehow making the sampling procedure more efficient.

First, a few details about the current sampling plan: To obtain a simple random sample (SRS), you could number all the parts in the population (in this example, a day’s production) and use a random number generator (available in almost any statistical software program) to select a sample of the desired size from the population.

This is a fair and easy way to ensure all items are equally likely to be selected, which, in turn, increases the likelihood of obtaining a sample that is a good characterization of the population with similar attributes.
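The numbering-and-random-draw procedure described above can be sketched in a few lines. This is only an illustration, not the plant's actual tooling: the function name `simple_random_sample`, the seed and the day of 300 parts are assumptions made for the example.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Return sorted part numbers (1-based) chosen by simple random sampling."""
    rng = random.Random(seed)
    # sample() draws without replacement, so every part is equally likely
    # to be included and no part can be chosen twice.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

# Hypothetical day: 300 numbered parts, 16 of them to be measured.
chosen = simple_random_sample(300, 16, seed=1)
print(chosen)
```

Fixing the seed only makes the draw reproducible for auditing. In routine use you would likely leave it unset.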

### Enhancing precision

Returning to the problem of interest: Is there a way of obtaining greater precision for the product densities without increasing the expense of sampling and testing? It turns out that early in the production process, a preliminary measurement was taken that was somewhat correlated with the final density.

Figure 1 shows a scatterplot of the
final density, *Y*, against this measurement, *X*, for a particular day in
which complete inspection was performed. Figure 2 shows a histogram of the
distribution of *X* for that same day with 300 units.

Long-term data indicated the correlation between the two
measurements was approximately 0.74. Obtaining the measurements, *X*,
is cheap and is already done as part of an established process control program.
In addition, tracking the parts through the process is straightforward.

A common approach to improve the quality of the sample in survey sampling is stratified sampling (STS). Using some supplementary information, the population is divided into strata, which are subpopulations that are known proportions of the population. The number of elements selected from each stratum is then chosen to maintain the constant probability of each element being included in the overall sample.

Under this sampling design, the sample units are self-weighting,
and the sample mean is an unbiased estimate of the population mean. When the
strata are formulated as homogeneous groups with group means differing as much
as they can across groups, the sample mean estimator tends to have more
precision than using SRS.^{1-3} We can adapt this approach to our
production process to try to leverage some advantage.
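For readers who want the algebra behind this claim, the standard textbook results can be written as follows. The notation is introduced here for illustration and does not come from the article: $N$ units in the population, $H$ strata with weights $W_h = N_h/N$, a total sample of size $n$ with $n_h = nW_h$ units from stratum $h$, and sampling fraction $f = n/N$.

$$
\bar{y}_{st} = \sum_{h=1}^{H} W_h \,\bar{y}_h = \frac{1}{n}\sum_{h=1}^{H}\sum_{i=1}^{n_h} y_{hi},
\qquad
\operatorname{Var}(\bar{y}_{st}) = \frac{1-f}{n}\sum_{h=1}^{H} W_h S_h^2,
\qquad
\operatorname{Var}(\bar{y}_{SRS}) = \frac{1-f}{n}\,S^2
$$

Here $S^2$ is the population variance and $S_h^2$ is the variance within stratum $h$. When the strata are not too small, $S^2 \approx \sum_h W_h S_h^2 + \sum_h W_h(\bar{Y}_h - \bar{Y})^2$, so the stratified estimator drops the between-strata term. The more the stratum means differ, the larger the gain.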

To illustrate the approach, the data in Figures 1 and 2 show how
you might implement a good stratification algorithm. Using the distribution of
*X* shown in the histogram in Figure 2, partition the full population of parts
with *X* measurements into groups with equal numbers of parts per group.

For example, if you wanted to create four groups of 75, group the
75 units with the smallest values of *X*. The next smallest values of
*X* would comprise the second group, and so on. Therefore, each of
the four groups represents one-fourth of the total population for that day.

Then, sample one-fourth of the total sample size from each group and combine them to create a sample the same size as the simple random sample. In this example, you want a sample size of 16, so four units are randomly sampled from each of the four subpopulations.
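The grouping and sampling steps above can be sketched as follows. This is a minimal illustration that assumes the population and sample sizes divide evenly by the number of strata. The function name `stratified_sample` and the made-up *X* values are assumptions, not part of the original study.

```python
import random

def stratified_sample(x_values, n_strata, sample_size, seed=None):
    """Pick unit indices using equal-size strata formed by ranking on x."""
    n_units = len(x_values)
    if n_units % n_strata or sample_size % n_strata:
        raise ValueError("this sketch assumes sizes divide evenly by n_strata")
    rng = random.Random(seed)
    # Rank the units by the cheap early measurement X.
    order = sorted(range(n_units), key=lambda i: x_values[i])
    stratum_size = n_units // n_strata
    per_stratum = sample_size // n_strata
    chosen = []
    for h in range(n_strata):
        # Stratum h holds the next block of units in increasing order of X.
        stratum = order[h * stratum_size:(h + 1) * stratum_size]
        chosen.extend(rng.sample(stratum, per_stratum))
    return chosen

# Made-up X values for a hypothetical day of 300 units.
demo_rng = random.Random(0)
x = [demo_rng.gauss(0.0, 1.0) for _ in range(300)]
picked = stratified_sample(x, n_strata=4, sample_size=16, seed=1)
print(sorted(picked))
```

The returned indices identify which units to track through the rest of the process and send for density measurement.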

### More elaborate process

Granted, you have made the sampling procedure more complicated by needing to know the
distribution of the *X* values and tracking the units through the remainder of the process. But
have you improved the precision?

Table 1 shows the results from a simulation based on the particular day’s data (shown in Figures 1 and 2) to demonstrate the benefits of using this more elaborate sampling process to obtain the density estimate. Because complete inspection was performed on this day, we know the true values of the population characteristics.

To test the different sampling strategies, we repeatedly drew samples of 16 from the same population and calculated the mean, median and 10th percentile—a quality metric of interest.

Table 1 reports the average value of these quantities of interest across a large number (10,000) of samples, as well as the standard deviation of the measures.
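A simulation of this kind can be sketched as below. The day's real measurements are not reproduced here, so the code builds a synthetic population of 300 units with a correlation near 0.74. That population, the function names and the 2,000 repetitions (rather than 10,000, to keep the run short) are all assumptions for illustration, and the numbers it prints will not match Table 1.

```python
import random
import statistics

def make_population(n_units, rho, rng):
    """Synthetic (x, y) pairs whose correlation is near rho (an assumption)."""
    pairs = []
    for _ in range(n_units):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs

def draw_y(pairs_sorted_by_x, n_strata, sample_size, rng):
    """Sample y values from equal-size strata; n_strata=1 is an SRS."""
    size = len(pairs_sorted_by_x) // n_strata
    per = sample_size // n_strata
    ys = []
    for h in range(n_strata):
        stratum = pairs_sorted_by_x[h * size:(h + 1) * size]
        ys.extend(y for _, y in rng.sample(stratum, per))
    return ys

def summarize(ys):
    """Mean, median and 10th percentile of one sample."""
    p10 = statistics.quantiles(ys, n=10, method="inclusive")[0]
    return statistics.mean(ys), statistics.median(ys), p10

def simulate(pairs, n_strata, sample_size=16, n_reps=2000, seed=0):
    """Return (average, standard deviation) of each statistic over n_reps samples."""
    rng = random.Random(seed)
    ordered = sorted(pairs)  # tuples sort by x first
    results = [summarize(draw_y(ordered, n_strata, sample_size, rng))
               for _ in range(n_reps)]
    return [(statistics.mean(col), statistics.stdev(col)) for col in zip(*results)]

population = make_population(300, 0.74, random.Random(42))
for k in (1, 2, 4):  # SRS, two strata, four strata
    print(k, [(round(avg, 3), round(sd, 3)) for avg, sd in simulate(population, k)])
```

Each printed row lists the (average, standard deviation) pairs for the mean, median and 10th percentile under one sampling plan.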

Because all of the methods (simple random sample, two strata and
four strata) produce representative samples, you would expect all of them to
give unbiased results. This appears to be true, as the mean of each of the
quantities of interest across the many samples is close to the true value from
the population. The noticeable difference between the approaches
is in the standard deviation of each quantity of interest.

In each case, as we move from one to two to four strata, the precision of our estimates improves. Notice that the same pattern of reduced standard deviation occurs for the mean, median and the 10th percentile, showing that you’re likely to see improvements regardless of which characteristic of the distribution is important for a given application.

So how did this happen? When you draw a simple random sample, all of the items are equally likely to be selected. But for any particular sample, you might have slightly more large values or slightly more small values.

Stratifying makes it less likely that you will get a badly unbalanced sample with too many units from any one group. This makes the samples more similar to one another, which translates into greater consistency of the estimated quantities and, hence, a more efficient sampling strategy.

As you increase the number of strata, you increase the amount of
control over where you are getting the *Y* values. This restricts the
size of your sample-to-sample variation. Clearly, this requires more
information and is slightly more complicated to implement, but it can further
improve the precision.

### STS vs. SRS

Now, the *X* value you had to work with was only moderately correlated with
the density, *Y*, with a correlation of 0.74. Table 2 shows the efficiency
gained by using STS instead of SRS, measured as the ratio of the standard
deviation under STS to the standard deviation under SRS, for
populations with different magnitudes of correlation between *X*
and *Y*.

For example, with a correlation of 0.91, the standard deviation of the estimate of the mean using STS is about half the size (0.54) of its counterpart using SRS.

You can see that as the correlation gets stronger (closer to 1 or
-1), the reduction in the standard deviations grows because grouping the units
into strata by *X* more closely matches grouping them by the size of *Y*.
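The effect of correlation on the standard deviation ratio can be explored with a short sketch. It uses synthetic normal data, four strata and only the sample mean. The function name `sd_ratio` and all parameter defaults are assumptions for illustration, so the output will not reproduce Table 2.

```python
import random
import statistics

def sd_ratio(rho, n_units=300, n_strata=4, sample_size=16, n_reps=2000, seed=0):
    """SD of the STS sample mean divided by SD of the SRS sample mean."""
    rng = random.Random(seed)
    noise_scale = (1.0 - rho ** 2) ** 0.5
    # Synthetic population (an assumption): y = rho * x + noise.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_units)]
    pairs = sorted((x, rho * x + noise_scale * rng.gauss(0.0, 1.0)) for x in xs)
    ys = [y for _, y in pairs]  # y values ordered by their x
    size = n_units // n_strata
    per = sample_size // n_strata
    srs_means, sts_means = [], []
    for _ in range(n_reps):
        srs_means.append(statistics.fmean(rng.sample(ys, sample_size)))
        sts = []
        for h in range(n_strata):
            sts.extend(rng.sample(ys[h * size:(h + 1) * size], per))
        sts_means.append(statistics.fmean(sts))
    return statistics.stdev(sts_means) / statistics.stdev(srs_means)

for rho in (0.3, 0.6, 0.9):
    print(rho, round(sd_ratio(rho), 2))
```

A ratio below 1 means the stratified plan gives a more precise estimate of the mean than a simple random sample of the same size.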

Also note that different characteristics of the distribution
of *Y* achieve different gains as the magnitude of the correlation increases,
with the central characteristics (mean and median) improving more than the
tails of the distribution. Therefore, the more information the explanatory
variable provides for predicting the final density, the greater the advantage of
stratified sampling.

Hence, this application achieved the desired goal: The sample size stayed the same, but choosing the units with a more complicated sampling plan increased the precision of the estimates.

This is advantageous because the major expense in the sampling process was in testing the characteristic of interest. A more complicated sampling plan cost little in time and effort compared with measuring more units.

The trade-off between a more difficult sampling plan (creating the groups and tracking the parts) and the gains of increased precision must be balanced differently in different applications.

But knowing that leveraging additional information can provide a useful advantage does allow you to consider more options.

### References and note

1. William G. Cochran, *Sampling Techniques*, third edition, Wiley, 1977. More comprehensive discussions of stratified sampling are available in *Sampling Techniques*, as well as the sources listed in references 2 and 3.
2. Morris H. Hansen, William N. Hurwitz and William G. Madow, *Sample Survey Methods and Theory*, Wiley, 1953.
3. Carl Erik Särndal, Bengt S. Swensson and Jan H. Wretman, *Model Assisted Survey Sampling*, Springer, 1992.

**Christine M. Anderson-Cook** is a research scientist at Los Alamos
National Laboratory in Los Alamos, NM. She earned a doctorate in statistics
from the University of Waterloo in Ontario. Anderson-Cook is a fellow of the
American Statistical Association and a senior member of ASQ.

**Lu Lu** is a postdoctoral research
associate at Los Alamos National Laboratory. She earned a doctorate in
statistics from Iowa State University in Ames, IA.
