This month’s question
I am about to start collecting data for an analysis and am wondering: What is the right sample size?
Sadly, there is no single magical sample size that is right for every situation. Choosing the right sample size should be thought of as a trade-off between most effectively spending available resources (such as money, time and effort) and satisfactorily answering the question of interest. There are several important considerations when deciding on the right sample size for your data collection scenario:
- This should go without saying, but experience has taught me that having a clearly articulated goal for the data collection is an essential—but often neglected—part of deciding how much data to obtain. In addition to optimizing the sample size, it also ensures that the experiment has focus and a quantifiable measure of whether it was successful.
There are three common goals of many experiments: estimating a characteristic of the data’s distribution or a statistical model parameter, testing a hypothesis about that characteristic or parameter, or predicting a new observation. For each choice, more data lead to a more precise result. For example:
- More data = narrower confidence intervals for the characteristic or parameter.
- More data = additional power1 for the test (for example, the ability to detect smaller differences in the characteristic or parameter’s values).
- More data = narrower prediction intervals for the new observation.
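These three effects can be sketched with textbook normal-theory formulas (my own illustration, not from the column). The choices sigma = 1, 95% confidence and a mean shift of delta = 0.5 are arbitrary assumptions for the example:

```python
from statistics import NormalDist

# Illustrative sketch (assumed values): a normal response with known
# standard deviation sigma = 1, 95% confidence throughout.
z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value
sigma = 1.0

def ci_half_width(n):
    """Half-width of a 95% confidence interval for the mean."""
    return z * sigma / n ** 0.5

def power(n, delta=0.5, alpha=0.05):
    """Power of a two-sided z-test to detect a mean shift of delta."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    shift = delta * n ** 0.5 / sigma
    nd = NormalDist()
    return nd.cdf(shift - z_a) + nd.cdf(-shift - z_a)

def pi_half_width(n):
    """Half-width of a 95% prediction interval for one new observation."""
    return z * sigma * (1 + 1 / n) ** 0.5

for n in (10, 40, 160):
    print(n, round(ci_half_width(n), 3), round(power(n), 2),
          round(pi_half_width(n), 3))
```

Running this shows the confidence interval narrowing steadily with n, the power climbing toward 1, and the prediction interval narrowing only slightly, because the variability of the single new observation puts a floor under its width.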
- Depending on the goal of the experiment described earlier, it is helpful for the experimenter to quantify the requirements for considering the experiment a success. How narrow must the confidence interval for the characteristic or parameter be? How small a difference must the hypothesis test be able to detect? How precisely must you predict new observations? This component provides a baseline for how much money, time and effort you are willing to invest to achieve a particular level of precision.
Note that to obtain an absolute value for these questions, it often is necessary to quantify the natural variability of the quantity under consideration. This can be achieved with a pilot study to obtain some preliminary data.
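As a sketch of that workflow (the pilot data and target half-width below are invented for illustration), a small pilot sample can supply the variability estimate, which then feeds the standard sample-size formula n = (z·s/E)² for a 95% confidence interval of half-width E:

```python
import math
from statistics import NormalDist, stdev

# Hypothetical pilot data (invented for this example): a small
# preliminary run used only to estimate the natural variability.
pilot = [9.8, 10.4, 10.1, 9.6, 10.7, 10.0, 9.9, 10.3]
s = stdev(pilot)                 # pilot estimate of the standard deviation
z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value
E = 0.1                          # desired confidence interval half-width

# Sample size so the 95% CI for the mean has half-width about E.
n_required = (z * s / E) ** 2
print(round(s, 3), math.ceil(n_required))
```

A pilot of eight runs here suggests a full study of roughly 50 observations; with a different pilot estimate of the variability, the required size changes quadratically.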
- As the complexity of the experiment increases with more model parameters, a larger sample size generally is needed to support the analysis. Design of experiments experts, for example, talk about a saturated2 design as one with exactly one observation for each model parameter to be estimated. Supersaturated3 designs have more model parameters than observations and generally cannot estimate all parameters without a specialized analysis that requires additional assumptions.
For many experiments, staying away from saturated or supersaturated designs is advised, unless the runs required for the experiment are prohibitively expensive.
- There are diminishing returns as the sample size increases. For many situations, the precision of an estimated quantity is related to sample size through a fixed relationship. For a sample size of N, the standard deviation often is proportional to c/√N. Here, the constant, c, depends on the planned analysis and quantity being estimated. Hence, increasing the sample size from 10 to 20 should yield an improvement in precision of (c/√10 - c/√20) / (c/√10) = 1 - √(10/20) ≈ 0.293, or a 29.3% reduction in the standard deviation.
But if the sample size is increased from 100 to 110 (the same incremental increase in cost), the improvement is just 1 - √(100/110) ≈ 0.047, or a 4.7% reduction in the standard deviation. Figure 1 shows a curve with a typical shape for how the standard deviation shrinks as a function of sample size.
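One convenient feature of this arithmetic (a sketch under the c/√N assumption) is that the relative reduction in standard deviation, 1 - √(N1/N2), does not depend on the constant c at all, so the diminishing returns can be computed before any analysis is chosen:

```python
import math

# Under the assumption that precision scales as c / sqrt(N), the
# *relative* drop in standard deviation between two sample sizes is
# free of the constant c.
def relative_reduction(n1, n2):
    """Fractional drop in c/sqrt(N) when moving from n1 to n2 samples."""
    return 1 - math.sqrt(n1 / n2)

print(round(relative_reduction(10, 20), 3))    # doubling a small study
print(round(relative_reduction(100, 110), 3))  # same 10-run increment later
```

The same 10 additional runs buy roughly a 29% reduction early on but under 5% once 100 observations are already in hand.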
- The nature of the response also has an effect on the amount of data needed. Continuous measurements are richer in information than categorical data (for example, pass/fail). Hence, if the response is not continuous, larger sample sizes often are recommended to estimate quantities of interest adequately.
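For instance, with a pass/fail response, a common conservative sizing rule (a standard textbook formula, not specific to this column) uses the worst-case proportion p = 0.5; the target half-width of 0.05 below is an assumed example value:

```python
import math
from statistics import NormalDist

# Sizing a study to estimate a pass rate with a 95% confidence interval
# of half-width E, using n = z^2 * p * (1 - p) / E^2. The worst case
# p = 0.5 maximizes p * (1 - p) and is a common conservative default.
z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value
E = 0.05                         # assumed target half-width for the rate

n = math.ceil(z ** 2 * 0.5 * 0.5 / E ** 2)
print(n)
```

Several hundred pass/fail observations are needed for even this modest precision, whereas a continuous measurement of the same characteristic would typically get by with far fewer.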
- When estimating or evaluating a distribution’s characteristics, some characteristics are easier to estimate than others. Estimating the central tendency of a distribution with the mean or median, for example, can require only a small amount of data. If interest lies in the spread (variance, standard deviation and interquartile range), this requires a bit more data. If interest lies in the tails of the distribution (the fifth, 10th, 90th or 95th percentile), the amount of required data increases rapidly the further that you get from the center of the distribution.
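A small simulation (my own, with an assumed standard normal response) illustrates how much noisier a tail-percentile estimate is than the median at the same sample size:

```python
import random
import statistics

# Illustrative simulation: repeatedly sample from a standard normal and
# compare the sampling variability of the estimated median with that of
# the estimated 95th percentile at the same sample size.
random.seed(1)

def sd_of_estimate(stat, n, reps=2000):
    """Standard deviation of a sample statistic across many replicates."""
    estimates = []
    for _ in range(reps):
        sample = sorted(random.gauss(0, 1) for _ in range(n))
        estimates.append(stat(sample))
    return statistics.stdev(estimates)

p95 = lambda xs: xs[int(0.95 * (len(xs) - 1))]  # crude percentile rule

sd_median = sd_of_estimate(statistics.median, 100)
sd_tail = sd_of_estimate(p95, 100)
print(round(sd_median, 3), round(sd_tail, 3))
```

With 100 observations per sample, the 95th-percentile estimate is noticeably more variable than the median, so matching the median's precision in the tail requires a substantially larger sample.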
Find the right balance
When planning data collection, it is beneficial to have good information about the cost of obtaining the data and how sample size improves precision. Finding a balance point that is acceptable on both cost and precision will guide the final choice of sample size. Lastly, data collection does not happen in isolation. When assessing the cost of obtaining the data, it is helpful to think about the trade-off between collecting data for this study versus where else those resources might be used. Many times, data collection has a sequential nature, so reserving a sufficient portion of the budget for later objectives also should be considered as the study is planned.
- Paul D. Ellis, The Essential Guide to Effect Sizes: An Introduction to Statistical Power, Meta-Analysis and the Interpretation of Research Results, Cambridge University Press, 2010.
- R.L. Rechtschaffner, “Saturated Fractions of 2n and 3n Factorial Designs,” Technometrics, Vol. 9, No. 4, 1967, pp. 569-575.
- Bradley Jones, Dennis K.J. Lin and Christopher J. Nachtsheim, “Bayesian D-Optimal Supersaturated Designs,” Journal of Statistical Planning and Inference, Vol. 138, No. 1, 2008, pp. 86-92.
This response was written by Christine M. Anderson-Cook, statistician and scientist, Los Alamos National Laboratory, Los Alamos, NM.