What and When To Randomize
by Christine Anderson-Cook
One of the most highly stressed principles of statistical design of experiments is the need for proper randomization. Unfortunately, it is sometimes misunderstood and misapplied.
The motivation for randomization is to remove some of the subjectivity from the experiment and to offer protection from systematic but unknown or unaccounted for factors affecting the value of the response. For example, if you are interested in assessing the effect of an adjustment to your process, it would be a mistake to separately obtain all the data from one condition and then all the data from the other condition.
Suppose, unknown to you, there is a warm-up period for one of the machines involved in the process. The effect of the warm-up period on your response of interest would affect only the data from the first condition, and its effect would be confounded with the difference between the two conditions you are trying to assess. If, however, you randomized the order in which the data were collected from the two conditions, then the unknown warm-up period would influence both conditions, not just a single one.
Therefore, when you perform the analysis, the warm-up effect might increase the variance of the measurements you obtained—perhaps making it harder to find a significant difference between the two conditions—but it would likely not systematically bias your results to lead you to a false conclusion.
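The warm-up argument can be checked with a small simulation. The sketch below is illustrative only: the function names, the number of runs, the size of the warm-up shift and the noise level are all assumptions chosen for the demonstration, not values from any real process. The true difference between the conditions is set to zero, so any nonzero average estimate reflects bias from the run order.

```python
import random
import statistics


def simulate(randomize, n_per_condition=10, true_effect=0.0,
             warmup_runs=5, warmup_shift=2.0, noise_sd=0.5, seed=0):
    """Return one estimate of (condition 2 mean) - (condition 1 mean).

    All numeric defaults are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    # Unrandomized order: all of condition 1 first, then all of condition 2.
    order = [1] * n_per_condition + [2] * n_per_condition
    if randomize:
        rng.shuffle(order)

    responses = {1: [], 2: []}
    for run_index, condition in enumerate(order):
        value = rng.gauss(0.0, noise_sd)
        if condition == 2:
            value += true_effect
        if run_index < warmup_runs:
            # The unknown warm-up effect hits the earliest runs only.
            value += warmup_shift
        responses[condition].append(value)
    return statistics.mean(responses[2]) - statistics.mean(responses[1])


def average_estimate(randomize, n_sims=2000):
    """Average the estimated difference over many simulated experiments."""
    return statistics.mean(simulate(randomize, seed=s) for s in range(n_sims))


biased = average_estimate(randomize=False)
fair = average_estimate(randomize=True)
print(f"sequential order, average estimate: {biased:.2f}")
print(f"randomized order, average estimate: {fair:.2f}")
```

Under these assumptions, the sequential order shifts half of the condition-one runs by 2.0, so its estimates center near -1.0 even though the true difference is zero. The randomized order centers near zero, at the cost of more run-to-run variability.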
This practical protection from unknown causes through randomization is also the theoretical basis for the validity of any inference or testing you might perform. Since you knew ahead of time there were natural differences between the experimental units (the units to which you were applying the two treatments, condition one and condition two), randomization assured you that, on average, there would be relatively little difference between the two groups of experimental units before receiving the treatment. Any significant differences you saw could therefore be attributed to the difference between the conditions.
Three Common Mistakes
Appropriate randomization dramatically improves the quality of the data collected and allows you to make valid inferences about the causality of your treatments influencing the values observed for the response. As you try to implement an experiment with randomization, however, you could easily make one of several common mistakes and thereby make your experiment invalid or, at the very least, less effective.
Mistake one: randomizing which data to collect, not which observational units get which chosen input level. Consider a not entirely fictitious story from my consulting experience where an experimenter was interested in studying the effect of temperatures between 100 and 200 degrees on the response. The scientist came to see me after running the experiment involving 30 observations, proudly saying he had randomized the temperatures. Figure 1 shows the range of temperatures he used to collect his experimental data.
While his intentions may have been good, his execution significantly reduced the experiment’s effectiveness. Because he had randomly selected 30 temperatures instead of choosing them deliberately, the experiment was more difficult to run. There were no observations at the extremes of the range of interest (neither 100 nor 200 degrees had been selected), and the uneven spacing of the temperatures would not provide the best possible information about the relationship between the explanatory variable and the response. Also, because no temperature was measured twice, no estimate of pure error was possible.
This experiment would have been run more effectively had the experimenter consciously selected the particular temperatures he wanted to consider. For example, if he was unclear about the shape of the relationship between the input and output before running the experiment and was worried about detecting a phase change in the response at a particular temperature, he should have selected an equally spaced design with replicates to measure natural variability at different temperatures (see Figure 2).
However, if he thought the relationship would be more continuous with only a moderate amount of curvature, then a design with just three equally spaced temperatures would provide maximum power for detecting differences between the levels of the input variable (see Figure 3).
Once the appropriate levels of the input factor—temperature—had been chosen, then his randomization step would have involved determining which temperature each of the 30 experimental units would receive. The choice of factor levels should be made based on current understanding of the process and on what the nature of the relationship will likely look like. It should never be left to chance through randomization.
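As a minimal sketch of that two-step process, the Python below fixes the temperature levels first and only then randomizes which of the 30 experimental units receives which level. The function name `assign_levels` is hypothetical, and the specific levels and replicate counts are assumptions chosen to total 30 runs; they are not read from the article's figures.

```python
import random


def assign_levels(levels, replicates, seed=None):
    """Return a list whose i-th entry is the level given to experimental unit i.

    The levels are chosen deliberately by the experimenter; only the
    unit-to-level assignment is left to chance.
    """
    rng = random.Random(seed)
    plan = [level for level in levels for _ in range(replicates)]
    rng.shuffle(plan)
    return plan


# Assumed design 1: three equally spaced temperatures, 10 replicates each.
three_level_plan = assign_levels([100, 150, 200], replicates=10, seed=1)

# Assumed design 2: six equally spaced temperatures, 5 replicates each.
six_level_plan = assign_levels([100, 120, 140, 160, 180, 200],
                               replicates=5, seed=1)

print(three_level_plan)
print(six_level_plan)
```

Both plans cover the extremes of the range and include replicates at every level, so a pure-error estimate is available, while the order in which units receive the levels is still random.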
Mistake two: choosing a randomization approach that does not ensure balance or protection against changes in the amount of data collected. Consider a simple experiment in which you want to compare the relative effect of two drugs. The patients arrive into the study at different times and, therefore, are assigned to receive one of the two drugs at random.
One choice for randomization would be to flip a coin each time a new patient arrives and assign him or her to a particular drug based on whether the coin showed heads or tails. This, however, might leave the two drugs with noticeably different numbers of patients. A better approach would be to devise an assignment schedule based on the number of subjects planned for the study.
For example, if you know the study is designed to continue until 120 patients have been included, then the assignment schedule might include a balanced randomization for the first 20 patients, then the next 20 and so on. The randomization for the first 20 patients might look something like this: 2 1 2 2 1 1 2 2 1 2 1 1 2 1 1 2 2 1 1 2. By doing the randomization in groups of 20, the number of patients receiving each drug will not be too unbalanced if the study is terminated early or preliminary results are needed.
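The schedule described here can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names are hypothetical, the drugs are coded 1 and 2 as in the example sequence, and a coin-flip schedule is included only for comparison.

```python
import random


def blocked_schedule(n_patients, block_size, treatments=(1, 2), seed=None):
    """Build a schedule in which every block of `block_size` is balanced."""
    if block_size % len(treatments) != 0:
        raise ValueError("block_size must be a multiple of the number of treatments")
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_patients:
        # Each block holds every treatment equally often, in random order.
        block = list(treatments) * (block_size // len(treatments))
        rng.shuffle(block)
        schedule.extend(block)
    return schedule[:n_patients]


def coin_flip_schedule(n_patients, treatments=(1, 2), seed=None):
    """Assign each arriving patient independently, with no balance guarantee."""
    rng = random.Random(seed)
    return [rng.choice(treatments) for _ in range(n_patients)]


def max_running_imbalance(schedule):
    """Largest |count of drug 1 - count of drug 2| at any stopping point."""
    diff = 0
    worst = 0
    for treatment in schedule:
        diff += 1 if treatment == 1 else -1
        worst = max(worst, abs(diff))
    return worst


blocked = blocked_schedule(n_patients=120, block_size=20, seed=42)
flipped = coin_flip_schedule(n_patients=120, seed=42)
print("blocked, worst imbalance at any point:", max_running_imbalance(blocked))
print("coin flips, worst imbalance at any point:", max_running_imbalance(flipped))
```

With blocks of 20, the two drug counts can never differ by more than 10 at any point and are exactly equal after every 20th patient. The coin-flip schedule carries no such guarantee.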
The best randomization for this experiment might have separate randomization assignments for patients based on other demographic information, which would ideally be balanced across the allocation to drugs. For example, based on a patient profile, it would be easy to stratify the patients into four categories based on gender and whether their condition is severe or mild. This would lead to four randomization schedules: male-severe, female-severe, male-mild and female-mild.
This way, you could consciously include known factors that might affect the performance of the drugs in the study and not have to adjust for them after the experiment. Generally, if there are known factors that might affect the treatment and can be measured before the assignment to treatments, then you should include this stratification in the design of the experiment.
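One way to picture the four schedules is as four independent balanced lists, one per stratum, each consumed as patients of that profile arrive. The sketch below assumes small balanced blocks of four within each stratum and a hypothetical function name, `stratified_schedules`; the block size and number of blocks are illustrative choices, not recommendations.

```python
import random
from itertools import product


def stratified_schedules(strata, n_blocks, block_size=4,
                         treatments=(1, 2), seed=None):
    """Return {stratum: balanced randomization schedule} for every stratum."""
    if block_size % len(treatments) != 0:
        raise ValueError("block_size must be a multiple of the number of treatments")
    rng = random.Random(seed)
    schedules = {}
    for stratum in strata:
        schedule = []
        for _ in range(n_blocks):
            block = list(treatments) * (block_size // len(treatments))
            rng.shuffle(block)
            schedule.extend(block)
        schedules[stratum] = schedule
    return schedules


# The four strata named in the text: gender crossed with severity.
strata = list(product(("male", "female"), ("severe", "mild")))
schedules = stratified_schedules(strata, n_blocks=5, block_size=4, seed=7)

# Each stratum hands out assignments from its own schedule as patients arrive.
queues = {stratum: iter(schedule) for stratum, schedule in schedules.items()}
first_female_mild = next(queues[("female", "mild")])
print("first female-mild patient receives drug", first_female_mild)
```

Because each stratum has its own balanced schedule, the two drugs end up with similar mixes of gender and severity without any adjustment after the fact.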
Mistake three: implementation and subsequent analysis of the experiment does not match the intended design protocol. Suppose you are interested in running a simple experiment to study the effect of two factors, each at two levels, on your response of interest. Your software package yields the order in Table 1 for your 2² factorial experiment with eight observations.
Notice the level of factor one does not change between observations one and two and observations five and six. Similarly, for factor two, the level does not change between observations two, three and four and observations seven and eight.
If you collect the first two observations and do not reset the level of factor one between the runs (because skipping the reset might appear to introduce less variability and is simpler to run), then the response values for this pair of observations will be correlated with each other. The same is true for observations five and six, observations two, three and four, and observations seven and eight. This is called an inadvertent split-plot and requires a different, more complicated analysis to correctly estimate the error terms in the model [1].
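A run order can be screened for this risk before any data are collected. The sketch below flags every pair of consecutive runs in which a factor keeps the same level. The function name is hypothetical, and the run order is a hypothetical one constructed only to match the pattern of repeats described in the text; it is not a copy of Table 1.

```python
def unchanged_transitions(run_order):
    """Map each factor number to the (run, next run) pairs where its level repeats.

    Runs and factors are numbered from 1 to match the text.
    """
    n_factors = len(run_order[0])
    repeats = {factor + 1: [] for factor in range(n_factors)}
    for i in range(len(run_order) - 1):
        for factor in range(n_factors):
            if run_order[i][factor] == run_order[i + 1][factor]:
                repeats[factor + 1].append((i + 1, i + 2))
    return repeats


# Hypothetical run order for a replicated two-factor, two-level experiment:
# each tuple is (factor one level, factor two level), coded -1 / +1.
hypothetical_order = [
    (-1, -1), (-1, +1), (+1, +1), (-1, +1),
    (+1, -1), (+1, +1), (-1, -1), (+1, -1),
]

flags = unchanged_transitions(hypothetical_order)
for factor, pairs in flags.items():
    print(f"factor {factor}: level repeats between runs {pairs}")
```

Each flagged pair is a place where the factor must be deliberately reset between runs if the experiment is to be analyzed as completely randomized.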
If resetting a factor for each experimental unit will be too costly or difficult, then you should select a different design. Let’s say it was not practical to change the level of factor one for each run. You might choose a split-plot experiment to reduce the number of level changes required. For example, it might only be practical to change the level every second run. A factor that has restrictions on the number of changes is called a whole plot factor, and a factor that will be reset for each observation is called a subplot factor. In this case, a superior design is the one shown in Table 2.
The randomization for this experiment actually occurred in two separate instances:
- The order to run the four whole plots was randomized (for example, A, B, C, D or C, B, D, A).
- The order to run each of the two observations within each whole plot was randomized (1, 2 or 2, 1).
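The two randomization steps listed above can be sketched as follows. The layout is an assumption made for illustration: four whole plots labeled A to D, factor one held fixed within each whole plot, and factor two taking both of its levels within each whole plot. The function name and the level assignments are hypothetical and are not taken from Table 2.

```python
import random


def split_plot_order(whole_plots, subplot_levels, seed=None):
    """Return runs as (whole plot label, whole plot level, subplot level).

    Stage 1 randomizes the order of the whole plots.
    Stage 2 randomizes the subplot order separately inside each whole plot.
    """
    rng = random.Random(seed)
    labels = list(whole_plots)
    rng.shuffle(labels)  # stage 1: order of whole plots
    runs = []
    for label in labels:
        levels = list(subplot_levels)
        rng.shuffle(levels)  # stage 2: order within this whole plot
        for subplot_level in levels:
            runs.append((label, whole_plots[label], subplot_level))
    return runs


# Assumed layout: factor one (whole plot factor) level for each whole plot.
whole_plots = {"A": -1, "B": +1, "C": -1, "D": +1}
runs = split_plot_order(whole_plots, subplot_levels=(-1, +1), seed=3)

for run_number, (label, factor_one, factor_two) in enumerate(runs, start=1):
    print(run_number, label, factor_one, factor_two)
```

In this sketch factor one changes at most once every second run, which is the restriction the split-plot design is meant to accommodate.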
The analysis of this split-plot experiment will include two error terms: one for the variability associated with changing the whole plots and one for the variability associated with different observations [2, 3].
Randomization is an essential component of running a good statistically designed experiment. To choose the levels of factors to be examined, you need to understand the relationship being studied and the number of observations for each combination to examine. Randomization will remove subjectivity and bias as to which experimental units get which treatments after you have selected the proper combinations.
If you know of factors that might influence the response, you should account for them in the design whenever possible. And if you don’t reset the levels of each factor for each observation, your analysis should reflect this, and you should likely choose a split-plot design to improve the characteristics of the design.
With its inclusion in many statistical software packages, randomization has become much easier and more accessible. Just remember when and how to use it, and you’ll run even better experiments.
References
1. Jitendra Ganju and James M. Lucas, “Detecting Randomization Restrictions Caused by Factors,” Journal of Statistical Planning and Inference, 1999, pp. 129-140.
2. Jennifer D. Letsinger, Raymond H. Myers and Marvin Lentner, “Response Surface Methods for Bi-Randomization Structure,” Journal of Quality Technology, 1996, pp. 381-397.
3. G. Geoffrey Vining, Scott M. Kowalski and Douglas C. Montgomery, “Response Surface Designs Within a Split-Plot Structure,” Journal of Quality Technology, 2005, pp. 115-129.
CHRISTINE ANDERSON-COOK is a technical staff member of Los Alamos National Laboratory in Los Alamos, NM. She earned a doctorate in statistics from the University of Waterloo in Ontario. Anderson-Cook is a Senior Member of ASQ.