Practical vs. Statistical Priorities
Ramifications of randomization during data collection for designed experiments
by Christine M. Anderson-Cook and Lu Lu
Randomization during data collection in a designed experiment is an effective strategy for reducing system bias and ensuring independent observations,1 but at what cost should it be fully implemented? Recently, we encountered a scenario that highlights an important aspect of collaboration2 among statisticians, scientists and engineers—the idea of compromise for the sake of achieving wins for all team members.
In a multimillion-dollar experiment with a fixed amount of time on the test equipment, each experimental run involved resetting the input values and waiting for the continuous process to reach a new equilibrium. The temperature of the furnace could be reset, for example, but it took time for the test chamber to reach the new temperature and rest there until all components in the chamber settled at the new setting.
Similarly, a critical gas flow component also required a settling period before data could be collected that would be representative of that input setting value. In all, the experiment involved five factors, of which four were time-consuming and costly to adjust, with the time needed to reach a new equilibrium depending on how far the previous setting was from the new one across the combination of these four factors. For example, changing from 150° to 200° would reach equilibrium faster than changing from 100° to 200°.
In this experiment, a sequential design of experiments3 approach was used, with data from small batches of experimental runs being collected and analyzed. With updated model results from each batch, the next batch of data to be collected was chosen with the goal of reducing the uncertainty in the estimated responses across the input space of interest. In this experiment, for example, a week of runs were specified at a time, with time on the weekend used to update the model estimates and decide which runs to implement the following week.
So fundamental questions in designing each batch were not only which input combinations to run, but also the order in which to collect them, which determined the total time needed to complete all the runs. With an ineffective choice of run order, the number of runs that could be collected in a week of experimentation might be as little as half the number possible with the most efficient order. Figures 1 and 2 show a comparison of two run orders: Choosing adjacent runs as in Figure 1 allows seven data points to be collected, while the less advantageous run order from a randomized experiment in Figure 2 allows just five runs to be executed in the same length of time.
So how important should we make randomization, when there is such a high potential cost associated with run order? Recall from an earlier Statistics Spotlight column4 that the benefits of randomization are to offer protection from changing conditions not manipulated during the experiment, to encourage resetting of input factor levels to guard against systematic bias, and to provide a basis for statistical theory to judge the significance level of observed effects.
But for this experiment, these benefits came at the direct cost of getting less data. Imagine the conversation with the engineers: “Yes, we could get 12 observations this week, but instead, I would like to advocate for complete randomization of the run order, and that might yield between six and eight runs.” That would be a difficult negotiation given the high total cost of the experiment and a shortage of available data to model the five-dimensional space of interest.
We think it is important that statisticians are perceived not as gatekeepers who obstruct the goals of the experimenters, but as part of the team that helps facilitate a good solution. So are there some compromises that we can make when faced with restrictions on randomization that still allow us to plan for an acceptable statistical analysis, while honoring the priorities of our collaborators? Another way of thinking about this is: "Can some of the objectives of randomization be achieved through alternate approaches when there is a strong necessity for other considerations?" Split-plot designs5-7 often are an excellent option to allow for planned designed experiments of a fixed size that incorporate limited resetting of hard-to-change factors. In this case, however, four of the five factors were considered hard to change, and the total number of runs that could be implemented depended directly on which run order was selected.
The good news was that the experimenters had sufficient experience with the equipment and were able to predict fairly accurately how long it would take the runs to reach the new equilibrium given the starting and target settings. The form of this relationship was approximately proportional to the scaled distance between the two input locations after scaling each factor appropriately.
For example, a change of 20° in temperature would take about as much time to reach equilibrium as a 5 kg/hour change in flow rate. The subject matter experts therefore recommended the Euclidean distance in the scaled input space as a good approximation to the total time for reaching equilibrium between any two input locations. Hence, a change of both 20° and 5 kg/hour would take approximately √2 times as long because (total time)² = (time for temp change)² + (time for flow change)².
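This scaled-distance calculation can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual code; the factor names and scaling constants are hypothetical, chosen to match the 20° ≈ 5 kg/hour equivalence described above:

```python
import math

# Hypothetical scaling constants for illustration: a 20-degree temperature
# change takes about as long as a 5 kg/hour flow change, so each factor
# is divided by its own scale to put them on comparable footing.
SCALES = {"temp": 20.0, "flow": 5.0}

def settle_distance(a, b):
    """Scaled Euclidean distance between two settings -- a proxy for
    the time needed to reach the new equilibrium."""
    return math.sqrt(sum(((a[k] - b[k]) / SCALES[k]) ** 2 for k in SCALES))

# Changing both factors by one "scale unit" at once takes about
# sqrt(2) times as long as changing either one alone.
d = settle_distance({"temp": 150, "flow": 20}, {"temp": 170, "flow": 25})
```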
To find the best design for a given week, we constructed multiple possible designs of different sizes. First, we identified a set of experiment combinations for a particular design size that would be targeted for each batch to maximally improve the worst-case prediction precision, as measured by the width of the confidence interval of the surface at any input location. Next, we worked to find the ideal run order for each design size. A simple search exploring all possible run orders (n! for a design of size n) was constructed to find the one that minimized the total distance between sequential runs, and the time to run the experiment was compared to the available time at the test facility for that week. The size of the experiment was varied until we found the largest design that we could run in the week.
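The exhaustive run-order search described above can be sketched as follows. This is a minimal illustration rather than the authors' actual implementation; the distance function stands in for the scaled equilibrium-time estimate, and the toy one-factor example is hypothetical:

```python
from itertools import permutations

def total_time(order, dist):
    """Total transition time along an ordered sequence of runs."""
    return sum(dist(order[i], order[i + 1]) for i in range(len(order) - 1))

def best_run_order(runs, dist):
    """Brute-force search over all n! orderings for the one that
    minimizes total transition time (feasible only for small n)."""
    return min(permutations(runs), key=lambda order: total_time(order, dist))

# Toy example with a single factor and |difference| as the "time":
# visiting the settings in sorted order minimizes total travel.
order = best_run_order([0, 10, 2, 7], lambda a, b: abs(a - b))
```

Because n! grows quickly, a search like this only works for the small weekly batches described here; larger designs would need a heuristic ordering instead.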
This approach made the experimenters happy. They were maximizing the amount of data they were able to get each week. But how did we mitigate the fact that no randomization was performed? First, we wanted an alternative to randomization that helped guard against lurking factors that might change over the course of the experiment and could not be assumed fixed. Here, the statisticians negotiated some checks: a replicate in each batch in which the same input conditions were run at the beginning and end of the batch. This provided an assessment of the natural variability, as well as a limited ability to check for systematic changes.
Because the run path in each batch always returned to the starting run, the number of permutations that needed to be evaluated to determine the run order was reduced to (n-1)!. After the shortest path was identified, the location of the two replicate runs (at the beginning and end) was selected. The starting run was intentionally varied across batches to average out potential systematic bias.
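The closed-path variant can be sketched by fixing the starting run and permuting only the rest, which is where the (n-1)! count comes from. Again a hypothetical sketch with a toy one-factor distance, not production code:

```python
from itertools import permutations

def tour_time(path, dist):
    """Total transition time along a path whose last element is the
    return to the starting run."""
    return sum(dist(path[i], path[i + 1]) for i in range(len(path) - 1))

def best_closed_tour(runs, dist):
    """Fix the starting run and permute only the remaining n-1 runs,
    so only (n-1)! orderings need checking; the path closes on itself."""
    start, rest = runs[0], runs[1:]
    middle = min(permutations(rest),
                 key=lambda mid: tour_time((start,) + mid + (start,), dist))
    return (start,) + middle + (start,)

# Toy example: the best closed tour sweeps out and back, so its total
# time is twice the range of the settings.
path = best_closed_tour([0, 5, 2, 8], lambda a, b: abs(a - b))
```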
In addition, one replicate was added between batches. This provided some ability to check whether there were big differences between the weeks of the experiment. While limited, this was still a hard sell to the experimenters. Running the same set of input conditions twice meant that some other input combinations could not be explored. But given the concessions to optimize the run order for each batch, this was something that they accepted.
Second, randomization serves to protect against mis-set factor levels affecting multiple runs. One way to at least partially check for this is to monitor how often settings for each factor were held fixed between runs. For example, if runs one and two involved changing three of the five factors, it is helpful to note that the other two factors were not reset for that transition. Based on this information, we can perform some exploratory data analysis to see whether there are patterns in the residuals of the fitted model associated with which factors were not reset.
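The bookkeeping for this check is simple to automate. In this hypothetical sketch (factor names and settings are illustrative), each transition is tagged with the factors that carried over unchanged, which are exactly the places a mis-set level could silently persist:

```python
def unchanged_factors(prev, curr):
    """Factor names whose settings were not reset for this transition."""
    return {k for k in prev if prev[k] == curr[k]}

# Illustrative run settings for three of the runs in a batch.
runs = [
    {"temp": 150, "flow": 20, "pressure": 1.0},
    {"temp": 200, "flow": 20, "pressure": 1.0},
    {"temp": 200, "flow": 25, "pressure": 1.2},
]

# One entry per transition; residuals can then be grouped by these
# sets to look for patterns tied to factors that were never reset.
held = [unchanged_factors(runs[i], runs[i + 1]) for i in range(len(runs) - 1)]
```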
Finally, randomization serves as a justification for formal evaluation of the statistical significance of different effects. Hence, with our experiment, which did not include any randomization, we interpreted the results of hypothesis tests quite skeptically. We already had low power because of the relatively small overall sample size, and we were reluctant to treat formal model-fitting results as highly reliable. Given the lack of randomization, the p-values for each factor were not compared literally to a nominal cutoff such as 0.05. A few factors were highly significant (for example, p-values on the order of 10⁻⁶), and for these we were willing to believe the factor likely influenced our response.
So, what are some of the key lessons from this example?
- For some experiments, there can be substantial costs to implementing standard randomization. In general, statistical rigor often comes with increased cost in time, budget and effort. Taking cost into consideration and using quantitative measures for assessment can facilitate a more realistic solution.
- Part of being a good collaborator is balancing the statistically-ideal solution with some of the practical constraints and priorities of the experimenters.
- This balancing is made possible by understanding the fundamental objectives of key statistical priorities, such as randomization, blocking and replication. By understanding these objectives, it often is possible to devise ways of at least partially achieving them with alternative approaches.
Implementing a successful experiment or study almost always involves considering and balancing multiple objectives. In this case, getting additional data was an important priority to be valued in conjunction with statistical rigor.
- Christine M. Anderson-Cook, “At Random: The Rationale Behind Randomization, and Options When Up Against Constraints,” Quality Progress, March 2018, pp. 48-53.
- Christine M. Anderson-Cook, Lu Lu and Peter A. Parker, “Effective Interdisciplinary Collaboration Between Statisticians and Other Subject Matter Experts,” Quality Engineering, Dec. 27, 2018, https://tinyurl.com/cac-lu-park-qualtyeng.
- Fritz B. Soepya, et al., “Sequential Design of Experiments to Maximize Learning from Carbon Capture Pilot Scale Testing,” Proceedings of the 13th International Symposium on Process Systems Engineering, Mario R. Eden, Marianthi G. Ierapetritou and Gavin P. Towler, eds., Vol. 44, 2018, pp. 283-288.
- Anderson-Cook, “At Random: The Rationale Behind Randomization, and Options When Up Against Constraints,” see reference 1.
- Bradley Jones and Christopher J. Nachtsheim, “Split-Plot Designs: What, Why, and How,” Journal of Quality Technology, Vol. 41, No. 4, 2009, pp. 340-361.
- Peter Goos, The Optimal Design of Blocked and Split-Plot Experiments, Springer, 2012.
- Peter A. Parker, Christine M. Anderson-Cook, Timothy J. Robinson and Li Liang, “Robust Split-Plot Designs,” Quality and Reliability Engineering International, Vol. 24, No. 1, pp. 107-121.
Christine M. Anderson-Cook is a research scientist in the Statistical Sciences Group at Los Alamos National Laboratory in Los Alamos, NM. She earned a doctorate in statistics from the University of Waterloo in Ontario, Canada. Anderson-Cook is a fellow of ASQ and the American Statistical Association. She is the 2018 recipient of the ASQ Shewhart Medal.
Lu Lu is an assistant professor in the department of mathematics and statistics at the University of South Florida in Tampa. She was a postdoctoral research associate in the statistical sciences group at Los Alamos National Laboratory. She earned a doctorate in statistics from Iowa State University in Ames, IA.