Best Bang for Your Buck-Part 2
Choosing between different-sized designs
by Christine M. Anderson-Cook and Lu Lu
In last month’s Statistics Roundtable,1 we considered four potential designs of varying sizes and a variety of their attributes for an engineering problem involving a screening design with seven factors. The engineer’s original request was to find a design with 14 runs that could estimate all of the main effects, as well as have some potential for finding important two-factor interactions and possibly curvature in the underlying surface.
Our assessment was that the proposed size of design was potentially a bit too ambitious for the experiment’s goals, so we explored three other competing designs that had slightly larger sizes but were able to improve several aspects of the design. The four designs were constructed in JMP software2 as:
- 14r: A 14-run D-optimal design.
- 15rCR: A 15-run design (consisting of the 14-run design above with one center point added).
- 15DSD: A 15-run definitive screening design.3
- 16r: A 16-run D-optimal design.
In the earlier column, we compared the designs based on standard design optimality criteria (D-, A-, I- and G-optimality), the power to detect statistically significant effects, the correlation structure, the precision for predicting new observations and the ability to assess curvature. Table 1 summarizes the results from the comparisons, with colors highlighting the relative ranks for the various designs.
In reviewing the table, it’s clear that no design is the universal winner for all categories: The 16-run design (16r) has the most top ranks, but also has the highest associated cost. Not only are the rankings of the designs within a category important, but it also is beneficial to look at the range of results to see whether the observed differences are likely to be of practical importance.
For example, the standard errors for the main effects range in size from 0.250s to 0.314s, in which s is the standard deviation that describes the natural variability of the underlying relationship. This translates into a confidence interval for any of the main effect parameters that is about 25% wider for the 15-run definitive screening design (15DSD) relative to the 16r. The cost of the 16r is about 14% higher than the 14-run design (14r). Clearly, the range of differences and their practical impacts for different criteria varies.
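The percentages quoted above can be checked with a short sketch. The standard errors and run counts come from the text. The function and variable names are illustrative, and the sketch rests on two assumptions: that cost scales linearly with the number of runs, and that the same critical value is used for both intervals, so interval width is proportional to the standard error.

```python
# Standard errors of the main effects, in units of s (values quoted in the text).
SE_16R = 0.250
SE_15DSD = 0.314

# Run counts for the smallest and largest designs.
RUNS_14R = 14
RUNS_16R = 16


def percent_increase(new: float, baseline: float) -> float:
    """Return the percentage by which `new` exceeds `baseline`."""
    return 100.0 * (new / baseline - 1.0)


# Assumption: with a common critical value, the confidence interval width
# is proportional to the standard error, so the ratio of standard errors
# gives the ratio of interval widths.
ci_widening = percent_increase(SE_15DSD, SE_16R)

# Assumption: cost is proportional to the number of runs.
cost_increase = percent_increase(RUNS_16R, RUNS_14R)

print(f"CI width, 15DSD vs. 16r: {ci_widening:.1f}% wider")
print(f"Cost, 16r vs. 14r: {cost_increase:.1f}% higher")
```

Under these assumptions, the sketch gives roughly 25.6% and 14.3%, matching the rounded figures above. In practice the designs leave different degrees of freedom for estimating error, so the exact interval widths would also depend on the critical values used.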
Because there is no universal winner, what we value most should drive which of the designs makes the most sense for the particular priorities in the experiment. Table 2 describes scenarios in which each of the designs might be the best available choice for the experimenter:
- The 14r is the cost-focused choice, in which we can now quantitatively see what we are sacrificing in terms of power, prediction variance, correlation structure and ability to assess curvature.
- The 15-run design (15rCR, the 14r with one center run added) has the primary advantage of being able to at least see whether curvature might be an issue.
- The 15DSD has ideal correlation structure between main and two-factor interactions, as well as the ability to separately estimate curvature. It sacrifices power and prediction variance, however, to achieve these advantages.
- The 16r has the ideal correlation structure for estimating the main effects, the best power and prediction variance throughout the design space, with its only weaknesses being its higher cost and inability to see whether curvature is present.
Careful consideration should be given to what we think is likely for the underlying relationship between the response (or responses) and the seven input factors. If we think curvature has a reasonable likelihood of being important, the 15DSD becomes a more suitable choice. If we think that some of the effects we are trying to detect are likely to be small, the 16r rises in our preferences because it offers the most power to detect those effects.
Hence, the choice that emerges as the best fit for our experiment should be based on detailed discussions of different scenarios and their trade-offs. It might be that we are concerned about both curvature and sufficient power for smaller effects, in which case a more difficult choice must be made for the one experiment that can be run. The 15rCR might be a compromise solution: it allows an informal test of potential curvature without sacrificing too much power for detecting small main effects relative to the 16r.
Presenting your case
When we talked through the different options with the engineer, the larger 16r appeared to be the closest match for his needs. Having better power to evaluate the various effects was seen as more important in this first screening experiment. That’s because once a smaller subset of the most important factors had been selected, there would likely be opportunities to explore curvature in subsequent experiments.
So, the next step for the engineer was to present a compelling argument to his boss explaining reasons to change the budget for the screening experiment. This is often where science (what is the best thing to do) and logistics (cost and time constraints that make it difficult to do what is best) come into conflict.
In many business settings, it is common to negotiate for additional resources to improve a product, process or study. The following strategies have worked well for us in making this argument compellingly and increasing the likelihood of an improved outcome:
- Begin with a discussion about how the original budget or experiment size was selected. Often, if the choice was made arbitrarily with no firm basis, this can provide a starting point for reconsidering the choice, now in the presence of more information.
- Have data on several choices that are larger than the original proposed budget. In talking through the pros and cons of various designs, there can be subtle pressure in looking at multiple alternatives that are all better than the original design. We used this approach with our example when we considered three designs that are all slightly larger than the first proposed 14-run design.
- Discuss multiple aspects of design performance to make the case. Our consideration of power, correlation and prediction variance adds weight to the case for the larger design because it performs better in many aspects.
- Graphical summaries can be powerful and effective for conveying information and messages in an intuitive and compact way. In addition to graphics, we should make as many summaries of the different designs as quantitative as possible. For example, it helps not only to see the color map of the correlation (refer to last month’s column for the plots) with its improved structure for the 16r over the 14r, but it’s also helpful to have numbers that back up those differences, as shown in Table 1, to help consolidate that advantage.
- Create a table similar to Table 2 that outlines how to think about the different strengths and weaknesses in the larger context. If the advantages for the smallest design consist only of the "cheapest cost with sacrifices in X, Y and Z," this makes clear what is being sacrificed for the sake of economy.
- Create a table similar to Table 1 that provides all quantitative summaries for different aspects of a decision and associates individual options with their relative performance by ranking and numerical quantities. It is always helpful and convincing to back your opinion or suggestions with facts when negotiating with decision makers.
- In the case of experiments that try to extract too much from a small sample size, disproportionate gains are often possible. In the current experiment, for example, we are able to obtain a substantial improvement in the correlation structure by adding two runs to the 14r, and the difference between the 15DSD and 16r in terms of power was a 25% improvement for a 7% increase in budget (16/15 = 1.07).
- Experiments are often run with the aim of collecting data on multiple responses. A larger design can offer robustness of performance to different underlying relationships for the different responses. For example, it might be that the first response has a number of smaller main effects that the additional power of the 16r design can help identify. The second response might have several two-factor interactions, so having a better correlation structure available can help distinguish between the contributing effects. The fact that we are aiming to have the design perform well across several different responses can make a strong case for building in some robustness.
- Talk about the trade-offs between larger and smaller designs in the context of sequential experimentation. Pointing out that later experiments in the study of a given product or process can leverage understanding gained at the early stages makes a compelling argument for investing in a good experiment that sets up future investigations well. There is often a substantial cost to missing an important effect early in the screening stages because that factor might not be explored or considered later. Hence, having adequate power to identify key effects can have an important impact on later stages.
- When additional resources are obtained for a larger experiment, use this opportunity to document the benefits gained. If the new larger design allowed for exploration of curvature and it was discovered that curvature was important, for example, make sure this story becomes part of the discussions in the next negotiation—that is, say something such as, "Remember when we expanded the experiment on X to consider curvature … well, that led to this beneficial outcome." Similarly, it is helpful to look for stories to illustrate the opposite scenario—that is, "Remember when we ran a very small first experiment and we missed Y … Well, that led to a costly additional study."
Even after trying any or all of these strategies, there have been many times when our attempts to obtain additional resources have failed. It can be consoling to realize that the best quantitative case was made for justifying the larger experiment, and that the decision maker was well aware of the choices that had been made and their associated trade-offs. The good news is that it is sometimes possible to prevail, and an improved experiment can be run.
1. Christine M. Anderson-Cook and Lu Lu, "Best Bang for Your Buck-Part 1," Quality Progress, October 2016, pp. 45-47.
2. JMP, version 13, SAS Institute Inc., 2016.
3. Bradley Jones and Christopher J. Nachtsheim, "A Class of Three-Level Designs for Definitive Screening in the Presence of Second-Order Effects," Journal of Quality Technology, 2011, Vol. 43, No. 1, pp. 1-15.
Christine M. Anderson-Cook is a research scientist in the Statistical Sciences Group at Los Alamos National Laboratory in Los Alamos, NM. She earned a doctorate in statistics from the University of Waterloo in Ontario. Anderson-Cook is a fellow of ASQ and the American Statistical Association.
Lu Lu is an assistant professor in the department of mathematics and statistics at the University of South Florida in Tampa. She was a postdoctoral research associate in the statistical sciences group at Los Alamos National Laboratory. She earned a doctorate in statistics from Iowa State University in Ames, IA.