A Matching Pair
Connecting data collection to the planned analysis
by Christine M. Anderson-Cook
As a student in my very first design of experiments (DoE) class, one of the key messages I learned was that data to be collected needed to match what the planned analysis required.
In hindsight, this sounds quite obvious: Before you go shopping, you figure out what you need from the item to be purchased. Before you lead or attend a meeting, you have a plan for what a good outcome would be so that you can effectively steer things in appropriate directions.
This simple guidance is sometimes ignored, however, leading to wasted resources and the inability to extract the information required to answer key questions. Perhaps a few simple examples would help illustrate how this matching process should be incorporated into data collection strategies:
- Sampling: A common mistake in survey sampling is to collect data from an incomplete or wrong population. Say the goal of a survey is to understand customer satisfaction for a new service offering. If the data are compiled based on only those customers who complain about problems (or alternatively, only those who return for the service a second time), it would be easy to have a distorted impression of the true customer satisfaction level.
- Interactions: In DoE, we often are interested in the main effects of changes in the inputs on the response, as well as in interactions. An interaction captures how the amount of change in the response as one input factor moves across its levels depends on the level at which another input factor is set. Figure 1(a) shows an example of an interaction in which the response increases as X1 moves from low to high if X2 is at a low level, but decreases if X2 is at a high level. A one-factor-at-a-time design, such as the one shown in Figure 1(b), is unable to estimate any interaction between X1 and X2. After all, you cannot estimate two separate lines with only three points. The factorial design shown in Figure 1(c), however, can estimate the interaction.
- Curvature: In another DoE scenario, we may be interested in estimating or testing for curvature in the response as the levels of an input factor change. In this case, it is essential to use more than two levels of the input. Figure 2(a) shows how a two-level design (shown with blue dots) cannot distinguish between the two curves shown. If a third observation is added at an intermediate level of the input (red dot), however, it becomes possible to estimate the curvature. If the relationship is more complicated (as with the dashed lines in Figure 2(b)), additional levels of the input might be needed to distinguish between alternatives. The key message here is that the number of levels of the input must reflect how complicated a relationship is anticipated between the input and the response.
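The estimability argument in the interactions example above can be checked numerically. The following is a minimal numpy sketch (the coded levels and function names are my own, not from the article): it builds the model matrix for a model with an intercept, two main effects and the X1X2 interaction, then compares the matrix rank for a three-run one-factor-at-a-time design against a 2² factorial.

```python
import numpy as np

# Model matrix columns: intercept, X1, X2, X1*X2 (factors coded as -1/+1)
def model_matrix(design):
    x1, x2 = design[:, 0], design[:, 1]
    return np.column_stack([np.ones(len(design)), x1, x2, x1 * x2])

# Three-run one-factor-at-a-time design (as in Figure 1(b))
ofat = np.array([[-1, -1], [1, -1], [-1, 1]])

# 2^2 factorial design (as in Figure 1(c))
factorial = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1]])

for name, d in [("OFAT", ofat), ("factorial", factorial)]:
    X = model_matrix(d)
    rank = np.linalg.matrix_rank(X)
    print(f"{name}: rank {rank} for {X.shape[1]} parameters")
```

The one-factor-at-a-time matrix has rank 3 for four parameters, so the interaction coefficient is not estimable; the factorial matrix has full rank, so all four parameters can be estimated.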
Two approaches to consider
In DoE, there are two main approaches that can help achieve the match between data collection and analysis and ensure that the experimenter can answer the questions of the study: model-based design and design-based analysis. The former focuses on what the analysis will look like when data are available and creates a design that meets the analysis's needs. The latter starts from a design structure with the best available properties and develops tailored analysis approaches, which may be computationally more demanding.
Model-based design: When I was taught DoE, my training focused on model-based design, in which the experimenter determines the most complicated model that must be estimated and creates a design that can support estimation of all of that model's parameters. If we want to be able to estimate the interaction between X1 and X2 (as discussed earlier), for example, we would choose the model:
Y = β0 + β1X1 + β2X2 + β12X1X2 + ε
and make sure that the design we select can adequately estimate all the parameters: β0, β1, β2, β12. Similarly, for estimating curvature, we would select a model with suitable complexity, such as:
Y = β0 + β1X1 + β11X1² + ε
or
Y = β0 + β1X1 + β11X1² + β111X1³ + ε
and construct and evaluate designs based on their ability to estimate the parameters in the chosen model. In addition to estimating the model you think is most likely, it also is helpful to be able to test assumptions.1 If you think a relationship is likely to be well-modeled with a straight line, for example, it might be a good strategy to include an additional level of the input so you can test that assumption and see whether there is in fact any curvature.
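This parameter-counting logic can be made concrete with a small numpy sketch (a hypothetical illustration; the designs and names are my own). For the quadratic model above, the check is whether the design's model matrix has full column rank: with only two levels of the input, the X1² column is indistinguishable from the intercept, while adding a center point separates them.

```python
import numpy as np

# Model matrix for the quadratic model Y = b0 + b1*X1 + b11*X1^2 + error
def quad_matrix(x):
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, x**2])

two_level = [-1, 1, -1, 1]    # only two levels of the input
three_level = [-1, 0, 1, 0]   # adds an intermediate (center) level

for name, x in [("two-level", two_level), ("three-level", three_level)]:
    X = quad_matrix(x)
    ok = np.linalg.matrix_rank(X) == X.shape[1]
    print(f"{name} design can estimate curvature: {ok}")
```

For the two-level design, X1² equals 1 at every run, so its column duplicates the intercept and the curvature parameter cannot be estimated; the three-level design supports all three parameters.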
Design-based analysis: The second emerging approach to achieving the desired match between data and analysis—design-based analysis—focuses on using a specialized class of designs with exceptional properties, and adapting the planned analysis to leverage those advantageous properties of the design.
Two examples of this approach are definitive screening designs2,3 and group-orthogonal supersaturated designs.4 In these cases, the experimenter selects the design whose strengths match the objectives of his or her experiment.
For example, definitive screening designs are compact designs that can estimate main effects, interactions and some quadratic effects, as long as there is some effect sparsity.5
Group-orthogonal supersaturated designs allow for exploration of a large number of potential input factors, but again rely on not all of the effects being active to work well.
The advantage of design-based analysis is that the design structure is predetermined, and the properties of the design are known to be very good for the goals it is built to address.
So, if you have a problem that matches the strengths of these designs, the design with the tailored analysis (generally available in statistical software) can be an efficient way to gain understanding of the results.
Regardless of whether you opt for model-based design or design-based analysis as the solution, it’s important to think about what the goals of the data collection are and how the data that will be obtained will help answer the questions that come from those goals.
In addition to being able to get some answer to the question, it also is important to think about the power6,7 of the design: not only whether it can estimate the model parameters, but whether it can estimate them with adequate precision for the needs of the study.
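Power can also be approximated by simulation. The sketch below is a hypothetical Monte Carlo illustration (not from the cited columns): assuming a known noise standard deviation so a simple z-test can be used, it estimates the power to detect an interaction of a given size, and shows how replicating a 2² factorial raises that power.

```python
import numpy as np

rng = np.random.default_rng(42)

# 2^2 factorial design in coded -1/+1 units
base = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1]])

def interaction_power(n_reps, beta12, sigma=1.0, n_sim=4000, z_crit=1.96):
    """Monte Carlo estimate of the power to detect the X1*X2 interaction.

    Assumes sigma is known, so a z-test with critical value z_crit applies.
    """
    d = np.tile(base, (n_reps, 1))
    X = np.column_stack([np.ones(len(d)), d[:, 0], d[:, 1], d[:, 0] * d[:, 1]])
    n = len(d)
    hits = 0
    for _ in range(n_sim):
        # Simulate data in which only the interaction effect is active
        y = beta12 * X[:, 3] + rng.normal(0.0, sigma, n)
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        se = sigma / np.sqrt(n)  # orthogonal design: var(b12) = sigma^2 / n
        if abs(b[3] / se) > z_crit:
            hits += 1
    return hits / n_sim

p1 = interaction_power(1, beta12=1.0)
p3 = interaction_power(3, beta12=1.0)
print(f"power with 1 replicate : {p1:.2f}")
print(f"power with 3 replicates: {p3:.2f}")
```

Comparing such estimates across candidate designs and run sizes is one way to judge whether the planned data collection will deliver the precision the study needs.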
A carefully designed data collection plan helps ensure that the subsequent analysis can satisfy the needs of the study and makes productive use of resources.
- Christine M. Anderson-Cook, “A Matter of Trust,” Quality Progress, March 2010, pp. 56-58.
- Bradley Jones and Christopher J. Nachtsheim, “A Class of Three-Level Designs for Definitive Screening in the Presence of Second-Order Effects,” Journal of Quality Technology, Vol. 43, No. 1, 2011, pp. 1-15.
- Bradley Jones and Christopher J. Nachtsheim, “Effective Design-Based Model Selection for Definitive Screening Designs,” Technometrics, Vol. 59, No. 3, 2017, pp. 319-329.
- Bradley Jones, Ryan Lekivetz, Dibyen Majumdar, Christopher J. Nachtsheim and Jonathan W. Stallrich, “Construction, Properties and Analysis of Group-Orthogonal Supersaturated Designs,” Technometrics, Sept. 17, 2019, https://tinyurl.com/jones-techno-designs.
- C.F. Jeff Wu and Michael S. Hamada, Experiments: Planning, Analysis and Optimization, Wiley, 2009, p. 173.
- Christine M. Anderson-Cook and Lu Lu, “Best Bang for Your Buck—Part 2,” Quality Progress, November 2016, pp. 50-52.
- Christine M. Anderson-Cook and Lu Lu, “Best Bang for Your Buck—Part 1,” Quality Progress, October 2016, pp. 45-48.
Christine M. Anderson-Cook is a research scientist in the Statistical Sciences Group at Los Alamos National Laboratory in Los Alamos, New Mexico. She earned a doctorate in statistics from the University of Waterloo in Ontario, Canada. Anderson-Cook is a fellow of ASQ and the American Statistical Association. She is the 2018 recipient of the ASQ Shewhart Medal.