 3.4 PER MILLION

# Test Drives and Data Splits

## How techniques like deleting one point at a time can test prediction model generality

by Joseph D. Conklin

Prediction models are one of a Six Sigma practitioner's best friends for improving processes. The more complicated and persistent the quality problem, the more useful prediction models can be.

By prediction model, I mean any statistically derived relationship between the measures of the problem variable (the dependent variable) and the key potential drivers (independent variables).1

In general, a model is any simplified but sufficiently detailed representation of an existing process that allows its normal workings to be more easily visualized and analyzed. By better understanding how the process behaves, potential drivers for improvement are more easily identified.

A model might succeed in identifying good drivers but fail to say precisely how they should be arranged and manipulated to improve the process.

For this purpose, we need something with the potential to predict what the process can achieve in the future. This is the role of prediction models.

The most direct route to a prediction model is a designed experiment.2 The analysis following a designed experiment normally leads to an equation with the dependent variable on the left and the independent variables on the right.3

The equation is the form of the prediction model. To generate predictions, simply substitute values for the independent variables and see what comes out of the equation for the dependent variable.

The result will be a prediction in the broadest sense if the values for the independent variables reflect settings at which the process has not been run in the past. If the past settings have not proven adequate, don't be surprised if improvement occurs beyond them.

### Taking it for a test drive

Before acting on these predictions, however, the Six Sigma practitioner should test drive the model to see how well it generalizes. Some prediction models work well when restricted to the values for the independent variables included in the designed experiment but fail miserably otherwise.

A model of this character is too restrictive to be practical. A model with generality can accept values somewhat outside the settings of the designed experiment and predict with reasonable accuracy.4 Models of this character point to the largest number of options for improving the process.5

The best way to test drive a prediction model is to run the process at settings other than those of the original experiment and compare the results with what the model predicts should happen. When the predicted and actual results agree throughout many repeated attempts of this nature, we can conclude the prediction model generalizes well.

When time or resources don't favor additional runs of the process for testing, the next best method is to test drive the prediction model with the data from the original designed experiment, using data-splitting techniques.

In data splitting, some portion of the data from the designed experiment is set aside. The model is built on the remaining portion of the data. Once available, the model is used to predict the outcomes for the portion set aside. Again, the sign of generality is the agreement of the predicted results with the actual ones.

This approach comes at a cost. When written as an equation, the keys to the prediction model are the coefficients, the values we multiply by the values of the independent variables to come up with a prediction. The less data available to build the model, the less certain we are about the precise values of the coefficients. Depending on the particular application we are involved with, however, the cost might be minimal.

### How much is enough?

When considering data splitting, the natural question is how much data to set aside. The answer partly depends on how the outputs of the process vary under changing conditions. It also depends on the type of prediction model we are trying to fit.6

As a useful introduction to the concept of data splitting, we will concentrate on the "one point at a time" deletion method.7 In this method, we start with a data set consisting of n available observations.

Suppose we have built a prediction model using all n observations, and we want to see if it generalizes well. We start by setting aside the first observation and estimating a prediction model based on observations 2 to n. Then we use the model to predict the value of the dependent variable we measured at observation one.

After this is done, we resume the process by bringing observation one back in, setting aside observation two, and repeating as before. We continue until all the observations have been set aside. In the process, we will have constructed n possible prediction models each based on n-1 observations. As a result, we have a prediction for each of the dependent variables observed in our data set.
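The deletion loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's software: the function names `fit_line` and `leave_one_out_predictions` are hypothetical, the model is assumed to be a simple linear regression, and the data are the first 10 observations of Table 1 as listed in the example-calculation sidebars.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def leave_one_out_predictions(xs, ys):
    """Predict each y from a line fitted to the other n - 1 observations."""
    predictions = []
    for i in range(len(xs)):
        # Set aside observation i and fit the model on the rest
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(rest_x, rest_y)
        predictions.append(intercept + slope * xs[i])
    return predictions


# First 10 observations of Table 1 (taken from the sidebar tables)
X = [56.9, 37.1, 66.2, 90.1, 57.5, 54.5, 57.9, 53.1, 12.2, 9.9]
Y = [226.5, 174.9, 297.4, 389.0, 233.4, 218.1, 231.4, 236.8, 54.5, 39.5]

loo = leave_one_out_predictions(X, Y)
for obs, (actual, predicted) in enumerate(zip(Y, loo), start=1):
    print(f"obs {obs}: actual {actual:.1f}, predicted {predicted:.4f}")
```

With these 10 points, the prediction for observation 1 should land close to the 243.2058 worked out by hand in the sidebars; small differences in the last decimal places come from rounding in the hand calculation.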

### Confidence interval

Associated with each predicted value is a confidence interval. A confidence interval spans the entire range of values that, based on the available data, can be considered reasonable predictions for a given observation. The statistical formulas for these intervals incorporate how much uncertainty we are willing to tolerate in the prediction. Increased certainty comes at the price of a wider interval: the range of values we are forced to consider reasonable increases.

A simple test for generality is to check the confidence interval for the predicted value of each observation in our data set. A minimum condition for generality is for the confidence interval to include the actual value we observed in the designed experiment. The more observations for which this fails to be true, the more evidence we have against generality.
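As a rough sketch of this check, the comparison can be automated once the actual values and confidence limits are in hand. The function name `observations_outside_interval` is hypothetical. The first row below uses the limits worked out for observation 1 in the example calculations; the second row is a made-up value included only to show what a failure looks like.

```python
def observations_outside_interval(actual, lower, upper):
    """Return 1-based observation numbers whose actual value misses its interval."""
    misses = []
    for number, (y, lo, hi) in enumerate(zip(actual, lower, upper), start=1):
        if not (lo <= y <= hi):
            misses.append(number)
    return misses


# Row 1: actual Y and limits for observation 1 from the worked example.
# Row 2: hypothetical actual value, chosen to fall outside the same limits.
actual = [226.5, 280.0]
lower = [215.2064, 215.2064]
upper = [271.2052, 271.2052]

print(observations_outside_interval(actual, lower, upper))  # [2]
```

An empty list means every observation passed the minimum test; the longer the list, the stronger the evidence against generality.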

To illustrate this check, consider the data set in Table 1.8 Suppose our process involves metal forming, and the amount of a key trace element affects the strength of the finished product. We set up a simple designed experiment consisting of 36 batches with varying amounts of the trace element.9 We sample each batch and measure how much force is required to break the metal. The amount of the trace element is the independent variable X. The breaking force is the dependent variable Y. We would like to build a model that predicts the breaking strength as a function of how much of the trace element we use. Possessing such a model enables us to better tailor the metal forming process to individual customer requirements for the finished product.

For this example, a simple linear regression equation of Y on X provides our prediction model.10 Using all 36 data points, the equation is

Y = 9.29162 + 4.02197*X.

The calculations behind this equation are shown below. The calculations use the first 10 data points from Table 1 for illustration.
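Generating a prediction from the fitted equation is a matter of substitution. The short sketch below evaluates the 36-point equation at a single setting; the function name `predict_breaking_force` and the setting X = 50 are illustrative assumptions, not values from the experiment.

```python
def predict_breaking_force(x):
    """Evaluate the article's fitted line Y = 9.29162 + 4.02197 * X."""
    intercept = 9.29162  # coefficients quoted in the article (all 36 points)
    slope = 4.02197
    return intercept + slope * x


# X = 50.0 is an arbitrary coded amount of the trace element
print(f"{predict_breaking_force(50.0):.2f}")  # 210.39
```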

Table 2 repeats the values of X and Y and adds the results for the prediction model and confidence limits. The lower and upper confidence limits form the boundaries of the confidence interval. The calculations for the prediction model and confidence limits are illustrated for the first 10 observations of Table 1 in "Example Calculations," found at www.qualityprogress.com. Note that the actual values for Y are inside their respective confidence intervals for all 36 observations. The prediction model has passed a minimum test for generality, so the model builder can move on to more complex tests.11

### When model fails minimum test

To motivate discussion of what to do when this minimum test is not passed, suppose the Y values were instead as shown in Table 3. Once again, the predicted values of Y and the confidence limits are provided. The predicted values were obtained the same way as in Table 2. A given observation is set aside, a simple linear regression is fit to the remaining observations, and the resulting regression is used to predict the value of Y for the observation set aside. For the new data set, we see four observations (numbers 1, 5, 17 and 24) in which the actual value of Y falls outside the confidence interval for the predicted value. This gives us a reason to question the generality of the prediction model.

Proceeding immediately to use the prediction model as it stands is not recommended. Instead, the Six Sigma practitioner can consider several strategies in deciding what to do next:

1. Review the records of the designed experiment for the runs in question. Using the available information, verify whether any unusual conditions might have led to unusual values for the dependent variable.
2. Analyze the residuals for all the runs in the designed experiment. Look for patterns suggesting that an important variable has been left out or estimated inadequately.12
3. Repeat the designed experiment at the runs in question (resources permitting). Verify whether the initial outputs repeat and should be considered typical for the process under these conditions.
4. Re-estimate the prediction model using robust regression techniques. Robust techniques are more resistant to unusual or outlying observations. See if the confidence intervals based on the robust model still lie outside the observed values of the dependent variable for the runs in question.13
5. Extend the designed experiment by adding runs that permit estimation of a more complex prediction model.14
6. The runs in question might be a sign that a single model is not sufficient over the entire set of process conditions that interest us. The data set from the designed experiment (or, alternatively, a sufficiently large historical data set of process values) might lend itself to cluster analysis.15 Cluster analysis seeks to divide a data set into a set of homogeneous parts.16 If such an analysis is successful, each part becomes a candidate for fitting a prediction model uniquely suited to it.

Once the Six Sigma practitioner obtains a prediction model with reasonable generality, the next step in process improvement becomes easier to take. Of the many possible paths to the future, the ones with the most promise should be easier to discern.

### References and Notes

1. For a helpful explanation about building and using models in the physical sciences, see Applied Regression Analysis, third edition, by Norman R. Draper and Harry Smith (John Wiley, 1998). Prediction models also have a role when addressing quality problems heavily involving human factors. Models for these problems tend to draw on techniques developed in the social sciences. A flavor for some of these can be found in David W. Hosmer and Stanley Lemeshow's Applied Logistic Regression, second edition (John Wiley, 2000).
2. Designed experiments are beyond the scope of this article. A useful introductory reference is Douglas C. Montgomery's Design and Analysis of Experiments, sixth edition (John Wiley, 2005). As traditionally understood in the field of quality, designed experiments imply physical manipulation of people, material, equipment and environmental conditions in the location where the process actually operates. Advances in computer technology in recent decades make it more and more practical to construct digital representations of a process that can be manipulated by software to derive prediction models. The term "simulation," depending on the context, might refer only to the construction part or to both the construction and manipulation together.
3. A single equation might be insufficient for a particular prediction model. Mathematical constructs involving vectors and matrices can be used to express prediction models engaging more than one dependent variable simultaneously. This article focuses on the single equation case to introduce the important concepts as simply as possible.
4. This does not mean the Six Sigma practitioner can expect to find a prediction model that works well at any arbitrarily large distance from the original settings of the designed experiment. After moving out so far, any model can be expected to fail. If it is necessary to understand how the process works in a region where a prediction model breaks down, it might be time to run a new designed experiment in that region.
5. I use the term "generality" for models with some predictive flexibility beyond the confines of the original designed experiment. This is not the same concept as "robustness." While distinct concepts, the two can coexist in the same model. Robustness refers to the ability to depart from the traditional statistical assumptions for a designed experiment and still yield useful information, either at the original settings of the designed experiment or somewhat beyond them.
6. The class of data-splitting techniques called cross-validation deals with these questions in detail. For a more detailed discussion of cross validation, refer to section 4.2 of Raymond H. Myers' Classical and Modern Regression With Applications, second edition (Duxbury, 1990).
7. This is a simple data-splitting technique with wide applicability, but it is not necessarily the first or best one to use every time a prediction model is assessed for generality.
8. The data in Table 1 are coded.
9. This is an example of a one-factor designed experiment. For more information on these and other types of designed experiments, see Montgomery, Design and Analysis of Experiments, reference 2.
10. For more information on simple linear and other forms of regression, see Myers, Classical and Modern Regression With Applications, reference 6.
11. These include the cross-validation techniques discussed in section 4.2 of Myers, Classical and Modern Regression With Applications, reference 6. As a general rule, the Six Sigma practitioner should err on the side of more generality tests.
12. The residuals are the differences between the actual values of the dependent variable and the values predicted from the model. For a detailed discussion of residual analysis, see Draper and Smith, Applied Regression Analysis, reference 1.
13. For more information about robust regression, see Peter J. Rousseeuw and Annick M. Leroy's Robust Regression and Outlier Detection (John Wiley, 1987).
14. For examples of building up a designed experiment to estimate more complex models, see Raymond H. Myers and Douglas C. Montgomery's Response Surface Methodology: Process and Product Optimization Using Designed Experiments, second edition (John Wiley, 2002).
15. Using cluster analysis or any statistical method on historical data that has not been gathered under a well-maintained regimen of controls is not a fruitful path to enlightenment. Before using historical data, the Six Sigma practitioner should ask whether the basics of sound process management (stable measuring systems, specifications reflecting process capability, standard operating procedures and thorough operator training) are in place.
16. For an introduction to cluster analysis, see Leonard Kaufman and Peter J. Rousseeuw's Finding Groups in Data: An Introduction to Cluster Analysis (John Wiley, 1990).

## Example Calculations-Part I

Example Calculations for Prediction Model Using First 10 Observations From Table 1

A simple linear regression equation has the form Y = mX + b. Table A illustrates the calculations for m and b using the first 10 observations for X and Y from Table 1.

We need the corrected sum of squares for X (CSS-X), where (1) through (5) are the column sums in Table A and n is the number of observations:

$$
\begin{aligned}
\text{CSS-X} &= \frac{(3) - (1)^2 / n}{n - 1} = \frac{29{,}809.84 - 495.4^2 / 10}{10 - 1} = \frac{29{,}809.84 - 245{,}421.16 / 10}{9} \\
&= \frac{29{,}809.84 - 24{,}542.116}{9} = \frac{5{,}267.724}{9} = 585.3027
\end{aligned}
$$

We need the corrected sum of cross products for X and Y (CSCP-XY):

$$
\begin{aligned}
\text{CSCP-XY} &= \frac{(5) - (1)(2) / n}{n - 1} = \frac{126{,}448.46 - (495.4)(2{,}101.5) / 10}{10 - 1} = \frac{126{,}448.46 - 1{,}041{,}083.1 / 10}{9} \\
&= \frac{126{,}448.46 - 104{,}108.31}{9} = \frac{22{,}340.15}{9} = 2{,}482.239
\end{aligned}
$$

The formula for m is

$$
m = \frac{\text{CSCP-XY}}{\text{CSS-X}} = \frac{2{,}482.239}{585.3027} = 4.24095
$$

The formula for b is

$$
b = \frac{(2)}{n} - m\,\frac{(1)}{n} = \frac{2{,}101.5}{10} - (4.24095)\,\frac{495.4}{10} = 210.15 - (4.24095)(49.54) = 210.15 - 210.09666 = 0.05334
$$

Equation is Y = 0.05334 + 4.24095*X
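The same arithmetic can be reproduced in a short script. This is a sketch that mirrors the sidebar's steps; the variable names (`css_x`, `cscp_xy` and so on) are assumptions chosen to match the labels used above.

```python
# First 10 observations of Table 1, as listed in Table A
X = [56.9, 37.1, 66.2, 90.1, 57.5, 54.5, 57.9, 53.1, 12.2, 9.9]
Y = [226.5, 174.9, 297.4, 389.0, 233.4, 218.1, 231.4, 236.8, 54.5, 39.5]

n = len(X)
sum_x = sum(X)                               # quantity (1)
sum_y = sum(Y)                               # quantity (2)
sum_x2 = sum(x * x for x in X)               # quantity (3)
sum_xy = sum(x * y for x, y in zip(X, Y))    # quantity (5)

# Corrected sum of squares and cross products, each divided by n - 1
css_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)
cscp_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)

m = cscp_xy / css_x                  # slope
b = sum_y / n - m * sum_x / n        # intercept

print(f"CSS-X = {css_x:.4f}, CSCP-XY = {cscp_xy:.3f}")
print(f"Y = {b:.5f} + {m:.5f}*X")
```

The script carries full precision through every step, so the intercept can differ from the hand calculation in the fourth or fifth decimal place, where the sidebar uses a rounded value of m.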

### Table A

| Observation | X | Y | X² | Y² | XY |
|---|---|---|---|---|---|
| 1 | 56.9 | 226.5 | 3,237.61 | 51,302.25 | 12,887.85 |
| 2 | 37.1 | 174.9 | 1,376.41 | 30,590.01 | 6,488.79 |
| 3 | 66.2 | 297.4 | 4,382.44 | 88,446.76 | 19,687.88 |
| 4 | 90.1 | 389.0 | 8,118.01 | 151,321.00 | 35,048.90 |
| 5 | 57.5 | 233.4 | 3,306.25 | 54,475.56 | 13,420.50 |
| 6 | 54.5 | 218.1 | 2,970.25 | 47,567.61 | 11,886.45 |
| 7 | 57.9 | 231.4 | 3,352.41 | 53,545.96 | 13,398.06 |
| 8 | 53.1 | 236.8 | 2,819.61 | 56,074.24 | 12,574.08 |
| 9 | 12.2 | 54.5 | 148.84 | 2,970.25 | 664.90 |
| 10 | 9.9 | 39.5 | 98.01 | 1,560.25 | 391.05 |
| Sum | 495.4 (1) | 2,101.5 (2) | 29,809.84 (3) | 537,853.89 (4) | 126,448.46 (5) |

## Example Calculations-Part II

Example Calculations for Approximate 95% Confidence Interval for Observation 1 From Table 1

A 95% confidence interval does not mean there is a 95% chance the particular interval you calculate captures the true value. It means if you calculate several intervals by this formula under similar conditions over an indefinitely long period of time, 95% of the time the interval will capture the proper value. The 95% is a statement of the long-term expected average performance of the calculation method.

Set aside observation one and estimate the prediction model using observations two to 10. Repeat the calculations from Sidebar 1, modified as follows in Table A.

We need the corrected sum of squares for X (CSS-X), where (1) through (5) are the column sums in Table A and n is the number of observations:

$$
\begin{aligned}
\text{CSS-X} &= \frac{(3) - (1)^2 / n}{n - 1} = \frac{26{,}572.23 - 438.5^2 / 9}{9 - 1} = \frac{26{,}572.23 - 192{,}282.25 / 9}{8} \\
&= \frac{26{,}572.23 - 21{,}364.694}{8} = \frac{5{,}207.536}{8} = 650.9420
\end{aligned}
$$

We need the corrected sum of cross products for X and Y (CSCP-XY):

$$
\begin{aligned}
\text{CSCP-XY} &= \frac{(5) - (1)(2) / n}{n - 1} = \frac{113{,}560.61 - (438.5)(1{,}875.0) / 9}{9 - 1} = \frac{113{,}560.61 - 822{,}187.5 / 9}{8} \\
&= \frac{113{,}560.61 - 91{,}354.166}{8} = \frac{22{,}206.444}{8} \approx 2{,}775.806
\end{aligned}
$$

The formula for m is

$$
m = \frac{\text{CSCP-XY}}{\text{CSS-X}} = \frac{2{,}775.806}{650.9420} = 4.26429
$$

The formula for b is

$$
b = \frac{(2)}{n} - m\,\frac{(1)}{n} = \frac{1{,}875.0}{9} - (4.26429)\,\frac{438.5}{9} = 208.3333 - (4.26429)(48.7222) = 208.3333 - 207.7656 = 0.5677
$$

Equation based on observations 2 to 10 is Y = 0.5677 + 4.26429*X

Now, compute the predicted value Y = 0.5677 + 4.26429*X and residuals for observations two to 10. The residuals are the differences between the actual values of Y and the predicted values. Square the residuals and sum to obtain another important quantity for calculating the approximate 95% confidence limits.

### Table A

| Observation | X | Y | X² | Y² | XY |
|---|---|---|---|---|---|
| 2 | 37.1 | 174.9 | 1,376.41 | 30,590.01 | 6,488.79 |
| 3 | 66.2 | 297.4 | 4,382.44 | 88,446.76 | 19,687.88 |
| 4 | 90.1 | 389.0 | 8,118.01 | 151,321.00 | 35,048.90 |
| 5 | 57.5 | 233.4 | 3,306.25 | 54,475.56 | 13,420.50 |
| 6 | 54.5 | 218.1 | 2,970.25 | 47,567.61 | 11,886.45 |
| 7 | 57.9 | 231.4 | 3,352.41 | 53,545.96 | 13,398.06 |
| 8 | 53.1 | 236.8 | 2,819.61 | 56,074.24 | 12,574.08 |
| 9 | 12.2 | 54.5 | 148.84 | 2,970.25 | 664.90 |
| 10 | 9.9 | 39.5 | 98.01 | 1,560.25 | 391.05 |
| Sum | 438.5 (1) | 1,875.0 (2) | 26,572.23 (3) | 486,551.64 (4) | 113,560.61 (5) |

## Example Calculations-Part III

Example Calculations for Approximate 95% Confidence Interval for Observation 1 From Table 1

Quantity (6), the sum of the squared residuals, is a summary measure of how well the model fits. Other things being equal, small values of (6) are preferred to larger ones. Since there are ways to make (6) small at the expense of other important properties, (6) should not be the only criterion for judging a model.

From (6), we derive a quantity called the root mean square error (RMSE) as follows:

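$$
\text{RMSE} = \left[\frac{(6)}{n - 2}\right]^{1/2} = \left[\frac{1{,}231.9192}{9 - 2}\right]^{1/2} = \left[\frac{1{,}231.9192}{7}\right]^{1/2} = [175.9885]^{1/2} = 13.2661
$$

where n is the number of observations used to fit the model.

The formula for the approximate 95% confidence interval is

$$
\text{predicted } Y \pm (2)(\text{RMSE})\left[1 + \frac{1}{n} + \frac{\left(X - (1)/n\right)^2}{(3)}\right]^{1/2}
$$

where (1) and (3) are the sums of X and X² for the observations used to fit the model.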

A more precise figure could be obtained by replacing 2 with a value from an important statistical distribution called the t distribution. For the number of observations in this example, the right value from the t distribution is 2.36.

For the full set of observations in Table 1, the right value of the t distribution is 2.03. This is close enough to 2 for our purposes in this article.

The predicted value of Y for observation 1, the one that was left out, is predicted Y = 0.5677 + (4.26429*56.9) = 0.5677 + 242.6381 = 243.2058.

The approximate 95% confidence interval for observation 1 is

$$
\begin{aligned}
&243.2058 \pm (2)(13.2661)\left[1 + \frac{1}{9} + \frac{(56.9 - 438.5/9)^2}{26{,}572.23}\right]^{1/2} \\
&= 243.2058 \pm 26.5322\left[1 + \frac{1}{9} + \frac{(56.9 - 438.5/9)^2}{26{,}572.23}\right]^{1/2} \\
&= 243.2058 \pm 26.5322\left[1 + 0.1111 + \frac{(56.9 - 438.5/9)^2}{26{,}572.23}\right]^{1/2} \\
&= 243.2058 \pm 26.5322\left[1 + 0.1111 + \frac{(56.9 - 48.7222)^2}{26{,}572.23}\right]^{1/2} \\
&= 243.2058 \pm 26.5322\left[1 + 0.1111 + \frac{(8.1778)^2}{26{,}572.23}\right]^{1/2} \\
&= 243.2058 \pm 26.5322\left[1 + 0.1111 + \frac{66.8764}{26{,}572.23}\right]^{1/2} \\
&= 243.2058 \pm 26.5322\,[1 + 0.1111 + 0.0025]^{1/2} \\
&= 243.2058 \pm 26.5322\,[1.1136]^{1/2} \\
&= 243.2058 \pm 26.5322\,[1.0553] \\
&= 243.2058 \pm 27.9994
\end{aligned}
$$

The approximate 95% lower confidence limit is 243.2058 - 27.9994 = 215.2064. The approximate 95% upper confidence limit is 243.2058 + 27.9994 = 271.2052. The actual value of Y for observation 1 is 226.5, in between the confidence limits. The test for generality is passed for this observation.
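The interval calculation for observation 1 can be sketched end to end as follows. The script follows the sidebar's arithmetic, including the multiplier of 2 and division by quantity (3); the helper name `half_width` is hypothetical. As an aside, the usual textbook form of this interval divides by the corrected sum of squares, the sum of (x - x̄)², rather than by quantity (3), so the sketch computes that variant as well for comparison. For this observation the textbook version is slightly wider and leads to the same conclusion.

```python
import math

# First 10 observations of Table 1
X = [56.9, 37.1, 66.2, 90.1, 57.5, 54.5, 57.9, 53.1, 12.2, 9.9]
Y = [226.5, 174.9, 297.4, 389.0, 233.4, 218.1, 231.4, 236.8, 54.5, 39.5]

x0, y0 = X[0], Y[0]      # observation 1, set aside
xs, ys = X[1:], Y[1:]    # observations 2 to 10, used to fit the model
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)    # corrected sum of squares of X
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
m = sxy / sxx
b = y_bar - m * x_bar

sse = sum((y - (b + m * x)) ** 2 for x, y in zip(xs, ys))   # quantity (6)
rmse = math.sqrt(sse / (n - 2))
predicted = b + m * x0
sum_x2 = sum(x * x for x in xs)                             # quantity (3)


def half_width(denominator, multiplier=2.0):
    """Half-width of the approximate interval for the set-aside point."""
    spread = (x0 - x_bar) ** 2 / denominator
    return multiplier * rmse * math.sqrt(1 + 1 / n + spread)


article_hw = half_width(sum_x2)   # divides by (3), as in the sidebar
textbook_hw = half_width(sxx)     # divides by the corrected sum of squares

lower, upper = predicted - article_hw, predicted + article_hw
print(f"RMSE {rmse:.4f}, predicted {predicted:.4f}")
print(f"sidebar-style limits: {lower:.4f} to {upper:.4f}")
print(f"textbook half-width: {textbook_hw:.4f}")
print("observation 1 inside interval:", lower <= y0 <= upper)
```

Passing `multiplier=2.36` to `half_width` would apply the t-distribution value quoted above for nine observations instead of the rounded value of 2.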

### Table A

| Observation | X | Y | Predicted Y | Residual (Y - Predicted Y) | Residual² |
|---|---|---|---|---|---|
| 2 | 37.1 | 174.9 | 158.7729 | 16.1271 | 260.0847 |
| 3 | 66.2 | 297.4 | 282.8637 | 14.5363 | 211.3041 |
| 4 | 90.1 | 389.0 | 384.7802 | 4.2198 | 17.8065 |
| 5 | 57.5 | 233.4 | 245.7644 | -12.3644 | 152.8778 |
| 6 | 54.5 | 218.1 | 232.9715 | -14.8715 | 221.1617 |
| 7 | 57.9 | 231.4 | 247.4701 | -16.0701 | 258.2478 |
| 8 | 53.1 | 236.8 | 227.0015 | 9.7985 | 96.0106 |
| 9 | 12.2 | 54.5 | 52.5920 | 1.9080 | 3.6403 |
| 10 | 9.9 | 39.5 | 42.7842 | -3.2842 | 10.7858 |
| Sum | | | | | 1,231.9192 (6) |

JOSEPH D. CONKLIN is a mathematical statistician at the U.S. Department of Energy in Washington, DC. He earned a master's degree in statistics from Virginia Tech and is a senior member of ASQ. Conklin is also an ASQ certified quality manager, quality engineer, quality auditor and reliability engineer.