 ## 2020

STATISTICS ROUNDTABLE

by Robert L. Mason and John C. Young

Models are often developed in industry to characterize and explain a process because they can show how process variables are interconnected and interrelated. Historically, two particular methods have been used to construct models to characterize an industrial process. Each method not only uses different procedures to determine the model coefficients, but also uses the resultant models in different ways (see Table 1).

### Theoretical Models

The first model building method is derived from the mathematical theory and physical laws that govern the process. This type of model is primarily used in the design and construction of the processing unit. It also can be used as a performance model in certain situations.

Theoretical models, although correctly based on underlying mathematical principles, often do not perform well in applications. This is because the higher level mathematics used to develop them often must be oversimplified in application and thus may not account for a substantial amount of variation in the data. And, unless individually corrected, such models also fail to account for the individual idiosyncrasies of a processing unit.

### Empirical/Regression Models

The second approach to model building is data based and uses statistical procedures. With this approach, a model is empirically fit using a baseline data set that represents good operations for the processing unit. The simplest version of this model is linear in form and is obtained using common regression techniques.1

The regression approach requires knowledge of statistical methods and experimental design. Regression models are appealing because little process information is required in the development and implementation stages. In many process situations, regression models outperform theoretically designed models and, unlike their mathematical counterparts, allow for adjustments for the inherent differences that exist between supposedly identical units.

### Simple Regression Model

The simplest form of a regression model, used to relate a response variable, $y$, to a set of $p$ predictor variables, is shown as $y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$ (equation one).

In this equation, $x_1, x_2, \dots, x_p$ represent the predictor variables, and $\beta_1, \beta_2, \dots, \beta_p$ represent their respective unknown coefficients. The intercept is represented by $\beta_0$, and the error is denoted by $\varepsilon$.

Good regression models depend on the strength of the linear relationships between the response and predictor variables and on the availability of a good historical or baseline data set. The stronger the linear relationships, the better the model.
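As a minimal sketch of how the coefficients in equation one might be estimated, the Python snippet below fits the model by ordinary least squares using NumPy. The function name `fit_linear_model` and the small noise-free data set are illustrative assumptions; they are not taken from the column or from any actual processing unit.

```python
import numpy as np

def fit_linear_model(X, y):
    """Return least-squares estimates [b0, b1, ..., bp] for
    y = b0 + b1*x1 + ... + bp*xp."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Hypothetical baseline data generated from y = 2 + 3*x1 - 1*x2 (no noise).
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in X]

coef = fit_linear_model(X, y)
print(coef)  # intercept first, then one coefficient per predictor
```

Because the made-up data contain no error term, the fit should recover the generating coefficients; with real baseline data the estimates would only approximate the underlying relationship.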

A major side benefit of the regression model approach is that it allows the user to explain the variation of the response variable in terms of the predictors. This is useful when you are studying how a quality or output variable is related to a set of process variables. If the regression is good, you can account for the variation in the response variable by examining the predictor variables.

Consider a process in which electricity, measured by megawatt (mw) production, is being generated as a source of power for industrial use. High energy steam is produced in a boiler system and used to run a turbine generator. The high temperature, high pressure steam turns the turbine generator and produces the electricity. Latent steam from the turbine is changed back into water in a condenser and is returned to the boiler.

The mw production of the turbine is modeled using steam flow (stmfl), steam temperature (stmtp), condenser temperature (cdtp) and absolute pressure of the condenser (cdpr) as the predictor variables. Using a baseline set of data collected during a period of good operations, an estimate of the regression model in equation one is obtained using regression techniques1 and is given by mw = -21.09 + 0.113 stmfl + 0.029 stmtp – 0.036 cdtp – 0.330 cdpr (equation two).

The squared correlation coefficient for this equation is $R^2 = 0.991$. This value indicates 99.1% of the variation in mw production can be accounted for by examining the four production variables included in equation two. Further analysis indicates steam flow is the most critical production variable, followed by steam temperature. Next in order of importance are the condenser temperature and condenser pressure variables. Determining the most important variables that contribute to the variation gives you an initial place to look when there are disruptions in the process.
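To make the squared correlation coefficient concrete, here is a small sketch that computes it as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the four observed and fitted values are hypothetical; they are not the turbine data behind the 0.991 figure.

```python
def r_squared(actual, predicted):
    """Proportion of the variation in `actual` accounted for by `predicted`."""
    mean_y = sum(actual) / len(actual)
    # Variation left unexplained by the model.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation of the response about its mean.
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed responses and model predictions.
observed = [10.0, 12.0, 15.0, 19.0]
fitted = [10.5, 11.5, 15.5, 18.5]

r2 = r_squared(observed, fitted)
print(round(r2, 3))
```

A value near 1 means the predictions track the observed responses closely, which is the sense in which the column describes 99.1% of the variation as accounted for.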

### Another Use

A regression model, such as the one in equation two, also can be used to predict values of the response variable. Being able to predict with a high degree of accuracy a future value of an important process variable is useful when monitoring a process. Many applications that involve prediction actually depend on the size of the error made in the prediction. This particular type of error is referred to as the residual error and is defined as the difference between the actual observed value of the response variable, $y_a$, and the predicted value of the variable, $y_p$, as obtained from the regression equation.
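The prediction and residual calculation can be sketched as follows, using the coefficients reported in equation two. The operating values in `reading` and the observed value `y_a` are hypothetical, chosen only to illustrate the arithmetic; the column does not give units or individual observations.

```python
# Coefficients copied from equation two in the column.
EQ2_INTERCEPT = -21.09
EQ2_COEF = {"stmfl": 0.113, "stmtp": 0.029, "cdtp": -0.036, "cdpr": -0.330}

def predict_mw(reading):
    """Predicted mw production for one set of predictor values."""
    return EQ2_INTERCEPT + sum(EQ2_COEF[name] * reading[name] for name in EQ2_COEF)

def residual(actual, predicted):
    """Residual error: observed value minus predicted value."""
    return actual - predicted

# Hypothetical operating point (values are assumptions, not plant data).
reading = {"stmfl": 400.0, "stmtp": 900.0, "cdtp": 100.0, "cdpr": 2.0}

y_p = predict_mw(reading)   # predicted mw
y_a = 47.0                  # hypothetical observed mw
e = residual(y_a, y_p)
print(f"predicted={y_p:.2f}, observed={y_a:.2f}, residual={e:.2f}")
```

A positive residual means the unit produced more than the model predicted for those conditions; a negative one means it produced less.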

The residual error for a given sample point with coordinates $(x_a, y_a)$ for a straight line regression equation is depicted in Figure 1. Note the error is measured vertically, parallel to the y-axis, not perpendicular to the regression line. A small residual value implies the regression equation provides a good prediction. In other words, the predicted value is close to the actual value. A large residual indicates poor prediction and, in some situations, indicates a process upset.

The residuals from an estimated regression model provide an excellent multivariate control chart statistic.3 The associated charting procedure is simple to construct and implement. For example, reconsider the steam turbine example in which the fuel used in producing the steam in the boiler is now being monitored by examining the residual error of an estimated input-output (I/O) model for the steam turbine system. The I/O model for the boiler is given by $\text{fuel} = \beta_0 + \beta_1\,\text{stmfl} + \beta_2\,\text{stmtp} + \beta_3\,\text{cdtp} + \beta_4\,\text{cdpr} + \varepsilon$ (equation three).

Using a good baseline data set, an estimate of the model in equation three is given by fuel = -86.29 + 1.22 stmfl + 0.08 stmtp + 0.37 cdtp – 0.01 cdpr (equation four).

Note the estimated model expresses the input fuel to the system as a function of the output steam production as measured by steam flow and steam temperature. The condenser also is considered an integral part of the system because it is used to return the water to the boiler. Thus, the condenser temperature and the absolute pressure of the condenser also are included as predictor variables in equation three.

A typical residual error plot of the fuel usage for steam production for this control chart example is presented in Figure 2. The residual errors plotted in the graph are obtained for each observation by taking the difference between the actual observed fuel value and the fuel amount predicted by the estimated I/O model given in equation four. The $R^2$ statistic for this regression equation is 99.3%. It indicates equation four provides an excellent fit to the data because it explains 99.3% of the fuel variation.

The estimated standard error of prediction, which is the average amount of error contained in each prediction for this equation, is 8.29. This is a small, single digit error value relative to the fuel values measured in the hundreds of units per hour. The control limits for the plotted residual errors in Figure 2 are established at ±3 standard errors, or ±3(8.29) = ±24.87.

The I/O regression in equation four provides an excellent estimate of the relationship between fuel usage and its predictor variables: steam flow, steam temperature, condenser temperature and condenser pressure. When the residuals vary between the specified upper and lower control limits, you can conclude the run conditions of the unit match the good operation conditions of the baseline data. If a residual value lies outside the limits, you can conclude an upset condition has occurred in the process.
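The control limit check can be sketched in a few lines. The standard error of 8.29 comes from the column; the function name `out_of_control` and the list of residuals are hypothetical and are not the values plotted in Figure 2.

```python
STD_ERR = 8.29        # estimated standard error of prediction, from the column
UCL = 3 * STD_ERR     # upper control limit
LCL = -3 * STD_ERR    # lower control limit

def out_of_control(residuals, lower=LCL, upper=UCL):
    """Return the indices of residuals that fall outside the control limits."""
    return [i for i, r in enumerate(residuals) if r < lower or r > upper]

# Hypothetical fuel residuals (actual fuel minus fuel predicted by equation four).
residuals = [2.1, -5.4, 7.9, -30.2, 4.0, 26.0]

flagged = out_of_control(residuals)
print(f"limits: {LCL:.2f} to {UCL:.2f}; flagged observations: {flagged}")
```

In this made-up sequence, the fourth and sixth residuals exceed the limits in absolute value and would be read as signals of an upset condition.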

The logic behind this charting procedure is simple. If the present operational conditions match the baseline and you have a good model as judged by the $R^2$ value and the size of the standard error of prediction, then the regression model will accurately predict the fuel used in steam production with a small residual error. Conversely, a large residual (in absolute value) that plots outside the control limits will result when the operating conditions differ radically from the baseline conditions. A group of successive small residual values, denoted by a sequence of residuals with the same sign either above or below the mean line of zero, will also result when minor upset conditions are present.
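A run of same-sign residuals can be detected with a simple scan, sketched below. The column does not say how many successive residuals constitute a signal, so the threshold `RUN_LENGTH = 8` is an assumption (one common convention in run rules), and the residual values are hypothetical.

```python
RUN_LENGTH = 8  # assumed threshold; the column does not specify a run length

def longest_same_sign_run(residuals):
    """Length of the longest stretch of consecutive residuals on the same
    side of the zero line. A residual of exactly zero breaks a run."""
    best = current = 0
    prev_sign = 0
    for r in residuals:
        sign = (r > 0) - (r < 0)   # +1, -1, or 0
        if sign != 0 and sign == prev_sign:
            current += 1
        else:
            current = 1 if sign != 0 else 0
        prev_sign = sign
        best = max(best, current)
    return best

# Hypothetical residuals, all well inside the control limits.
residuals = [3.0, -1.2, 2.5, 4.1, 1.0, 0.7, 3.3, 2.2, 5.0, 1.9, -0.4]

run = longest_same_sign_run(residuals)
minor_upset = run >= RUN_LENGTH
print(run, minor_upset)
```

None of these residuals would trip the three standard error limits, yet eight in a row sit above zero, which is the kind of pattern the column associates with a minor upset.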

The methodology described in this column depends on your ability to construct good regression models using baseline data. If this is not possible, the residuals may be too variable to detect trend changes. To ensure a good fit, obtain a stable historical data set that is not contaminated by outliers, missing data or strong linear relationships between subsets of the predictor variables. It is also useful to choose important input variables that reflect the process operations.
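One way to screen a baseline data set for strong linear relationships among the predictors is to compute variance inflation factors (VIFs). This is a standard diagnostic, not one the column names, and the cutoff of 10 used here is a common rule of thumb adopted as an assumption. The sketch uses the identity that the VIFs are the diagonal entries of the inverse of the predictor correlation matrix; the data are simulated.

```python
import numpy as np

VIF_CUTOFF = 10.0  # assumed rule-of-thumb threshold

def variance_inflation_factors(X):
    """VIF for each column of X, taken from the diagonal of the inverse
    correlation matrix. Fails if predictors are exactly collinear."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Simulated predictors: x3 is nearly a linear combination of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.05 * rng.normal(size=200)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
suspect = [i for i, v in enumerate(vifs) if v > VIF_CUTOFF]
print(suspect)  # column indices involved in a near-linear relationship
```

Large VIFs suggest that one or more predictors could be dropped or combined before the regression model is fit to the baseline data.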

### REFERENCES

1. T.P. Ryan, Modern Regression Methods, John Wiley & Sons, 1997.
2. Ibid.
3. R.L. Mason and J.C. Young, “Improving the Sensitivity of the $T^2$ Statistic in Multivariate Process Control,” Journal of Quality Technology, 1999, pp. 155-165.

ROBERT L. MASON is an institute analyst at Southwest Research Institute in San Antonio. He received a doctorate in statistics from Southern Methodist University in Dallas and is a Fellow of ASQ.

JOHN C. YOUNG is president of InControl Technologies and a professor of statistics at McNeese State University in Lake Charles, LA. He received a doctorate in statistics from Southern Methodist University.
