The Statistical Engineer
Tools and techniques to help process engineers do their jobs
by Robert L. Mason and John C. Young
Engineers working in processing industries are often overwhelmed by the amount of data available to them. Until recently, most industries collected only a small amount of information on their processes. Process engineers had few observations on a small number of critical variables to guide decisions on how the process was to be operated.
That situation has changed. With the development of electronic data-gathering systems, such as distributed control systems, process engineers now have many observations available on a multitude of variables. They also can store these observations electronically for study and review at any time.
This task of gathering and maintaining observations has added an extra dimension to a process engineer’s job. It has created a new role—statistical engineer—that entails being able to transform the data observations into useful process information.
Laying the foundation
To do this, process engineers need a certain degree of proficiency in using the appropriate statistical tools for analyzing many types of data problems. These problems can range from selecting random data samples to designing statistical experiments.
Process engineers also might need to know how to construct and apply various prediction (regression) equations. These analyses usually include the application of statistical process control procedures, covering univariate and multivariate samples. In addition, there is a need in all areas of application to be able to test a statistical hypothesis.
In short, the increase in process data has led process engineers to seek more training in statistics.
This result immediately generates questions: "What type of statistical background is necessary to produce statistical engineers?" and "Where can such training be obtained?" It is doubtful this level of training is offered in an undergraduate engineering degree program at a local university. Most university engineering degree curriculums are already filled with required engineering courses and include limited time for external course electives, such as in statistics.
Often, the educational requirements for this new role can be met by obtaining a master’s degree in applied statistics or attending statistics-oriented seminars and quality programs, such as those involving Six Sigma.
Statistical courses recommended for someone serving as a statistical engineer can vary. A first course (for example, applied statistics I) usually contains data presentation (frequency tables, histograms and box plots), descriptive statistics (mean, median, mode, range, variance, standard deviation, quartile and quantile), basic normal probability theory and statistical hypothesis testing. Several one-sample testing procedures—such as the binomial test, normal test and the t-test—are detailed, along with confidence intervals. The appropriate tests of hypotheses for (one-sample) variance problems are also included.
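As a minimal sketch of the first-course material, the following Python computes a few descriptive statistics, a one-sample t statistic and a 95% confidence interval for the mean. The readings, the hypothesized mean of 10.0 and the function name are hypothetical. The value 2.262 is the tabled two-sided 95% t critical value for 9 degrees of freedom.

```python
import math
import statistics


def one_sample_t(sample, mu0, t_crit):
    """Return the t statistic for H0: mean == mu0 and a confidence interval.

    t_crit is the tabled critical value for len(sample) - 1 degrees of freedom.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # stdev uses the n - 1 divisor
    t_stat = (mean - mu0) / std_err
    interval = (mean - t_crit * std_err, mean + t_crit * std_err)
    return t_stat, interval


# Hypothetical process readings (illustrative values only).
readings = [10.2, 9.8, 10.1, 10.4, 9.9, 10.3, 10.0, 10.2, 9.7, 10.4]

print("mean:", statistics.mean(readings))
print("median:", statistics.median(readings))
print("range:", max(readings) - min(readings))
print("quartiles:", statistics.quantiles(readings, n=4))

T_CRIT_DF9 = 2.262  # two-sided 95% t critical value, 9 degrees of freedom
t_stat, interval = one_sample_t(readings, mu0=10.0, t_crit=T_CRIT_DF9)
print("t statistic:", round(t_stat, 3))
print("95% CI:", interval)
```

For these made-up readings, the hypothesized mean falls inside the interval and the t statistic is smaller than the critical value, so the test does not reject the hypothesis at the 5% level.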
A continuation of this course (for example, applied statistics II) expands the coverage to two-sample tests of hypotheses for the mean (for example, the two-sample t-test and the paired t-test) and the two-sample variance test (for example, the F-test). Topics ranging from the chi-square goodness-of-fit test to the one-way analysis of variance (ANOVA) procedure are also covered.
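The two-sample statistics named here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a full testing procedure: the data and function names are hypothetical, the pooled t statistic assumes equal population variances, and critical values would still have to come from t and F tables.

```python
import math
import statistics


def pooled_t(a, b):
    """Two-sample t statistic and degrees of freedom, assuming equal variances."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / std_err, na + nb - 2


def paired_t(first, second):
    """Paired t statistic and degrees of freedom, based on within-pair differences."""
    diffs = [x - y for x, y in zip(first, second)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n)), n - 1


def variance_ratio(a, b):
    """F statistic for comparing two variances, larger variance in the numerator."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return max(va, vb) / min(va, vb)


# Hypothetical yields from two production lines (illustrative values only).
line_a = [5.1, 4.9, 5.3, 5.0, 5.2]
line_b = [4.6, 4.8, 4.5, 4.9, 4.7]

print("pooled t, df:", pooled_t(line_a, line_b))
print("paired t, df:", paired_t(line_a, line_b))
print("F ratio:", variance_ratio(line_a, line_b))
```

The paired version applies only when each value in one sample is naturally matched with a value in the other, for example the same unit measured before and after a change.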
An important topic covered in this second course is the correlation between two variables. Not only is the technical definition of correlation presented, but graphical techniques are also introduced to recognize when correlation exists between two variables.
Figure 1 shows a scatterplot of shoe size versus height for 85 male college students. In this plot, height tends to increase as shoe size increases, which indicates a positive correlation between the two variables. In general, this second applied statistics course emphasizes the appropriate application of these procedures.
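The technical definition of correlation can be illustrated with a short Python sketch of the sample correlation coefficient. The shoe sizes and heights below are hypothetical values chosen for illustration; they are not the 85 observations plotted in Figure 1.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient between paired observations x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (shoe size, height in inches) pairs, not the Figure 1 data.
shoe_size = [8, 9, 9.5, 10, 10.5, 11, 12, 13]
height = [66, 68, 69, 70, 70, 72, 73, 75]

r = pearson_r(shoe_size, height)
print("r =", round(r, 3))
```

A value of r near +1 corresponds to the upward-sloping pattern seen in a scatterplot, a value near -1 to a downward-sloping pattern, and a value near 0 to no linear pattern.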
We cannot overemphasize the importance of these two foundational courses, which lay the groundwork for the development of other statistical procedures. For example, it is easy to develop Shewhart charting procedures (univariate statistical process control) by establishing confidence intervals on the population mean. Also, good coverage of the testing of hypotheses of equal means for the two-sample problem provides a natural lead-in to ANOVA procedures.
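The link between confidence intervals and Shewhart charts can be sketched as follows. The limits have the same form as an interval on the mean, with a multiplier of 3 in place of a tabled critical value. This is a simplified illustration: the subgroup data and function name are hypothetical, subgroups are assumed to be of equal size, and sigma is estimated from the pooled within-subgroup variance instead of the chart constants used in practice.

```python
import math
import statistics


def xbar_limits(subgroups, k=3.0):
    """Return (lower limit, center line, upper limit) for subgroup means.

    Assumes equal-size subgroups and estimates sigma from the pooled
    within-subgroup variance, a simplification of the usual chart constants.
    """
    n = len(subgroups[0])
    means = [statistics.mean(g) for g in subgroups]
    center = statistics.mean(means)
    pooled_var = statistics.mean([statistics.variance(g) for g in subgroups])
    half_width = k * math.sqrt(pooled_var / n)
    return center - half_width, center, center + half_width


# Hypothetical subgroups of three measurements each (illustrative values only).
subgroups = [[10, 12, 11], [11, 13, 12], [9, 11, 10]]

lcl, center, ucl = xbar_limits(subgroups)
print("limits:", round(lcl, 3), center, round(ucl, 3))
for i, group in enumerate(subgroups, start=1):
    m = statistics.mean(group)
    status = "in control" if lcl <= m <= ucl else "signal"
    print("subgroup", i, "mean", m, status)
```

With these made-up subgroups, all three means fall inside the limits.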
ANOVA procedures constitute a second important area of statistical concentration for a statistical engineer. This topic is important not only for learning how to compute different ANOVA procedures (for example, one-way and two-way), but also for understanding experimental design concepts, such as completely randomized design, randomized block design and factorial design.
For example, in a block design, similar experimental units are grouped together to form a homogeneous group called a statistical block. Units in different blocks, however, differ from one another. By combining several blocks in an experiment, you can improve the efficiency of the design because a source of variation, the blocks, has been accounted for in the experiment.
Consider the nine different circles in Figure 2. Using a block design, you arrange the circles into three blocks, with three similar circles within each block, as shown in Figure 3.
Notice the variation within the three blocks is minimal. The variation between the blocks, which is due to the heterogeneity of the blocks, can be included as a source of variation in the ANOVA.
After learning the concept of blocking, the process engineer usually understands how to control the effect of an extraneous variable on the total variation of an experiment.
For example, by holding the temperature constant for each block, the variation of a temperature component can be removed from the total process variation. The process engineer quickly realizes this concept can be extended and that more than one variable can be blocked. This may be the first time the engineer has been exposed to the concept of changing the value of more than one variable in an experiment.
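The effect of blocking on the analysis can be sketched with the sums of squares for a randomized block design. The table below is hypothetical: each row stands for a block run at one constant temperature and each column for a treatment. The function name and the returned keys are assumptions made for this illustration.

```python
def rcbd_anova(table):
    """Sums of squares and F ratios for a randomized block design.

    table[i][j] is the response for block i under treatment j.
    """
    b, t = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (b * t)
    block_means = [sum(row) / t for row in table]
    treat_means = [sum(row[j] for row in table) / b for j in range(t)]

    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_blocks = t * sum((m - grand) ** 2 for m in block_means)
    ss_treat = b * sum((m - grand) ** 2 for m in treat_means)
    ss_error = ss_total - ss_blocks - ss_treat

    ms_treat = ss_treat / (t - 1)
    f_blocked = ms_treat / (ss_error / ((b - 1) * (t - 1)))
    # Ignoring blocks (one-way ANOVA) leaves block variation in the error term.
    f_unblocked = ms_treat / ((ss_total - ss_treat) / (t * (b - 1)))
    return {
        "ss_total": ss_total,
        "ss_blocks": ss_blocks,
        "ss_treat": ss_treat,
        "ss_error": ss_error,
        "f_blocked": f_blocked,
        "f_unblocked": f_unblocked,
    }


# Hypothetical responses: rows are temperature blocks, columns are treatments.
table = [
    [10, 12, 14],
    [20, 23, 24],
    [30, 31, 34],
]

result = rcbd_anova(table)
for name, value in result.items():
    print(name, round(value, 4))
```

In this made-up table, most of the total variation comes from the blocks. Removing it from the error term raises the treatment F ratio from about 0.12 to 36, which is the gain in efficiency that blocking is meant to provide.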
Multivariate techniques a must
A third area of study that is as important as the first two is applied multivariate techniques.
The first multivariate technique is multiple regression analysis. This procedure is necessary if you are going to examine many variables that are interrelated and move together as a group.
Examples include the many variables associated with certain chemical processes and the numerous variables involved with the generation of electricity. In both cases, the variables form a multivariate group, that is, an interrelated group of variables that move together and are not independent of one another [1].
There are numerous applications of multiple regression techniques in an industrial process. Through these applications, the engineer learns how to construct the best prediction equation for the variable of interest in terms of the observations taken on the other related variables.
For example, you might be interested in predicting the amount of fuel used to produce a certain amount of electricity. Also, through residual analysis [2], you learn how to examine the variable of interest with the effect of another variable removed. This technique also can be extended to remove a time dependency (autocorrelation) within an individual variable. This can be an extremely valuable tool when analyzing autocorrelated data.
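A prediction equation and its residuals can be sketched in plain Python using ordinary least squares. The fuel, load and temperature values are hypothetical, as are the function names. The normal equations are solved directly here for brevity, whereas statistical software generally uses more numerically stable methods.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def residuals(rows, y, coef):
    """Observed minus predicted values for each observation."""
    return [yi - (coef[0] + sum(c * x for c, x in zip(coef[1:], r)))
            for r, yi in zip(rows, y)]


# Hypothetical data: unit load, ambient temperature and fuel used.
load = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
temp = [15, 22, 18, 30, 25, 20]
fuel = [12.1, 16.8, 21.0, 27.2, 31.1, 35.0]

# Prediction equation for fuel in terms of both related variables.
coef = fit_ols(list(zip(load, temp)), fuel)
print("intercept, load, temp coefficients:", [round(c, 3) for c in coef])

# Fuel with the linear effect of temperature removed.
temp_only = [[t] for t in temp]
adjusted = residuals(temp_only, fuel, fit_ols(temp_only, fuel))
print("fuel adjusted for temperature:", [round(e, 3) for e in adjusted])
```

The second fit shows the residual idea in its simplest form: the residuals from regressing fuel on temperature are the fuel values with the linear effect of temperature taken out.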
Another important use of regression analysis is the study of total variation of a system and how an individual variable contributes to the total variation.
An additional benefit gained from studying multiple regression analysis is that it encourages the student to think in terms of many variables, not just one. It also reinforces the concept that more than one variable can be changed at a time. This opens the door to applying other useful multivariate techniques.
One of these is principal component analysis [3], and another is discriminant analysis. Discriminant analysis is a multivariate technique that allows you to determine group membership for a particular observation. Based on a credit report, for example, it can be determined whether an individual is a good credit risk. This procedure has wide application in the process industry in making decisions about product quality.
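One common form of discriminant analysis, Fisher's linear discriminant for two groups, can be sketched as follows. This is an illustration under stated assumptions: the two quality variables, the "good" and "off-spec" samples and the function names are hypothetical, and the rule assumes both groups share a common covariance matrix and are equally likely.

```python
def mean_vector(points):
    """Mean of a list of (x, y) observations."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def pooled_covariance(g1, g2):
    """Pooled 2x2 covariance of two groups, returned as (sxx, sxy, syy)."""
    sxx = sxy = syy = 0.0
    for group in (g1, g2):
        mx, my = mean_vector(group)
        for x, y in group:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    df = len(g1) + len(g2) - 2
    return sxx / df, sxy / df, syy / df


def fisher_rule(g1, g2):
    """Return weights w and cutoff c; a point belongs to group 1 if w.x > c."""
    m1, m2 = mean_vector(g1), mean_vector(g2)
    sxx, sxy, syy = pooled_covariance(g1, g2)
    det = sxx * syy - sxy ** 2
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    # w = S^-1 (m1 - m2), using the closed-form inverse of a 2x2 matrix.
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    cutoff = 0.5 * (w[0] * (m1[0] + m2[0]) + w[1] * (m1[1] + m2[1]))
    return w, cutoff


def classify(point, w, cutoff, labels=("good", "off-spec")):
    """Assign a new observation to the first or second group."""
    score = w[0] * point[0] + w[1] * point[1]
    return labels[0] if score > cutoff else labels[1]


# Hypothetical measurements of two quality variables on past product.
good = [(2.0, 3.0), (2.5, 3.5), (3.0, 3.2), (2.2, 2.8), (2.8, 3.6)]
off_spec = [(4.0, 1.0), (4.5, 1.5), (5.0, 1.2), (4.2, 0.8), (4.8, 1.6)]

w, cutoff = fisher_rule(good, off_spec)
print(classify((2.4, 3.0), w, cutoff))
print(classify((4.6, 1.1), w, cutoff))
```

The rule reduces the two measurements to a single score, so a new observation is assigned to a group by comparing that score with one cutoff value.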
Add SPC skills
If a fourth statistical competency is to be added for the statistical engineer, it should be in the area of statistical process control (SPC). Many companies are involved with some form of quality control either at the univariate or multivariate level [4]. Applying SPC techniques can solve many day-to-day problems, reduce costs and produce a better-quality product.
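At the multivariate level, one widely used statistic is Hotelling's T-squared, which measures how far an observation lies from the historical mean while taking the correlation between variables into account. The sketch below handles two variables only. The historical data and function name are hypothetical, and the control limit, which would come from a reference distribution, is left out.

```python
def t_squared(point, history):
    """Hotelling T-squared distance of a 2-variable point from historical data."""
    n = len(history)
    mx = sum(x for x, _ in history) / n
    my = sum(y for _, y in history) / n
    sxx = sum((x - mx) ** 2 for x, _ in history) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in history) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in history) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # d' S^-1 d written out with the closed-form inverse of a 2x2 matrix.
    return (syy * dx ** 2 - 2 * sxy * dx * dy + sxx * dy ** 2) / det


# Hypothetical in-control history for two positively correlated variables.
history = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

print("follows the correlation:", round(t_squared((4.0, 4.0), history), 3))
print("breaks the correlation: ", round(t_squared((4.0, 2.0), history), 3))
```

Both test points are the same straight-line distance from the historical mean, but the one that breaks the usual relationship between the variables produces a much larger statistic. Two separate univariate charts could miss that kind of signal.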
A good understanding of basic applied statistical procedures, ANOVA and experimental design techniques, multiple regression analysis and other applied multivariate techniques, and SPC will provide the best statistical background for a statistical engineer.
You might think we’re trying to make statisticians out of engineers. This is not the case. Statistical engineers are engineers first, and they continue to be primarily concerned with the job of running processing units. Statistics can provide an additional set of tools to help engineers accomplish this goal.
1. Robert L. Mason and John C. Young, Multivariate Statistical Process Control with Industrial Applications, ASA-SIAM, 2002.
2. Robert L. Mason and John C. Young, "A Remedy Using Residuals," Quality Progress, September 2009, pp. 52-54.
3. Robert L. Mason and John C. Young, "Multivariate Tools: Principal Component Analysis," Quality Progress, February 2005, pp. 83-85.
4. Mason and Young, Multivariate Statistical Process Control with Industrial Applications, see reference 1.
Robert L. Mason is an institute analyst at Southwest Research Institute in San Antonio. He has a doctorate in statistics from Southern Methodist University in Dallas and is a fellow of ASQ and the American Statistical Association.
John C. Young is a retired professor of statistics at McNeese State University in Lake Charles, LA. He received a doctorate in statistics from Southern Methodist University.