A Unified Approach
Bringing together statistical engineering and data science to solve problems
by L. Allison Jones-Farmer and Roger Hoerl
The emerging discipline of statistical engineering1-4 focuses on how to link and integrate multiple methods—including those outside of statistics—in an overall approach to address complex problems.
The discipline of data science has emerged to meet the challenges society faces from the influx of data created by dramatic advances in the technology to manage, process and store it. It is difficult to remember a world without email, smartphones or the ability to “Google” a person or a phrase. Further, this technology revolution has, to a large degree, “flattened” much of the world by providing people in developing countries unprecedented access to knowledge and information.5
The ability to acquire, store, process and analyze data has grown roughly exponentially for decades (Moore’s law), resulting in larger data sets and more sophisticated methods for analyzing them. Many disciplines contribute to such analysis, including statistics, analytics, artificial intelligence (AI), management science, data mining, operations research and machine learning. The term “data science” often is used to represent a combination of these analysis approaches with modern computer science and IT for data acquisition, processing and analysis.
Given the existence of such an established set of analytics approaches, what exactly does statistical engineering add? In this column, we’ll attempt to answer this question by examining a real problem.
The International Statistical Engineering Association (ISEA) defines statistical engineering as the systematic integration of statistical concepts, methods and tools—often with other relevant disciplines—to solve important problems sustainably.6 The focus of statistical engineering is on integrating methods to solve large, complex and unstructured problems in a way that provides a solution that works within the broader context of an organization.
The relationship between neighborhood blight removal and property value is a real, messy problem that Allison Jones-Farmer, one of this column’s authors, worked on in partnership with a local county government and a team of student researchers. It is an example of a statistical engineering problem for several reasons.
First, neighborhood renewal is a high-impact problem. Blight removal refers to the demolition of blighted properties: in this case, residential properties that are dilapidated or beyond repair, which are demolished to reduce crime and improve the value of the remaining properties in the neighborhood. The study’s goal was to understand whether there is a relationship between blight removal and property values to guide community action. The technical and political aspects of blight removal, property value and home sales over time make this a complex issue.
The data for the blight removal study were curated from local government agencies. Although most of the data are available from the county auditor, they exist in multiple tables that must be reformatted for consistency and carefully merged. The team held many discussions with government officials and colleagues in analytics, information systems, economics and political science. Key discussion points included the definitions of blight, proximity and market value, and which variables should be used as controls. The inclusion and exclusion criteria for properties are complex and politicized.
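As a minimal sketch of the kind of standardization and merging described above (this is not the study’s actual pipeline; the parcel IDs, field names and records are hypothetical):

```python
# Sketch: standardize parcel IDs from two agencies, then merge their tables.
# All field names and records below are hypothetical illustrations.

def normalize_parcel_id(raw):
    """Standardize parcel IDs so tables key consistently:
    drop punctuation/spaces, uppercase, zero-pad to a fixed width."""
    cleaned = "".join(ch for ch in raw if ch.isalnum()).upper()
    return cleaned.zfill(10)

# Auditor's property table, keyed by normalized parcel ID.
properties = {
    normalize_parcel_id("12-34-5"): {"market_value": 85000, "year_built": 1948},
    normalize_parcel_id("12-34-6"): {"market_value": 91000, "year_built": 1952},
}

# Demolition (blight-removal) records from a separate agency,
# with a different parcel ID format.
demolitions = [
    {"parcel": "12 34 5", "demolished": 2019},
]

# Merge: attach demolition info to matching property records (a left join).
for rec in demolitions:
    key = normalize_parcel_id(rec["parcel"])
    if key in properties:
        properties[key]["demolished"] = rec["demolished"]

print(properties[normalize_parcel_id("12-34-5")])
```

In practice each agency’s IDs, dates and value fields need their own cleaning rules, and unmatched records must be investigated rather than silently dropped.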
There are no right answers for most of the issues raised by this study. There is little academic literature on this type of problem, the solutions that do exist are limited, and there is plenty of room for improvement. Throughout the process, decisions must be made, justified and carefully documented. Clearly, this is not a textbook applied statistics problem. But how should such problems be approached?
The following is a high-level overview of the steps completed to frame the problem and prepare the data for analysis:
- Work with the client to understand the study’s goal. In this case, the client consists of county government officials who are funding neighborhood blight removal.
- Work with local government entities to obtain the data. Standardize and reformat the data for consistency. Define the key measures such as property value and blight removal. Define the control variables.
- Use geographic information systems to geocode the data and develop a system to summarize the spatial and neighborhood characteristics of each property in the county.
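As a simplified sketch of the kind of neighborhood summary the geocoding step enables, the following counts demolition sites within a fixed radius of a geocoded property. The coordinates, radius and function names are hypothetical; a real study would use GIS tooling and projected coordinates:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def demolitions_within(prop, demos, radius_m=250):
    """Count demolition sites within radius_m of a property (lat, lon)."""
    return sum(
        1 for d in demos
        if haversine_m(prop[0], prop[1], d[0], d[1]) <= radius_m
    )

# Hypothetical geocoded points (lat, lon).
property_pt = (39.51, -84.74)
demo_sites = [(39.5105, -84.7402),  # roughly 60 m away
              (39.53, -84.70)]      # several kilometers away

print(demolitions_within(property_pt, demo_sites))  # prints 1
```

Summaries like this, computed for every parcel in the county, become the neighborhood characteristics used later in the model.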
These descriptions oversimplify the work involved; each of the first three steps required several months of effort. After they were completed, the team was ready to model the data. These steps followed:
- Research appropriate statistical methods to model the spatial and temporal aspects of the data while estimating the relationship between blight removal and property value.
- Develop a statistical model to estimate the relationship between blight removal and market value.
- Develop materials and whitepapers to be delivered to the client. Work with the client to use the results of the statistical model to direct future neighborhood renewal initiatives.
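As a toy illustration of the modeling step (not the study’s actual model, which had to account for spatial and temporal structure and control variables), a least-squares slope can estimate how sale prices move with nearby demolition counts. All numbers and names below are fabricated for illustration:

```python
# Toy illustration: ordinary least-squares slope relating nearby
# demolition counts to sale prices. The real analysis used spatial and
# temporal modeling with controls; these data are fabricated.

def ols_slope(x, y):
    """Least-squares slope of y on x: cov(x, y) / var(x)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var

demos_nearby = [0, 1, 2, 3, 4]     # demolitions within some radius
sale_price = [80, 84, 86, 91, 95]  # sale price, $1,000s (hypothetical)

slope = ols_slope(demos_nearby, sale_price)
print(round(slope, 2))  # prints 3.7
```

Even in the real study, a single coefficient like this is only meaningful once the data-preparation and control-variable decisions described earlier have been made and documented.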
As the steps illustrate, statistical engineering is about solving big problems sustainably using statistical methods in thoughtful and responsible ways—typically by integrating multiple methods and even multiple disciplines. It is about engineering solutions.
Statistical engineering often requires bringing together people from different disciplines at different levels, often across multiple organizations, to work on a problem together. This study included participants with expertise in statistics, information systems, geographic information systems and economic modeling. All were needed to find a solution that worked for this problem.
Statistical engineering and data science
A challenge in clearly articulating the difference between statistical engineering and data science is that there are multiple definitions of data science in use today. Virtually all of them include a mixture of statistics, machine learning and methods from computer science and IT.
Many data science problems are large, complex and at least initially unstructured. In this sense, they could be considered statistical engineering problems as well. For data science problems, the underlying approach used to fit the model is often predictive rather than explanatory.7, 8 In the data science paradigm, data typically are plentiful and models are developed and validated empirically rather than theoretically, using the common task framework9 that employs benchmark data sets or cross-validation methods.
Although the goal of predictive modeling is to select an optimal model based on measures such as lift, gain or root mean squared error, it is still imperative to select a model that is sustainable and works within the context of a broader organization. Therefore, the problem-solving skills needed to engineer a solution for large, complex and unstructured problems are the same.
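The empirical validation described above can be sketched compactly. Below is a minimal k-fold cross-validation loop that estimates out-of-sample root mean squared error for a stand-in “model” that predicts the training mean; any fitting procedure could be substituted. The data and function names are hypothetical:

```python
from math import sqrt

def kfold_rmse(x, y, fit, predict, k=5):
    """Estimate out-of-sample RMSE by k-fold cross-validation."""
    n = len(x)
    sq_err, count = 0.0, 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th point held out
        train_x = [x[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        model = fit(train_x, train_y)      # fit on the training folds
        for i in test_idx:                 # score on the held-out fold
            sq_err += (y[i] - predict(model, x[i])) ** 2
            count += 1
    return sqrt(sq_err / count)

# A trivial "model": predict the training mean (stand-in for any learner).
fit_mean = lambda xs, ys: sum(ys) / len(ys)
predict_mean = lambda model, xi: model

x = list(range(10))
y = [2.0, 2.1, 1.9, 2.2, 2.0, 1.8, 2.1, 2.0, 1.9, 2.2]
print(round(kfold_rmse(x, y, fit_mean, predict_mean), 3))
```

The point of the exercise is that the model’s quality is judged on data it was not fit to, which is the empirical-validation posture the data science paradigm takes.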
Statistical engineering, while grounded in history, is a new discipline that has emerged out of a collaboration of industrial and academic statisticians and engineers, working together to build a unified approach to solving such problems. This is the main aspect of statistical engineering that can provide value to the data science community.
Roger Hoerl and Ronald Snee describe the phases of a statistical engineering project using a model similar to the one given in Figure 1.10 Although not intended as a “cookbook” for problem solving, a generic model such as this can serve as the general guide to help practitioners develop a tailored approach for many complex problems.
Hoerl and Snee11 further point out that smart statisticians and engineers (for example, Ronald Fisher, William Gosset, George E.P. Box and John Tukey) have been solving large, complex and unstructured problems for a long time. What has been missing, however, is documentation of exactly how these people approached their problems and found solutions.
Figure 1 is an attempt to do just this, at a high level, based on the existing literature in statistical and engineering problem solving and the authors’ own experience. This model could be taught to statisticians and engineers in academia, shortening the learning curve for solving large, complex and unstructured problems.
Many other methods exist for solving data-related problems, and there is no universal, best method for solving problems. Our view, however, is that most approaches are tactical rather than strategic, in that a reasonably well-formulated problem is typically assumed.
We find the comprehensive approach given in Figure 1, when augmented with the additional detail on how to deploy this approach in practice,12 to be much more strategic.
Developing evidence-based strategic approaches for solving complex problems is the primary focus of the field of statistical engineering and is its most important value-add to applied statistics and data science.
Developing continuing education curricula for existing data practitioners in statistics, engineering, analytics and data science also is a key ISEA initiative. In addition, ISEA collaborates with universities to develop curricula to educate the next generation of problem solvers.13
References and notes
- Roger W. Hoerl and Ronald D. Snee, “Statistical Engineering: An Idea Whose Time Has Come?” The American Statistician, Vol. 71, No. 3, 2017, pp. 209-219.
- Ronald D. Snee and Roger W. Hoerl, “Proper Blending: The Right Mix Between Statistical Engineering and Applied Statistics,” Quality Progress, June 2011, pp. 46-49.
- Alexa DiBenedetto, Roger W. Hoerl and Ronald D. Snee, “Solving Jigsaw Puzzles: Addressing Large, Complex, Unstructured Problems,” Quality Progress, June 2014, pp. 50-53.
- Roger W. Hoerl and Ronald D. Snee, “Guiding Beacon: Using Statistical Engineering Principles for Problem Solving,” Quality Progress, June 2015, pp. 52-54.
- Thomas L. Friedman, The World Is Flat, Farrar, Straus and Giroux, 2005.
- International Statistical Engineering Association (ISEA), https://isea-change.org.
- Leo Breiman, “Statistical Modeling: The Two Cultures,” Statistical Science, Vol. 16, No. 3, 2001, pp. 199-231.
- Galit Shmueli, “To Explain Or to Predict?” Statistical Science, Vol. 25, No. 3, 2010, pp. 289-310.
- David Donoho, “50 Years of Data Science,” Journal of Computational and Graphical Statistics, Vol. 26, No. 4, 2017, pp. 745-766.
- Hoerl and Snee, “Statistical Engineering: An Idea Whose Time Has Come?” see reference 1.
- This approach is available on the ISEA website at https://isea-change.org.
- For more information, visit ISEA’s website at https://isea-change.org.
L. Allison Jones-Farmer is the Van Andel professor of analytics at Miami University in Oxford, OH. She holds a doctorate in applied statistics from the University of Alabama in Tuscaloosa. Jones-Farmer is a senior member of ASQ.
Roger W. Hoerl is a Brate-Peschel associate professor of statistics at Union College in Schenectady, NY. He has a doctorate in applied statistics from the University of Delaware in Newark. Hoerl is an ASQ fellow, a recipient of the ASQ’s Shewhart Medal and Brumbaugh Award, and an academician in the International Academy for Quality.