Back to the Future

by Ronald D. Snee and Jason J. Kamm

Analyzing process data has long played a significant role in problem solving, troubleshooting and improvement. Sometimes we take this analysis for granted.

Consider the emergence of powerful IT, easy-to-use statistical software and recent regulatory trends. These include the Food and Drug Administration’s process analytical technology (PAT) initiative, which is based on applying statistical techniques to large volumes of data. We might assume that analysis will continue the same way it always has, only bigger, better and faster.

To use a familiar analogy, managing the torrents of data that pour from today’s electronic databases is like trying to drink from a fire hose. While statistical software is useful, it cannot guide and focus our investigations. At the same time, trends like PAT might lead us to believe that regulators simply want more data rather than a better understanding of processes.

More than ever, we need to use the kind of critical thinking and deeper process knowledge that predates the explosion of technology and software. In other words, it’s back to the future.

By identifying and applying key aspects of the data mining process (see the sidebar “Some Key Considerations in Mining Process Data”) and then using them in novel ways, we can turn data into information and information into knowledge.

Developing Useful Process Models

You cannot improve a process you don’t understand. Process understanding—the ability to accurately predict a process’s future performance—is essential for developing and sustaining process improvements. Such understanding is a fundamental goal of troubleshooting, problem solving and other types of process studies.

That understanding is greatly enhanced by developing models of the form Y = f (X), where the Y’s are the process outputs that measure the performance of the process and the X’s are process inputs, controlled process variables and uncontrolled process variables.1 A simple, three-step process helps guide the building of these models:

  1. Organize and analyze available data to identify key variables (X’s) driving process performance.
  2. Validate the predictability of the model by checking the model predictions with the actual measurements of process performance when they become available.
  3. When the results of the existing data analysis are inconclusive or the existing data inadequate for developing the needed process understanding, use the key variables already identified and the available process understanding to design one or more experiments to better establish and test cause and effect relationships.
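The first two steps can be illustrated with a minimal sketch. It assumes a single X and a straight-line form for f, which is only one of many possible model forms; the variable names (blend time, dissolution) and all data values are hypothetical and chosen purely for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x for a single X."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def prediction_errors(model, x_new, y_new):
    """Actual minus predicted Y for measurements collected after the fit."""
    intercept, slope = model
    return [yi - (intercept + slope * xi) for xi, yi in zip(x_new, y_new)]

# Step 1: fit the model on existing data (hypothetical values).
x_hist = [10.0, 12.0, 14.0, 16.0, 18.0]   # X: blend time, minutes
y_hist = [70.0, 74.0, 78.0, 82.0, 86.0]   # Y: dissolution, percent
model = fit_line(x_hist, y_hist)

# Step 2: compare predictions with new measurements as they arrive.
x_new = [11.0, 17.0]
y_new = [72.5, 83.0]
errors = prediction_errors(model, x_new, y_new)
rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5

print("intercept, slope:", model)
print("prediction errors:", errors)
print("RMSE on new data:", round(rmse, 3))
```

If the errors on new data are much larger than the scatter seen in the historical fit, that is a signal to move to step 3 and design an experiment rather than trust the model.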

Given the large volume of data in electronic databases, it is critical when building Y = f (X) models to use the right data and understand the degree of difficulty required to get the data into a database that is amenable to analysis.

Begin by observing the process, conducting focus interviews, reviewing important documentation, collecting data and brainstorming hypotheses. Unfortunately, many skip the process observation and interviews and go straight to data analysis. Not surprisingly, they often come up empty-handed.

In developing these models, it is useful to evaluate the data from three perspectives:

  • Process context: Examine the data relative to the context and subject matter of the process that produced it. Check for reasonableness and atypical results.
  • Graphical: Plot the data in various ways to identify key variables and atypical data points.
  • Analytical: Construct statistical models to identify important variables.

Data collected without the aid of statistical design are suspect for many reasons: important variables have not been measured or varied over a wide range, and the measurement variation might be unknown or large; data errors are present and data records incomplete; and process variables are correlated, which can produce misleading, if not erroneous, results.2

Designed experiments typically avoid those problems. For that reason, when analysis of the existing data yields inconclusive results or the existing data are inadequate for developing process understanding, turn to the key variables identified in the analysis of existing process data and to your process knowledge. You can use both to design experiments that better establish, verify and confirm cause and effect relationships.
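One of the hazards of historical data, correlated process variables, can be screened for before any modeling begins. The sketch below computes pairwise Pearson correlations among the X columns and flags strongly correlated pairs. The 0.8 cutoff, the column names and the data are all assumptions made for illustration, not values from the projects described here.

```python
from itertools import combinations
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged

# Hypothetical historical X data, one value per lot.
x_columns = {
    "spray_rate": [1.0, 2.0, 3.0, 4.0, 5.0],
    "plow_speed": [2.1, 3.9, 6.2, 8.0, 9.8],
    "api_particle_size": [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = correlated_pairs(x_columns)
print(flagged)
```

When two X’s move together like this in the historical record, the data alone cannot say which one is driving the Y, and a designed experiment that varies them independently is the usual remedy.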

Constructing the Database

The data being analyzed must also be in the right format for meaningful analysis. Because data cannot be analyzed across databases, they must be collected in a single database. For this purpose, any two-dimensional array (rows × columns) is a database. All databases need to be reduced to this format to enable the application of statistical techniques.

To achieve that, create a two-dimensional database in which the rows consist of different sets of process conditions in order of time (days and weeks) and tests, and the columns consist of X’s and Y’s in the data table or matrix.
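A minimal sketch of such a table follows, assuming the X’s and Y’s live in two separate sources that share a lot identifier. The lot IDs, dates, column names and values are hypothetical.

```python
from datetime import date

# Hypothetical process records (X's) and lab results (Y's), keyed by lot ID.
process_records = {
    "LOT-002": {"date": date(2024, 1, 9), "spray_time_min": 6.5, "plow_speed_rpm": 150},
    "LOT-001": {"date": date(2024, 1, 2), "spray_time_min": 6.0, "plow_speed_rpm": 148},
    "LOT-003": {"date": date(2024, 1, 16), "spray_time_min": 7.0, "plow_speed_rpm": 152},
}
lab_results = {
    "LOT-001": {"dissolution_pct": 82.0},
    "LOT-002": {"dissolution_pct": 79.5},
    # LOT-003 has not been tested yet, so its Y is missing.
}

def build_table(process_records, lab_results, x_names, y_names):
    """One row per lot in time order; columns are the X's followed by the Y's."""
    rows = []
    for lot_id, x_values in process_records.items():
        y_values = lab_results.get(lot_id, {})
        row = {"lot_id": lot_id, "date": x_values["date"]}
        for name in x_names:
            row[name] = x_values.get(name)
        for name in y_names:
            row[name] = y_values.get(name)  # None marks a gap in the record
        rows.append(row)
    rows.sort(key=lambda r: r["date"])
    return rows

table = build_table(
    process_records, lab_results,
    x_names=["spray_time_min", "plow_speed_rpm"],
    y_names=["dissolution_pct"],
)
for row in table:
    print(row)
```

Keeping gaps visible as empty cells, rather than dropping the row, makes incomplete records easy to spot when the data are reviewed in their process context.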

Consider the experience of a mid-sized global pharmaceutical manufacturer of a blockbuster product that formed a project team to review a process that spanned many unit operations. The team assembled a database to relate finished product release characteristics (such as dissolution) to the process parameters of each unit operation and to raw material characteristics.

Unit operations included high shear granulation, drying, two blending steps and tablet compression. There were also numerous raw materials: the active pharmaceutical ingredient (API), binders, lubricants and others.

Clearly, the problem was factorial. For example, one final lot of tablets (with associated Y’s related to dissolution) could come from more than two or three blended lots, which could come from four or five granulation or drying lots, which could have as many as 10 lots of each raw material.

Correlating the dissolution (Y) back to the many X’s quickly became a nightmare. The team was able to line up the X’s with the Y’s through a laborious process using weighted averages of the process and raw material variables.
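The weighted-average idea can be sketched as follows. The weighting scheme the team actually used is not described here, so this example assumes each upstream lot is weighted by the fraction of material it contributes; the lot names, fractions and parameter values are hypothetical.

```python
def weighted_average(values_by_lot, fractions_by_lot):
    """Average an upstream variable, weighted by each lot's contribution."""
    total = sum(fractions_by_lot.values())
    weighted_sum = sum(values_by_lot[lot] * w for lot, w in fractions_by_lot.items())
    return weighted_sum / total

# Hypothetical X recorded at the granulation step, one value per granulation lot.
granulation_endpoint_power = {"G1": 10.0, "G2": 12.0, "G3": 11.0}

# Hypothetical genealogy: which granulation lots went into each blend lot,
# and which blend lots went into the final tablet lot.
blend_composition = {
    "B1": {"G1": 0.5, "G2": 0.5},
    "B2": {"G2": 0.25, "G3": 0.75},
}
tablet_composition = {"B1": 0.4, "B2": 0.6}

# Roll the X forward one step at a time until it lines up with the tablet lot's Y.
blend_level = {
    blend: weighted_average(granulation_endpoint_power, composition)
    for blend, composition in blend_composition.items()
}
tablet_level = weighted_average(blend_level, tablet_composition)

print(blend_level)
print(round(tablet_level, 3))
```

The same roll-up would be repeated for every X at every upstream step, which gives a sense of why the manual version of this task was so laborious.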

During the entire improvement project, assembling this database proved to be the team’s most difficult task. Thereafter, analysis of the data was straightforward. Clearly, it’s better to build such databases as you go and record history as it happens rather than trying to recreate it later.

Identifying the Critical X’s

For models of the form Y = f (X) to be useful, you must identify the critical X’s—the process inputs and controlled and uncontrolled process variables that have the greatest effect on the Y’s (the process output variables).

Some Key Considerations In Mining Process Data

  1. Think in terms of Y = f (X) models. The key X’s will lead to effective problem solving and process improvement.
  2. Enhance process understanding by observation, walkthroughs and operator interviews.
  3. Assemble a database—from historical records and experimentation as needed—that enables the effective identification of the critical X’s.
  4. Be aware of the limitations of historical data and the power of designed experiments.
  5. Use qualitative methods (cause and effect matrix) as well as quantitative methods (analysis of Cpk statistics for X variables) to prioritize the list of X variables.
  6. Analyze and evaluate the data from three key perspectives: process, graphical and analytical.
  7. Test the model’s validity by comparing the model predictions with measures of future process performance or conducting designed experiments to confirm the model’s results. —R.S. and J.K.

Two of the most effective techniques for identifying the critical X’s are:

  1. Cause and effect (C&E) matrix analysis.
  2. Assessment of process capability of the process variables (X’s).

The C&E matrix, widely used in Six Sigma projects,3 uses the longstanding prioritization matrix concept.4 First, the process is flowcharted and the X variables identified for the entire process and for each process step. Then the C&E matrix uses expert opinion and any available data to assess the relationships between the X’s and the process output variables (Y’s). The relationships are quantified and ranked in a list of X variables from most important to least important. The goal is to leverage knowledge of the process to find the small number of critical X’s, typically three to six.
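The scoring behind a C&E matrix can be sketched in a few lines. This example assumes a common convention in which each Y carries an importance weight and each X-to-Y relationship is rated 0, 1, 3 or 9; that convention, the variable names and every number below are assumptions for illustration rather than details taken from the projects described here.

```python
# Hypothetical importance of each output (Y), on a 1-10 scale.
y_importance = {"dissolution": 10, "hardness": 6}

# Hypothetical team ratings of how strongly each X affects each Y (0, 1, 3 or 9).
ratings = {
    "endpoint_power": {"dissolution": 9, "hardness": 3},
    "spray_rate": {"dissolution": 3, "hardness": 1},
    "plow_speed": {"dissolution": 3, "hardness": 3},
    "api_particle_size": {"dissolution": 9, "hardness": 0},
}

def rank_xs(y_importance, ratings):
    """Score each X as the importance-weighted sum of its ratings, highest first."""
    scores = {
        x: sum(y_importance[y] * rating for y, rating in row.items())
        for x, row in ratings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = rank_xs(y_importance, ratings)
for x_name, score in ranking:
    print(f"{x_name:20s} {score}")
```

The scores are only as good as the expert judgment behind the ratings, so the ranking is best treated as a starting list of candidates to confirm with data.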

The assessment of process capability, another prioritization tool, relies not only on process knowledge but also leverages data. The process capability (Cpk) is measured on the X’s as well as the Y’s, and the X’s are prioritized based on their overall variability relative to their specifications.

Although the concept of capability is often linked to finished product measurements, Cpk measurements can also be used as a diagnostic for parameters (both X and Y variables) that might have high degrees of variability. Parameters that are incapable of meeting their specifications (low Cpk values) might provide a clue to the root cause of a process problem, making troubleshooting far more efficient and productive.

For example, a major pharmaceutical manufacturer was experiencing difficulties with the dissolution rate of one of its key products. An excessively rapid dissolution rate could be lethal for the patient, while an overly slow rate could delay relief.

A capability and control analysis of 11 process parameters over 16 lots of raw material generated 176 Cpk values ranging from -1 to 24.69. The values greater than 1.33 indicated a robust process with little variability in that parameter. Focusing on those parameters could be deferred in searching for the root cause of the dissolution problem.

Instead, the team focused on parameters with low values (Cpk < 1), which were not meeting the specifications that had been set for those X variables. They also watched parameters with borderline values (1 ≤ Cpk < 1.33). With the capability of each parameter prioritized, the team soon identified the process parameters most likely to be at the root of the problem.
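The triage described above can be sketched as follows, using the standard definition Cpk = min(USL − mean, mean − LSL) / (3 × standard deviation). The parameter names, measurements and specification limits are hypothetical, and treating a value of exactly 1.33 as capable is an assumption of this sketch.

```python
from statistics import mean, stdev

def cpk(values, lsl, usl):
    """Cpk = min(USL - mean, mean - LSL) / (3 * sample standard deviation)."""
    centre = mean(values)
    return min(usl - centre, centre - lsl) / (3 * stdev(values))

def triage(cpk_value):
    """Sort a parameter into the three bands used in the example above."""
    if cpk_value < 1.0:
        return "investigate"   # not meeting its specification
    if cpk_value < 1.33:
        return "watch"         # borderline
    return "defer"             # capable; unlikely to be the root cause

# Hypothetical measurements and (LSL, USL) for three X variables.
parameters = {
    "endpoint_power": ([9.8, 10.0, 10.2, 10.1, 9.9], 9.0, 11.0),
    "spray_time": ([5.0, 5.4, 4.6, 5.2, 4.8], 4.4, 5.6),
    "plow_speed": ([150, 152, 148, 151, 149], 144, 156),
}

report = {}
for name, (values, lsl, usl) in parameters.items():
    value = cpk(values, lsl, usl)
    report[name] = (value, triage(value))

# List the least capable parameters first.
for name, (value, band) in sorted(report.items(), key=lambda item: item[1][0]):
    print(f"{name:16s} Cpk = {value:5.2f}  {band}")
```

In practice each Cpk would be computed from far more than five observations, and the usual assumptions (a stable process and roughly normal data) should be kept in mind when reading the result.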

Some might say this is an inappropriate use of Cpk statistics because the assumptions might be faulty. But as we applied Cpk in many process improvement situations, we found it useful not as a hard decision tool but as a means of attaining greater quantitative knowledge of processes. As with any statistical tool, be aware of its limitations but don’t forgo the real value it can offer just because it’s not perfect.

Conducting Range Finding Experiments

Small range finding experiments can add great value. For example, feasibility trials performed at the beginning of a design of experiments (DoE) help ensure that all trials can be executed so the experiment isn’t compromised at the outset. Without such preliminary experiments, you might find during the DoE that some parts of the experiment aren’t feasible.

Range finding experiments can be particularly important when a process is hampered by capacity constraints due to high customer demand or when you might get only one shot at running an experiment. You cannot afford to stop an experiment due to runs that aren’t executable and restart it—you must get it right the first time.

Consider the example of a pharmaceutical high shear mixing unit operation in which a one-factor, range-finding experiment was conducted to find feasible operating space. When the process was scaled up, several batches came out overgranulated. Several hypotheses were generated about the possible cause: endpoint power, spray addition rate, plow speed and API particle size. Then the specification range for each was determined, as shown in Table 1.

Table 1

Although this might appear to be a logical place to initiate a DoE to begin mapping out an acceptable operating space for the process, there could be situations in which unusable granulations would result. For example, a problem might occur when running a combination of low endpoint (shorter granulation time), long spray time and fast plow speed. This might create a batch in which the endpoint was achieved before all of the granulation solution was added, rendering the batch useless in the analysis.
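A quick way to see how many planned runs carry that risk is to enumerate the corners of the proposed design and flag the combination described above. The sketch below uses only level labels, since the actual ranges belong to Table 1; the labels themselves and the two-level full factorial layout are assumptions for illustration.

```python
from itertools import product

# Two hypothetical levels per factor; real ranges would come from Table 1.
factors = {
    "endpoint_power": ("low", "high"),
    "spray_time": ("short", "long"),
    "plow_speed": ("slow", "fast"),
    "api_particle_size": ("small", "large"),
}

def is_at_risk(run):
    """Flag the combination in which the endpoint may be reached before
    all of the granulation solution has been added."""
    return (
        run["endpoint_power"] == "low"
        and run["spray_time"] == "long"
        and run["plow_speed"] == "fast"
    )

# Every corner of the two-level full factorial design.
runs = [dict(zip(factors, levels)) for levels in product(*factors.values())]
at_risk = [run for run in runs if is_at_risk(run)]

print(f"{len(at_risk)} of {len(runs)} runs need a feasibility trial first:")
for run in at_risk:
    print(run)
```

Flagged runs are the natural candidates for range finding trials before committing to the full experiment.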

A team brainstormed potential combinations that would theoretically produce a usable batch, while at the same time using the maximum range for each variable. Because there was more latitude with the endpoint variable, it was determined that using spray time and plow speed at the proposed limits would be essential.

Table 2 shows how conducting two feasibility trials helped determine that using the low endpoint plus 5% enabled the use of a long spray time and a fast plow speed. As a result, the team was able to set the final levels for the variables to be used in the subsequent DoE.

Table 2

As with all of the other techniques discussed here for mining process data, the key to using range finding is to apply the kind of critical thinking and deep knowledge of processes that we had no choice but to use before the IT explosion.

When we combine those tried and true techniques with today’s IT power and use them wisely, we have the potential to increase process understanding exponentially.


  1. R.W. Hoerl and R.D. Snee, Statistical Thinking—Improving Business Performance, Duxbury Press, 2002.
  2. Ibid.
  3. R.D. Snee and R.W. Hoerl, Leading Six Sigma—A Step by Step Guide Based on the Experience With General Electric and Other Six Sigma Companies, FT Prentice Hall, 2003.
  4. Michael Brassard, The Memory Jogger Plus+, GOAL/QPC, 1989.

RONALD D. SNEE is principal of process and organizational excellence and lean Six Sigma initiative leader at Tunnell Consulting in King of Prussia, PA. He has a doctorate in applied and mathematical statistics from Rutgers University in New Brunswick, NJ. Snee has received the ASQ Shewhart and Grant Medals and is an ASQ fellow.

JASON J. KAMM is managing consultant of process and organizational excellence at Tunnell Consulting in King of Prussia, PA. He has a master’s degree in statistics from the University of New Hampshire. He is a senior member of ASQ and a certified Six Sigma Black Belt.
