Inquiry on Pedigree
Do you know the quality and origin of your data?
by Ronald D. Snee and Roger W. Hoerl
The media frequently report on examples of situations in which results from statistical studies are not reproducible. A recent article in the New York Times reported how a sophisticated study went wrong—not due to poor analysis, but rather because of poor data quality.1 Genomic studies at Duke University showed promise in directing cancer treatment, but when patients weren’t achieving the positive outcomes expected, two statisticians were called in to reexamine the research.
"Dr. (Keith) Baggerly and Dr. (Kevin) Coombes found errors almost immediately. Some seemed careless—moving a row or column over by one in a giant spreadsheet—while others seemed inexplicable. The Duke team shrugged them off as ‘clerical errors.’ In the end, four gene signature papers were retracted. Duke shut down three trials using the results. (Lead investigator) Dr. (Anil) Potti resigned from Duke…His collaborator and mentor, Dr. (Joseph) Nevins, no longer directs one of Duke’s genomics centers. The cancer world is reeling."2-3
The analysis was not the primary issue in this case. Data quality was. The lesson learned: always carefully consider proper data collection and, wherever possible, proactively collect data that answer the key questions about the process. Relying on whatever data happen to be available, or assuming sophisticated analytics can overcome poor data quality, is poor practice. Most statistical textbooks address data quantity, but few discuss the critical issue of data quality.
Deep understanding of data
Much of our quality technology and statistics literature assumes that the data at hand are what is needed to solve the problem and that they are of good quality. Textbooks further teach us to assume all data are "random samples." In practice, we know this isn’t always the case and, in fact, it is the exception rather than the rule. Fellow Statistics Roundtable columnists Necip Doganaksoy and Gerald J. Hahn properly discussed the challenges of getting the right data at the beginning of a study.4 But what do we do when the data are already in hand?
In the world of horses and other livestock, if you want to assess and predict an animal's quality and how it will perform, you look at its pedigree. Triple Crown-winning horses often produce winning offspring. Similarly, assessing the pedigree of your data can help you avoid accepting poor-quality data at face value, as well as performing the wrong analysis. This means evaluating:
- The science, engineering and structure of the process or product from which the data were collected.
- The data collection process used to obtain and prepare the data for analysis.
- How the measurements were made.
Understanding the data pedigree is critical to ensuring the data quality is known and understood. Data collected without controls and careful administration of the data collection process often contain erroneous results, mistakes in data values and missing data. The fact that the data reside in electronic files says nothing about their quality. Data mining, as commonly practiced, often rests on these tenuous assumptions.5-6
Knowing how the data were collected also is critical to performing the correct analysis. With that knowledge, the data structure and sources of variation are easily identified, and the form of the model that best fits the structure and situation becomes more apparent (crossed versus nested factors, quantitative versus qualitative factors and responses, and factor levels).
Analyses of poor-quality data, conducted with or without an understanding of the process, sampling and testing, almost certainly produce:
- Erroneous results.
- Models that have poor prediction accuracy.
- Results that can’t be reproduced by other investigators.
Reproducibility involves more than just avoiding the wrong analysis. The Duke study is a classic example of this: the data pedigree issue is also critical to success.
In general, observational data often have reproducibility issues: they are collected under very specific circumstances, but people try to generalize the results too broadly. Some (not all) of the conclusions from the famous Framingham Heart Study, done entirely observationally, were later refuted by randomized trials. In that study, for example, the more saturated fat people ate, the lower their serum cholesterol, which is clearly not consistent with medical understanding of diet and cholesterol.7
There’s another example in which observational data revealed higher death rates for pipe smokers than cigarette smokers.8 Surprised investigators dug a little deeper and discovered pipe smokers tend to be much older than cigarette smokers. In other words, the higher death rates were driven by age, not pipes.
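The age-confounding mechanism in the smoking example can be made concrete with a few lines of arithmetic. The numbers below are entirely hypothetical (not the study's data); they simply show how crude death rates can favor one group while the age-stratified rates tell the opposite story.

```python
# Hypothetical numbers (not from the cited study) illustrating how age
# can confound a comparison of death rates between two smoking groups.

# (deaths, person_years) by age stratum for each group; the pipe group
# is deliberately made much older (half its exposure is over 65).
pipe = {"under_65": (2, 1000), "over_65": (30, 1000)}
cigarette = {"under_65": (4, 1800), "over_65": (25, 200)}

def crude_rate(group):
    """Overall death rate, ignoring age."""
    deaths = sum(d for d, _ in group.values())
    years = sum(y for _, y in group.values())
    return deaths / years

def stratum_rates(group):
    """Death rate within each age stratum."""
    return {age: d / y for age, (d, y) in group.items()}

# Crude rates suggest pipe smokers fare worse...
print("crude:", crude_rate(pipe), "vs", crude_rate(cigarette))
# ...but within each age stratum the ordering reverses, because the
# pipe group's exposure is concentrated in the high-risk older stratum.
print("pipe by age:", stratum_rates(pipe))
print("cigarette by age:", stratum_rates(cigarette))
```

Stratifying by the lurking variable, rather than pooling over it, is the basic defense against this kind of erroneous inference from observational data.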
Such potentially erroneous inferences are all too common with observational data. We note in passing that all data collected on manufacturing and service processes without the benefit of a carefully designed data collection process (for example, a designed experiment) should be viewed as observational data, with the limitations discussed earlier.
What should you look for?
Consider these actions when you’re looking at your data’s overall pedigree:
Assess data quality. The following examples give a closer look at what statisticians and quality professionals often do in their daily work. The first story relays a data quality issue.
The ambient air quality standard for carbon monoxide (CO) was 9 ppm (eight-hour average), not to be exceeded more than once per year. Thus, the second-highest eight-hour average in a year was being used to assess the air quality in the vicinity of the sampler. This raised concerns because the second-highest value is highly variable due to sampling variation, meteorological variation and traffic volume.
It had been reported that at the Denver sampling station in 1971, the second-highest CO value was 35 ppm with the maximum value of 39 ppm, well above the standard. Researchers thought it prudent to study the hourly data used to compute the second-highest value.9 A plot of the hourly CO values for the period in question showed 10 consecutive hourly readings of 39 ppm, with four out of the next six hourly readings at 39 ppm and the remaining two readings at 36 ppm.
This small amount of variation over a 16-hour period is not typical of variation in hourly CO readings and does not represent an accurate characterization of the air quality in the area of the sampler. It is highly probable these data are the result of equipment malfunction. A similar problem was found in the CO data from Cincinnati in 1968.10
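One pedigree check the CO story suggests is to flag implausibly long runs of identical consecutive readings, which often indicate a stuck or malfunctioning instrument rather than real air quality. The sketch below is our illustration, not the original researchers' code; the hourly series and the run-length threshold are hypothetical, chosen to echo the Denver pattern.

```python
# A minimal screening check: real hourly CO readings fluctuate, so a
# long run of identical values is a red flag for equipment malfunction.

def longest_flat_run(readings):
    """Length of the longest run of identical consecutive values."""
    best = run = 1
    for prev, cur in zip(readings, readings[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

# Hypothetical hourly series echoing the Denver pattern: many straight
# hours pinned at 39 ppm, far flatter than genuine CO data ever look.
hourly = [12, 18, 25, 31] + [39] * 10 + [39, 36, 39, 39, 36, 39] + [20, 11]

if longest_flat_run(hourly) >= 6:   # threshold is an assumption
    print("suspect equipment malfunction: readings are too flat")
```

A simple time plot of the hourly values, as the researchers used, reveals the same anomaly visually; the function merely automates the screen for large data sets.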
Assess the measurement process. When evaluating data quality, always think about the measurement process: how the measurements were made and who made them. Operator differences are a common occurrence. Operator fatigue can lead to shortcuts in measurement procedures and incorrectly recorded data (for example, transposed digits or skipped test randomization). The measurement gauge could drift out of calibration without the operator knowing it, producing incorrect data. Different operators also could round off the results differently.
For example, an improvement project was shut down in the measure phase when it was discovered the measurement instrument had not been calibrated for two years. After calibration, the product problems disappeared completely (zero defects), saving $157,000 per year in scrap. Case closed.
Understand how the process operates. The next case involves the need to deepen your understanding of the data pedigree to properly analyze and interpret the results of an experiment. The initial analysis of an experiment to evaluate a second source of raw material supply produced no significant effects, except a three-factor interaction involving shift differences, which was believed to be spurious.
After a careful discussion of how the process associated with the data operated, it was discovered a 24/7 three-shift operation was conducted by four operating teams. In effect, the shift variable in the model was measuring the time of day effect (shift-to-shift variation) and differences among the teams.
When the shift and team effects were added to the model as different variables, the results were better behaved. It was concluded there was no difference between the two raw material sources, and team four—due to its greater experience—produced yields that were 5% higher than the other teams, a large increase given the high volume of product produced by the process. This unexpected finding provided a method to increase process yields.
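The lesson of this case can be sketched with made-up records: when a 24/7 operation rotates four teams through three shifts, grouping yields by "shift" alone smears the team effect across the time-of-day effect, so both variables must be recorded and examined separately. The records below are hypothetical, constructed so that team four runs roughly 5% higher, as in the story.

```python
# Hypothetical (shift, team, yield) records for a rotating-team
# operation; team 4 is built to yield ~5% higher than the others.
from collections import defaultdict

records = [
    ("day", 1, 90), ("evening", 2, 89), ("night", 3, 90),
    ("day", 4, 95), ("evening", 4, 94), ("night", 4, 95),
    ("day", 2, 91), ("evening", 3, 90), ("night", 1, 89),
]

def mean_by(records, key_index):
    """Mean yield grouped by the record field at key_index."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec[2])
    return {k: sum(v) / len(v) for k, v in groups.items()}

# Grouping by shift alone mixes team and time-of-day effects, so the
# shift means differ little; grouping by team makes team 4 stand out.
print("by shift:", mean_by(records, 0))
print("by team: ", mean_by(records, 1))
```

In a formal analysis, the same separation is achieved by entering shift and team as distinct (crossed) factors in the model rather than a single confounded "shift" variable.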
Understand how the product was made. A process engineer was concerned about frequent stops of the production line caused by defective plastic components jamming the sorter wheel.11 The engineer discovered each component had the number of the mold cavity that made the component stamped on the component.
The engineer requested his operators collect the defective components each time a stoppage occurred. At the end of the day, he reviewed the accumulated defective parts and recorded the number of defective components for each of the mold cavities that had made the defective parts.
The summary of the data showed the defects were associated with 16 mold cavities. The remaining 16 cavities had no defects. These data could have been sent to the supplier of the components, but the engineer decided to think more about the data: How could the mold be structured? A single line of 32 cavities didn’t make sense.
After considering several candidate geometric configurations, a 4 x 8 array seemed to match the data, suggesting the cavities at the ends of the mold were being "starved" for material (see Tables 1 and 2). When the data and configuration were presented to the supplier, the 4 x 8 mold cavity array was confirmed, and the supplier agreed to get the mold cavity "starving" corrected immediately.
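The engineer's geometric reasoning can be sketched in a few lines: lay the cavity numbers 1 through 32 onto a candidate 4 x 8 array and see whether the defects cluster at the ends of the rows, as material starvation would predict. Row-major cavity numbering and the defect counts below are our assumptions for illustration, not details from the case.

```python
# Map cavity numbers onto an assumed 4 x 8, row-major array and check
# whether defects concentrate in the end columns of each row.

ROWS, COLS = 4, 8

def cavity_position(cavity):
    """Map cavity number 1..32 to (row, col) in the 4 x 8 array."""
    return divmod(cavity - 1, COLS)

# Hypothetical defect counts keyed by cavity number, concentrated in
# the two leftmost and two rightmost columns (the ends of each row).
END_COLS = (0, 1, 6, 7)
defects = {c: 5 for c in range(1, 33)
           if cavity_position(c)[1] in END_COLS}

end_column_defects = sum(n for c, n in defects.items()
                         if cavity_position(c)[1] in END_COLS)
print(len(defects), "defective cavities;",
      end_column_defects, "defects, all in the end columns")
```

Note that this layout reproduces the observed pattern: 16 cavities with defects (four end cavities in each of four rows) and 16 interior cavities with none.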
In this example, data quality was not the issue, but data pedigree was. Only after a clear understanding of how the product was produced could the problem be solved.
Check experiment assumptions. A two-level factorial experiment was run on production equipment when time was available in the process. The analysis found few significant variables, creating surprise and concern because the variables were all thought to be important.
A review of how the experiment was conducted revealed the experimental runs took about nine months to complete. The experiment had not been blocked, creating a possible design flaw when considering the experiment was conducted over a long time period.
An assumption of designed experiments is that all variables are held constant except those being varied according to the design. This assumption is unlikely to be satisfied over a long time period because processes are dynamic and likely to change.
A residual analysis identified a trend in residuals over the length of time the experiment was conducted. The residuals had not been previously evaluated, which many consider to be an analysis flaw. When a time trend variable was added to the model, more of the variables were found to be significant, but the lack of good experiment design cast a cloud over the findings.
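The residual check described here amounts to fitting a least-squares line of the residuals against run order; a slope clearly different from zero signals a time trend the model has ignored. The residual values below are hypothetical, not from the experiment, and drift upward the way a nine-month run might.

```python
# Least-squares slope of residuals vs. run order, computed from the
# usual formula slope = Sxy / Sxx, using only the standard library.

def trend_slope(residuals):
    """Slope of residuals regressed on run order 0, 1, 2, ..."""
    n = len(residuals)
    x_bar = (n - 1) / 2
    y_bar = sum(residuals) / n
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in enumerate(residuals))
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    return sxy / sxx

# Hypothetical residuals, in run order, drifting upward over time
resids = [-2.1, -1.6, -1.8, -0.9, -0.2, 0.1, 0.8, 1.2, 1.7, 2.4]
print(round(trend_slope(resids), 3))  # clearly positive => time trend
```

In practice, a simple plot of residuals against run order shows the same thing at a glance, which is why residual plots should be a routine part of any designed-experiment analysis.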
Cuthbert Daniel’s analysis of a bean field trial illustrates a similar situation.12 The residual variance of the model was high. Daniel plotted the residuals on the field-plot layout and found a significant within-block trend. Randomized block designs assume there is no systematic within-block variation, which suggests the blocks may have been too big to be homogeneous. A more careful evaluation of the field before deciding on block size may have helped.
Do I really understand?
Guidance on how to proceed and what to look for when assessing data pedigree is shown in Table 3. In general, you should look for data issues from the beginning to the end of the project. Trust, but verify. Graphical displays, used constantly, are an invaluable tool for assessing the data.
Always ask yourself, "Do I really understand how the data were collected? Can I trace back and identify the origin of each data point?" A good principle to remember is that data are guilty until proven innocent, not the other way around.
© 2012 Ronald D. Snee and Roger W. Hoerl
- Gina Kolata, "How Bright Promise in Cancer Testing Fell Apart," New York Times, July 8, 2011.
- Darrel Ince, "The Duke University Scandal—What Can Be Done?" Significance, September 2011, pp. 113-115.
- Necip Doganaksoy and Gerald J. Hahn, "Getting the Right Data Up Front: A Key Challenge," Quality Engineering, Vol. 24, No. 4, October-December 2012.
- Emmett Cox, Retail Analytics—The Secret Weapon, John Wiley and Sons, 2012.
- Thomas H. Davenport and Jeanine B. Harris, Competing on Analytics—The New Science of Winning, Harvard Business School Press, 2007.
- William P. Castelli, "Concerning the Possibility of a Nut…" Archives of Internal Medicine, July 1992, Vol. 152, No. 7, pp. 1,371-1,372.
- George Cobb and Stephen Gehlbach, "Statistics in the Courtroom," Statistics: A Guide to the Unknown, fourth edition, Thomson Brooks/Cole, 2006, pp. 3-18.
- Ronald D. Snee and John M. Pierrard, "The Annual Average: An Alternative to the Second Highest Value as a Measure of Air Quality," Air Pollution Control Association Journal, 1977, Vol. 27, No. 2, pp. 131-133.
- Ellis R. Ott, William C. Frey and Louis A. Pasteelnick, "Some Fundamentals of Statistical Quality Control," Transactions of the 23rd annual all-day conference on Quality Control and Statistics in Industry, Rutgers University, New Brunswick, NJ, Sept. 11, 1971, pp. 1-16.
- Cuthbert Daniel, Applications of Statistics to Industrial Experimentation, John Wiley and Sons, 1976.
- Roger W. Hoerl and Ronald D. Snee, Statistical Thinking—Improving Business Performance, John Wiley and Sons, 2012.
Ronald D. Snee is president of Snee Associates LLC in Newark, DE. He has a doctorate in applied and mathematical statistics from Rutgers University in New Brunswick, NJ. Snee has received ASQ’s Shewhart and Grant Medals. He is an ASQ fellow and an academician in the International Academy for Quality.
Roger W. Hoerl is Brate-Peschel assistant professor of statistics at Union College in Schenectady, NY. He has a doctorate in applied statistics from the University of Delaware in Newark. Hoerl is an ASQ fellow, a recipient of the ASQ’s Shewhart Medal and Brumbaugh Award, and an academician in the International Academy for Quality.