Data Mining for Quality

by I. Elaine Allen and Christopher A. Seaman

In a 1996 Quality Progress article, Bert Gunter urged caution in the use of data mining because of the extraordinary amount of hype and false promises surrounding it at the time [1]. Focusing on formal experimental design, his article validated the methodologies used in data mining but contrasted them with the ability to formulate and test hypotheses using standard statistical techniques.

Since the publication of Gunter’s article, collecting large amounts of data has become easier. Volumes of data can be collected from the continuous operation of machines, weblogs and Web transactions, and healthcare studies involving claims data or long-term evaluations of patients. Standard statistical techniques, however, become less meaningful when applied to these enormous databases: standard statistical measures of variability become extremely small, so virtually every comparison of means is statistically significant, even when the difference is too small to matter in practice.
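The effect of sample size on significance can be illustrated with a minimal sketch. It assumes a simple two-sample z-test with equal group sizes and a known common standard deviation; the means, standard deviation and sample sizes are hypothetical numbers chosen for illustration, not values from any study discussed here.

```python
import math

def two_sample_z(mean_a, mean_b, sd, n_per_group):
    """Two-sided z-test for a difference in means (equal SD, equal group sizes)."""
    se = sd * math.sqrt(2.0 / n_per_group)   # standard error of the difference
    z = (mean_a - mean_b) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal p-value
    return se, z, p

# Hypothetical: the same 0.5-unit difference on a scale with SD 10,
# tested at three different sample sizes.
results = {n: two_sample_z(50.5, 50.0, 10.0, n) for n in (100, 10_000, 1_000_000)}
for n, (se, z, p) in results.items():
    print(f"n per group = {n:>9,}: SE = {se:.4f}, z = {z:.2f}, p = {p:.3g}")
```

The difference between the groups never changes; only the standard error shrinks, which is what drives the p-value toward zero as the database grows.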

In a 1998 Technometrics article, Gerry Hahn and Roger Hoerl examined the changing role of statistics and statisticians in business, citing enormous databases and the search for proactive statistical process control methods for understanding and analyzing the questions and issues that arise with these data [2]. While databases have continued to grow in size, published research that uses data mining to examine the quality of data has not kept pace.

This article examines a large healthcare trial in which we used data mining techniques to assess the quality of the data. We combined these techniques with standard statistical analyses to identify nonrandomness and lack of homogeneity in the data—all of which led to a surprisingly nonstatistical conclusion.

The Study

A National Institutes of Health agency designed a multistate study to compare the care given to patients by fee-for-service and managed care providers. The study was carried out in multiple sites within a state and over multiple states. All data processing and analyses were centralized through a nonprofit mental health services organization.

Patients entered the study as fee-for-service patients and were followed for one year prior to being randomized to continue to receive care through a fee-for-service provider or to be switched to a managed care provider. Once randomized, patients were followed for another 24 months. The outcomes of care were tracked through insurance claims data, and the final database contained more than 60,000 patients and 4 million claims.

Analysis of Homogeneity at Baseline

In conducting the statistical analysis for the study, we assumed homogeneity for patient characteristics across treatment conditions because they were strictly randomized. We also assumed homogeneity for patient care within treatment conditions and within and between states because treatment was standardized by government protocols. We investigated these assumptions prior to analyzing the outcomes of the study.

With the enormous number of patients in the database, any strict statistical comparison of baseline demographics and diagnostic data would be considered statistically significant, so we used several data mining techniques. For the purposes of this article, we looked at a subset of data from three states.

To examine the randomization over the three states, we constructed a data web [3] (see Figure 1), which plotted the strongest connections between baseline variables (gender, race and diagnosis) and the randomization factor (fee-for-service or managed care). Prior to examining the web, we assumed the randomization was balanced with respect to these variables.

Figure 1, however, shows more women and nonwhites were randomized to the managed care providers. The fact that males and whites did not show up with strong links to either managed care or fee-for-service indicates balance with respect to those variables. The strong links are summarized in Table 1.
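A web display of this kind is built from co-occurrence counts between category values. The sketch below is one plausible way to score link strength, not the algorithm used by the data mining software: it assumes a link is "strong" when a pair of values occurs more often than independence would predict (a ratio often called lift). The function name, field names and toy records are all hypothetical.

```python
from collections import Counter

def link_strengths(records, field_a, field_b):
    """Return {(a_value, b_value): (count, lift)} for two categorical fields.

    lift = observed count / count expected if the two fields were independent.
    Values well above 1 play the role of the strong links in a web display.
    """
    n = len(records)
    pair = Counter((r[field_a], r[field_b]) for r in records)
    a_tot = Counter(r[field_a] for r in records)
    b_tot = Counter(r[field_b] for r in records)
    return {
        (a, b): (c, c / (a_tot[a] * b_tot[b] / n))
        for (a, b), c in pair.items()
    }

# Hypothetical toy data, not the study's records.
records = (
    [{"gender": "F", "arm": "managed"}] * 70
    + [{"gender": "F", "arm": "ffs"}] * 30
    + [{"gender": "M", "arm": "managed"}] * 500
    + [{"gender": "M", "arm": "ffs"}] * 500
)
links = link_strengths(records, "gender", "arm")
strongest = max(links, key=lambda k: links[k][1])
for (a, b), (count, lift) in sorted(links.items(), key=lambda kv: -kv[1][1]):
    print(f"{a} - {b}: count = {count}, lift = {lift:.2f}")
```

In this toy data, the female-to-managed-care pair has the highest lift, while both male pairs stay close to 1, which mirrors the pattern of strong and absent links described above.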

After concluding the study randomization was biased, we conducted a site-by-site analysis using standard statistical analysis [4] to compare gender and racial balance by site and type of care. Figure 2 illustrates an example of a site analysis, showing a plot of females for three states. Clearly, state three had a significantly unbalanced randomization of females into managed care (p < 0.001).
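A per-state balance check of this kind can be sketched with a Pearson chi-square test on a 2 x 2 table of gender by type of care. This is an assumption about the form of the test; the article does not specify which test was run. The counts are hypothetical, and the test uses one degree of freedom with no continuity correction.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-square with 1 df
    return stat, p

# Hypothetical counts per state:
# (female managed care, female fee-for-service, male managed care, male fee-for-service)
counts_by_state = {
    "state 1": (500, 500, 500, 500),
    "state 2": (510, 490, 495, 505),
    "state 3": (650, 350, 500, 500),
}
tests = {state: chi_square_2x2(*cells) for state, cells in counts_by_state.items()}
for state, (stat, p) in tests.items():
    print(f"{state}: chi-square = {stat:.2f}, p = {p:.3g}")
```

Running the test within each state, rather than on the pooled data, is what lets a single unbalanced state stand out.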

To correct this imbalance, we implemented a model of the data with baseline characteristics as covariates controlling for site differences.

Examination of Site Differences

Prior to analyzing the differences in the types of claims between fee-for-service and managed care, we examined whether there were unusual differences in patterns of service between states that could not be explained by normal variability of service. All sites followed a government developed protocol for service delivery that should have minimized variability.

We first used an analysis of variance approach, the standard statistical methodology for examining site differences. With such a large dataset, however, it showed every comparison between states to be statistically significant. For the purposes of this article, we again looked at a subset of data from three states.

To evaluate the service heterogeneity over states, we first used a classification and regression tree to develop a tree model to differentiate between states. This classification model uses logistic regression to identify the most important variables. These variables were then used to separate the data into branches that best identify differences between states in the use of outpatient therapy.

The branches were defined in the model to be binary splits of the data, and each branch showed the split by the outcome variable (state). While state three had more than 50% of the patients, it also had more than 90% of the claims for outpatient therapy (see Figure 3).
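The idea of a binary split that separates states can be sketched as follows. Note the substitution: the tree in the study is described as using logistic regression to select variables, whereas this sketch uses Gini impurity, the splitting criterion of classic classification trees, on a single numeric variable. The function names and patient rows are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(rows, feature, target):
    """Find the threshold on one numeric feature that minimizes weighted Gini."""
    values = sorted({r[feature] for r in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2.0  # candidate threshold midway between neighbors
        left = [r[target] for r in rows if r[feature] <= cut]
        right = [r[target] for r in rows if r[feature] > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[1]:
            best = (cut, score)
    return best

# Hypothetical patients: outpatient therapy claims per patient, labeled by state.
rows = (
    [{"claims": c, "state": "state 1"} for c in (0, 1, 1, 2)]
    + [{"claims": c, "state": "state 2"} for c in (0, 1, 2, 2)]
    + [{"claims": c, "state": "state 3"} for c in (8, 9, 12, 15, 20, 1)]
)
cut, impurity = best_binary_split(rows, "claims", "state")
print(f"best split: claims <= {cut}, weighted Gini = {impurity:.3f}")
```

In this toy data, the branch above the chosen threshold contains only state three patients, which is the kind of pattern a tree makes visible at a glance.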

After identifying that state three had a disproportionate number of claims for outpatient therapy, we conducted a state-by-state analysis of each service using standard statistical analysis to compare type of care by type of service by state. Figure 4 illustrates an example of the state analysis: a plot of outpatient therapy service claims for three states. State three has significantly more claims in both fee-for-service and managed care, with the most claims in managed care (p < 0.00001).
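A simple way to express this kind of disproportion is to compare each state's share of claims with its share of patients. The sketch below assumes that comparison; the counts are hypothetical and the flagging threshold of 1.5 is an arbitrary illustrative choice, not a value from the study.

```python
def share_ratios(patients, claims):
    """For each state, return (patient share, claim share, claim share / patient share)."""
    total_patients = sum(patients.values())
    total_claims = sum(claims.values())
    out = {}
    for state in patients:
        p_share = patients[state] / total_patients
        c_share = claims[state] / total_claims
        out[state] = (p_share, c_share, c_share / p_share)
    return out

# Hypothetical counts of patients and outpatient therapy claims by state.
patients = {"state 1": 2000, "state 2": 2500, "state 3": 5500}
claims = {"state 1": 300, "state 2": 500, "state 3": 9200}

ratios = share_ratios(patients, claims)
# Flag states whose claim share exceeds their patient share by an arbitrary factor.
flagged = [state for state, (_, _, ratio) in ratios.items() if ratio > 1.5]
for state, (p_share, c_share, ratio) in ratios.items():
    print(f"{state}: patients {p_share:.0%}, claims {c_share:.0%}, ratio {ratio:.2f}")
```

A ratio near 1 means a state files claims in proportion to its patients; a ratio well above 1 marks it for closer inspection.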

Data mining can be the first line of analysis with a large database. Designed to identify unusual patterns, it can be used to find departures from homogeneity or outliers requiring further exploration. After identifying that state three was an outlier in terms of medical claims for outpatient therapy, we no longer had a statistical methodology problem.

The results of our statistical investigation were made available to the study monitoring center and ultimately to the funding agency. Upon further (nonstatistical) investigation, two sites within state three were found to be responsible for the excess outpatient therapy and were submitting duplicate and fraudulent claims. Indeed, statistical analyses to examine nonrandomness can sometimes lead to surprising conclusions.

References


  1. Bert Gunter, “Data Mining: Mother Lode or Fool’s Gold?” Quality Progress, April 1996, p. 113.
  2. Gerry Hahn and Roger Hoerl, “Key Challenges for Statisticians in Business and Industry,” Technometrics, August 1998, p. 195.
  3. All data mining analyses were done in Clementine 9.0 from SPSS Inc.
  4. All statistical analyses were done in SPSS 13.0 for Windows.

Bibliography


  1. Adams, Larry, “Mining the World of Quality Data,” Quality Magazine, August 2001.
  2. Courtheoux, Richard J., “Marketing Data Analysis and Data Quality Management,” Journal of Targeting, Measurement and Analysis for Marketing, June 2003, p. 299.
  3. Jiang, Wei, “A Joint Monitoring Scheme for Automatically Controlled Processes,” IIE Transactions, December 2004, p. 1,201.
  4. Spadola, Tracy, “Steering the Course,” Best’s Review, July 2003, p. 108.
  5. Wells, David, “Quality Assurances,” Enterprise Systems Journal, October 2000, p. 16.
  6. Whiting, Rick, “Trend Spotters: Finding Needles in Haystacks Is Getting Easier,” InformationWeek, Sept. 5, 2005, p. 1,054.

I. ELAINE ALLEN is professor of statistics and entrepreneurship at Babson College in Wellesley, MA. She earned a doctorate in statistics from Cornell University in Ithaca, NY, and is a member of ASQ.

CHRISTOPHER A. SEAMAN is a statistical researcher at Human Services Research Institute in Cambridge, MA.
