Identify and Act
Performing product life data analysis with unidentified subpopulations
by Necip Doganaksoy, Gerald J. Hahn and William Q. Meeker
The National Highway Traffic Safety Administration (NHTSA) in 2015 recalled 28 million vehicles equipped with Takata air bags due to inflators rupturing during deployment.1 It was determined that moisture could penetrate the inflator canister and make the propellant more explosive over time.
Data analysis showed that older air bags and those in regions with high humidity, such as the Gulf Coast, were up to 10 times more likely to rupture. This information about product subpopulations with different vulnerability to air bag malfunction played a key role in defining short-term and long-term remedial strategies.
Similar to this example, there are many situations in which some units are more likely to fail in service than others. Such differences in reliability may be due to variability in manufacturing-process conditions, or in raw materials and components. To take appropriate action to minimize premature failures in such situations, it is important to know which parts of the product population, already built, are most vulnerable.
In addition, identifying vulnerable product subpopulations can help identify the root causes of reliability problems and allow you to use this knowledge to improve future product reliability. Targeted experimentation, coupled with appropriate analysis of resulting and existing data, can help you identify product subpopulations with different levels of vulnerability.
In a previous Statistics Roundtable column,2 we showed how to identify subpopulations with different failure vulnerabilities by using segmentation analysis, a divide-and-conquer strategy that breaks the total product population into meaningful subpopulations so that each can be analyzed separately. See "Segmentation Analysis of Bleed Systems" for a summary of the column.
Segmentation Analysis of Bleed Systems
An earlier Statistics Roundtable column1 dealt with a system that bleeds off air pressure from an aircraft engine to operate a compressor or generator. Lifetime data were available on bleed systems from 2,256 engines in military aircraft operating from various bases. Sidebar Figure 1 shows a single Weibull distribution probability plot for the 19 failures that occurred.
The slope of the plot seems to change at about 600 hours, indicating that a simple Weibull distribution does not provide an adequate representation. Further study revealed that 10 of the 19 failures occurred at base D, one of the bases at which the planes were stationed, even though base D involved only 202 engines.
Moreover, there was physical justification (salty air) for the higher failure rate at base D. Thus, separate Weibull analyses and probability plots were performed for the system lifetimes at base D and for those at all other bases (see Sidebar Figure 2). The data in each of these plots scatter around straight lines, suggesting that separate Weibull distributions provide adequate representations and justifying the segmentation analysis.
—N.D., G.J.H. and W.Q.M.
- Necip Doganaksoy, Gerald J. Hahn and William Q. Meeker, "Divide and Conquer in Reliability Analyses," Quality Progress, November 2009, pp. 46-48.
In segmentation analysis, the definition of subpopulations should be based on physical considerations. This requires an in-depth understanding of the design, manufacture and use conditions of the product—and an identification of the specific subpopulation to which a particular product unit belongs. In the sidebar example, the base at which each aircraft was stationed was used to segment the data.
In some applications, however, relevant product subpopulations cannot be identified on an individual unit basis. As a result, you cannot conduct a segmentation analysis. In this column, we describe a method to analyze life data in such situations.
Description. A manufacturing process for semiconductor devices requires hundreds of steps, often spanning weeks or even months. Problems missed in the early stages of manufacturing can lead to significant quality and reliability issues that surface only at the end of the process, an appreciable time later, by which point large quantities of product may be involved.
In a particular operation, manufacturing staff discovered a faulty valve in an early production stage diffusion furnace. Investigations showed that the valve malfunctioned sporadically over time, potentially compromising dielectric insulation and leading to premature dielectric breakdown (that is, sudden loss of electrical insulation) and device failure of units processed during the affected periods of time.
After it was discovered, the malfunction was addressed vigorously and steps were taken to avoid its recurrence. In the meantime, however, an appreciable quantity of product had been built, some of which was shipped to customers. The rest remained in inventory. It is these untouched, potentially vulnerable devices that were the target of this study.
Preliminary evaluations suggested that—due to the sporadic nature of the malfunction—only an unknown fraction (p) of a defined population of product had been exposed to the faulty valve and was vulnerable. It was not possible, however, to identify individual affected devices. Yet for various reasons, as indicated in the following points, an estimate of the value of p was needed:
- First, a decision was required about the disposition of product already shipped to customers. In particular, if p exceeded 3%, a 100% recall was deemed necessary.
- A second reason pertained mainly to the units still in the producer’s inventory, but also applied to units returned from the field. A burn-in life test at high voltage stress and high temperature was developed to weed out the vulnerable devices without damaging the healthy ones. This test, however, was expensive and would be economically worthwhile (versus scrapping all potentially vulnerable product) only if p was less than 20%.
- In addition, we wanted to determine the appropriate (minimum) time duration of the burn-in test that would weed out at least 98% of the defective units.
To respond to the preceding questions, a sample of 3,600 randomly selected devices from the population of affected devices was subjected to the burn-in test for 50 minutes each. The resulting time-to-failure data are summarized in Table 1.
Initial analysis. Table 1 shows that 188 units failed and the remaining 3,412 units survived the 50-minute test. Therefore, 188/3,600, or 5.2%, is an initial, but crude, estimate of p. This estimate could be too low because it excludes dielectric breakdown failures that would presumably have occurred shortly after 50 minutes. Or it could be too high because it may include early failures unrelated to the valve malfunction (that is, failures of healthy units). Also, it does not address the question concerning the duration of the burn-in test. Therefore, a more sophisticated statistical analysis was needed.
Single distribution Weibull analysis. A Weibull distribution is commonly used as the underlying statistical model in the analysis of data on time to dielectric breakdown. Figure 1 is a Weibull distribution probability plot of the time to dielectric breakdown data. The straight line shown in Figure 1 is the maximum likelihood (ML) fit of a Weibull distribution to the data. The plotted points are for the 188 units that failed during tests. The unfailed units are taken as censored observations in this plot and in the subsequent analysis.3
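The reasoning behind a Weibull probability plot can be sketched in a few lines of Python. For a Weibull distribution with shape β and scale η, F(t) = 1 – exp(–(t/η)^β), so ln(–ln(1 – F(t))) is a straight line in ln(t) with slope β. The function names and parameter values below are illustrative assumptions and are not taken from the study or from any software package.

```python
import math

def weibull_cdf(t, shape, scale):
    """Weibull cumulative distribution function F(t)."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def weibull_plot_coords(t, frac_failing):
    """Map (time, fraction failing) to Weibull probability plot axes."""
    x = math.log(t)
    y = math.log(-math.log(1.0 - frac_failing))
    return x, y

# Hypothetical parameters, for illustration only.
shape, scale = 1.5, 20.0
points = [weibull_plot_coords(t, weibull_cdf(t, shape, scale))
          for t in (1.0, 5.0, 10.0, 50.0)]

# Slope between successive points; constant (equal to shape) for a true Weibull.
slopes = [(y2 - y1) / (x2 - x1)
          for (x1, y1), (x2, y2) in zip(points, points[1:])]
```

With real data, the fraction failing at each failure time would be estimated from the sample, taking the censored units into account. Plotted points that bend away from a single straight line suggest that one Weibull distribution is not adequate.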
The curvature of the plotted observations suggests that a simple Weibull distribution does not provide a good model. This is not surprising because a simple Weibull distribution assumes that all units are subject to a single common failure mode. In the current application, there is strong evidence suggesting that there are two distinct product populations: devices with compromised dielectric insulation due to exposure to the faulty valve and normal (healthy) devices that were not exposed. It was not known prior to the test, however, whether a particular device came from the defective (that is, exposed to the faulty valve) or the healthy population. Therefore, a segmentation analysis could not be performed.
Data and Technical Details
The data set used in this study, along with technical details of fitting the Weibull mixture model in JMP, are included in Online Figure 1 and Online Table 1.
Suppose that the product population comprises two subpopulations, D (defective) and H (healthy) devices, each subject to a different failure-time distribution. Use fD(t) and fH(t) to designate the probability density functions (assumed to be Weibull distributions in this example) for the lifetimes of these subpopulations.
If, for example, it were known that 15% of the population belonged to the defective subpopulation and the remaining 85% belonged to the healthy subpopulation, the probability density function f(t) for product life of the entire (that is, combined) population would be given by the so-called "mixture model" for two subpopulations:
f(t) = 0.15 fD(t) + 0.85 fH(t).
In our application, the mixing proportion is unknown and also must be estimated from the data. Therefore, more generally, the mixture model with two subpopulations can be expressed as:
f(t) = pfD(t) + (1 – p) fH(t),
in which p (0 ≤ p ≤ 1) is the unknown mixing proportion.
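As a minimal sketch, the two-subpopulation mixture can be written directly in Python. The function names are our own, and the Weibull parameter values are hypothetical (only the 15% mixing proportion echoes the example in the text).

```python
import math

def weibull_pdf(t, shape, scale):
    """Weibull density for t > 0."""
    z = (t / scale) ** shape
    return (shape / t) * z * math.exp(-z)

def weibull_cdf(t, shape, scale):
    """Weibull cumulative distribution function."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def mixture_pdf(t, p, shape_d, scale_d, shape_h, scale_h):
    """f(t) = p * fD(t) + (1 - p) * fH(t)."""
    return (p * weibull_pdf(t, shape_d, scale_d)
            + (1.0 - p) * weibull_pdf(t, shape_h, scale_h))

def mixture_cdf(t, p, shape_d, scale_d, shape_h, scale_h):
    """Fraction of the combined population failing by time t."""
    return (p * weibull_cdf(t, shape_d, scale_d)
            + (1.0 - p) * weibull_cdf(t, shape_h, scale_h))

# Hypothetical parameters: defective units fail early, healthy units late.
args = (0.15, 1.0, 10.0, 2.0, 1000.0)
fraction_failed_by_50 = mixture_cdf(50.0, *args)
```

With parameters like these, nearly all failures seen early in life come from subpopulation D, so the combined fraction failing rises quickly and then levels off near p until the healthy units begin to fail.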
Mixture model analysis
In our application, we used a mixture model with a Weibull distribution for each of the two subpopulations. Thus, there were five parameters to be estimated from the data: the mixture proportion p, and the Weibull scale and shape parameters for subpopulations D and H. The ML method was used to estimate these parameters from the data.
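To show what is being maximized, here is a hedged sketch of the log-likelihood for a two-Weibull mixture when unfailed units are censored at a common test time. It assumes exact failure times are recorded (the data in Table 1 may be summarized differently), uses made-up data, and is not the JMP implementation. Each failed unit contributes the mixture density at its failure time, and each unfailed unit contributes the mixture probability of surviving past the censoring time.

```python
import math

def weibull_pdf(t, shape, scale):
    """Weibull density for t > 0."""
    z = (t / scale) ** shape
    return (shape / t) * z * math.exp(-z)

def weibull_survival(t, shape, scale):
    """Probability that a Weibull lifetime exceeds t."""
    return math.exp(-((t / scale) ** shape))

def mixture_log_likelihood(params, failure_times, n_censored, censor_time):
    """Log-likelihood of (p, shape_d, scale_d, shape_h, scale_h)."""
    p, shape_d, scale_d, shape_h, scale_h = params
    loglik = 0.0
    for t in failure_times:
        density = (p * weibull_pdf(t, shape_d, scale_d)
                   + (1.0 - p) * weibull_pdf(t, shape_h, scale_h))
        loglik += math.log(density)
    survival = (p * weibull_survival(censor_time, shape_d, scale_d)
                + (1.0 - p) * weibull_survival(censor_time, shape_h, scale_h))
    loglik += n_censored * math.log(survival)
    return loglik

# Made-up data for illustration only (not the data in Table 1).
failure_times = [0.8, 2.5, 4.1, 9.7, 22.0]
n_censored = 95
censor_time = 50.0
params = (0.05, 0.9, 8.0, 2.0, 5000.0)  # hypothetical trial values
loglik = mixture_log_likelihood(params, failure_times, n_censored, censor_time)
```

ML estimates are the parameter values that make this function as large as possible. In practice the negative log-likelihood would be handed to a numerical optimizer, ideally from several starting points, because mixture likelihoods can have more than one local maximum.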
The probability plot in Figure 2 is identical to that in Figure 1, but now shows the ML fit to the data using the preceding mixture model. This model appears to fit the data reasonably well.
The model parameter estimates are shown in Table 2. The estimated proportion p of the defective subpopulation is 0.045 (with a 95% confidence interval of 0.037 to 0.056). This means that potentially vulnerable devices must be recalled from the field because the lower 97.5% confidence bound on p of 3.7% exceeds the specified 3%.
Moreover, a burn-in of potentially vulnerable devices is also warranted because the upper 97.5% confidence bound on p of 5.6% falls well below 20%. In addition, the 98th percentile of the time-to-failure distribution, fD(t), of the subpopulation of defective units was estimated to be 33.2 minutes, with an upper 97.5% confidence bound of 84.5 minutes, providing the needed information for setting the duration of the burn-in test.
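For a Weibull distribution, a percentile follows directly from the shape and scale parameters. The sketch below shows the calculation. The parameter values are hypothetical placeholders, not the Table 2 estimates, so the result is not expected to reproduce the 33.2 minutes quoted above.

```python
import math

def weibull_quantile(prob, shape, scale):
    """Time by which a fraction `prob` of Weibull lifetimes have failed."""
    return scale * (-math.log(1.0 - prob)) ** (1.0 / shape)

def weibull_cdf(t, shape, scale):
    """Weibull cumulative distribution function."""
    return 1.0 - math.exp(-((t / scale) ** shape))

# Hypothetical parameters for the defective subpopulation D.
shape_d, scale_d = 0.8, 6.0
t98 = weibull_quantile(0.98, shape_d, scale_d)
```

A burn-in at least as long as this percentile would be expected to weed out at least 98% of the defective units, if the fitted distribution for subpopulation D is correct.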
Finally, because we estimate that 4.5% of the total units are defective and the burn-in is intended to remove at least 98% of them, we would estimate the percentage defective in the total population after burn-in to be 0.09%, that is, 0.045 × (1 – 0.98) × 100, with an upper 97.5% confidence bound of 0.112%, that is, 0.056 × (1 – 0.98) × 100.
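The decisions and the post-burn-in arithmetic can be retraced with the figures quoted in the text. The variable names are ours, and the calculation assumes that the burn-in removes exactly 98% of the defective units and fails no healthy ones.

```python
# Estimate and 95% confidence limits for p, as quoted in the text.
p_hat, p_lower, p_upper = 0.045, 0.037, 0.056

recall_threshold = 0.03    # recall if p exceeds 3%
burn_in_threshold = 0.20   # burn-in worthwhile only if p is below 20%
burn_in_removal = 0.98     # assumed fraction of defective units weeded out

# Conservative decisions: compare the relevant confidence bound, not p_hat.
recall_needed = p_lower > recall_threshold
burn_in_worthwhile = p_upper < burn_in_threshold

def residual_defective_pct(p, removal):
    """Percentage of the total population still defective after burn-in."""
    return p * (1.0 - removal) * 100.0

residual_pct = residual_defective_pct(p_hat, burn_in_removal)
residual_pct_upper = residual_defective_pct(p_upper, burn_in_removal)
```

Here both `recall_needed` and `burn_in_worthwhile` evaluate to true, and the two residual percentages come out at roughly 0.09 and 0.112, matching the figures above.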
Software. ML estimation of mixtures of distributions with censored data requires special software. Fortunately, modern software packages increasingly allow you to fit such models. We used the JMP software in our study. Reliasoft Weibull++ also provides this capability.
Mixture models, segmentation analyses and competing failure mode analyses. When data come from two or more populations and when the subpopulation to which individual devices belong can be identified, we recommend using segmentation analysis, as described in "Segmentation Analysis of Bleed Systems," and an earlier column.4
Another model arises when units come from the same population, but there are several causes of failure (several or competing failure modes). The most common assumption in this case (in part, because it makes the analysis simple) is that all units are susceptible to each of these failure modes, acting independently.
The actual (observed) failure time on a particular device is the minimum of these (hypothetical) failure times. When, in addition, the specific failure mode of the failed devices can be identified, we recommend separate analyses for each failure mode (and combining the results), as discussed in an earlier column.5
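Under the independence assumption, the competing failure mode model is easy to express in code. In the sketch below, the two modes and their Weibull parameters are hypothetical. The probability that a unit survives to time t is the product of the survival probabilities for each mode, and a simulated unit fails at the earliest of its hypothetical mode failure times.

```python
import math
import random

def weibull_survival(t, shape, scale):
    """Probability that a Weibull lifetime exceeds t."""
    return math.exp(-((t / scale) ** shape))

def series_survival(t, modes):
    """P(no failure mode has occurred by t), modes acting independently."""
    surv = 1.0
    for shape, scale in modes:
        surv *= weibull_survival(t, shape, scale)
    return surv

def simulate_failure(modes, rng):
    """Return (observed failure time, index of the mode that caused it)."""
    # random.weibullvariate takes (scale, shape), in that order.
    times = [rng.weibullvariate(scale, shape) for shape, scale in modes]
    observed = min(times)
    return observed, times.index(observed)

# Hypothetical (shape, scale) pairs: an early-failure mode and a wearout mode.
modes = [(0.7, 500.0), (3.0, 200.0)]
survival_at_100 = series_survival(100.0, modes)
time, cause = simulate_failure(modes, random.Random(1))
```

When the cause of each failure is recorded, as `cause` is here, each mode can be analyzed separately by treating failures from the other modes as censored observations.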
In this column’s example, neither the subpopulation nor the specific failure mode was known, leading to the mixture model that we have described. We should emphasize, however, that physically meaningful segmentation analysis and individual failure mode analysis are generally more informative than mixture model analysis.
Thus, we urge practitioners to strive to obtain the necessary information to conduct such in-depth analyses and to consider mixture model analysis only as a last resort.
Other models. A less-complicated mixture is presented by the "limited failure population model,"6 which assumes two subpopulations. The first subpopulation, making up an unknown proportion p of the population, consists of units that fail according to some statistical distribution, such as the Weibull. The second subpopulation consists of units that, for all practical purposes, will not fail.
This model typically requires estimation of only three parameters: p and the (presumably two) parameters of the time-to-failure distribution of the first subpopulation. More advanced mixture models can be used to accommodate multiple subpopulations and multiple failure modes.7
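A brief sketch of the limited failure population model, assuming a Weibull distribution for the units that can fail, is shown below. The names and parameter values are hypothetical. Because the second subpopulation never fails, the fraction of the total population failing can never exceed p.

```python
import math

def lfp_cdf(t, p, shape, scale):
    """Fraction of the whole population failing by time t."""
    return p * (1.0 - math.exp(-((t / scale) ** shape)))

def lfp_survival(t, p, shape, scale):
    """Fraction of the whole population still unfailed at time t."""
    return 1.0 - lfp_cdf(t, p, shape, scale)

# Hypothetical values for the three parameters.
p, shape, scale = 0.05, 1.5, 10.0
early = lfp_cdf(5.0, p, shape, scale)
late = lfp_cdf(1e6, p, shape, scale)   # levels off at p
```

This is the two-subpopulation mixture with the healthy units' survival probability fixed at 1, which is why only three parameters remain to be estimated.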
1. Andy Barnett, "Expert Answers," Quality Progress, May 2016, pp. 8-9.
2. Necip Doganaksoy, Gerald J. Hahn and William Q. Meeker, "Divide and Conquer in Reliability Analyses," Quality Progress, November 2009, pp. 46-48.
3. Necip Doganaksoy, Gerald J. Hahn and William Q. Meeker, "Product Life Data Analysis: A Case Study," Quality Progress, June 2000, pp. 115-121.
4. Doganaksoy, Hahn and Meeker, "Divide and Conquer in Reliability Analyses," see reference 2.
5. Necip Doganaksoy, William Q. Meeker and Gerald J. Hahn, "Reliability Analysis by Failure Mode," Quality Progress, June 2002, pp. 47-52.
6. William Q. Meeker and Luis A. Escobar, Statistical Methods for Reliability Data, John Wiley & Sons, 1998, section 11.5.
7. Victor Chan and William Q. Meeker, "A Failure-Time Model for Infant Mortality and Wearout Failure Modes," IEEE Transactions on Reliability, Vol. 48, No. 2, December 1999, pp. 377-387.
Necip Doganaksoy is an associate professor at Siena College School of Business in Loudonville, NY, following a 26-year career in industry, mostly at General Electric (GE). He has a doctorate in administrative and engineering systems from Union College in Schenectady, NY. Doganaksoy is a fellow of ASQ and the American Statistical Association.
Gerald J. Hahn is a retired manager of statistics at the GE Global Research Center in Schenectady. He has a doctorate in statistics and operations research from Rensselaer Polytechnic Institute in Troy, NY. Hahn is a fellow of ASQ and the American Statistical Association.
William Q. Meeker is professor of statistics, and distinguished professor of liberal arts and sciences at Iowa State University in Ames. He has a doctorate in administrative and engineering systems from Union College. Meeker is a fellow of ASQ and the American Statistical Association.