# Identify and Act

## Performing product life data analysis with unidentified subpopulations

by Necip Doganaksoy, Gerald J. Hahn and William Q. Meeker (2019)

In 2015, the National Highway Traffic Safety Administration (NHTSA) recalled 28 million vehicles equipped with Takata air bags due to inflators rupturing during deployment.^{1} It was determined that moisture could penetrate the inflator canister and make the propellant more explosive over time.

Data analysis showed that older air bags and those in regions with high humidity, such as the Gulf Coast, were up to 10 times more likely to rupture. This information about product subpopulations with different air bag malfunction vulnerability played a key role in defining short-term and long-term remedial strategies.

Similar to this example, there are many situations in which some units are more likely to fail in service than others. Such differences in reliability may be due to variability in manufacturing-process conditions, or in raw materials and components. To take appropriate action to minimize premature failures in such situations, it is important to know which parts of the product population, already built, are most vulnerable.

In addition, identifying vulnerable product subpopulations can help identify the root causes of reliability problems and allow you to use this knowledge to improve future product reliability. Targeted experimentation, coupled with appropriate analysis of resulting and existing data, can help you identify product subpopulations with different levels of vulnerability.

In a previous Statistics Roundtable column,^{2} we showed how to identify subpopulations with different failure vulnerabilities by using segmentation analysis, a divide-and-conquer strategy that breaks down the total product population into meaningful subpopulations, allowing separate analyses of each identified subpopulation. See "Segmentation Analysis of Bleed Systems" for a summary of the column.

## Segmentation Analysis of Bleed Systems

An earlier Statistics Roundtable column^{1}
dealt with a system that bleeds off air pressure from an aircraft engine to
operate a compressor or generator. Lifetime data were available on bleed
systems from 2,256 engines in military aircraft operating from various bases.
Sidebar Figure 1 shows a single Weibull distribution
probability plot for the 19 failures that occurred.

The slope of the plot seems to change at about 600 hours, indicating that a simple Weibull distribution does not provide an adequate representation. Further study revealed that 10 of the 19 failures occurred at base D, one of the bases at which the planes were stationed, even though base D involved only 202 engines.

Moreover, there was physical justification (salty air) for the higher failure rate at base D. Thus, separate Weibull analyses and probability plots were performed for the system lifetimes at base D and for those at all other bases (see Sidebar Figure 2). The data in each of these plots scatter around straight lines, suggesting that separate Weibull distributions provide adequate representations and justifying the segmentation analysis.

*—N.D., G.J.H. and W.Q.M.*

**Reference**

1. Necip Doganaksoy, Gerald J. Hahn and William Q. Meeker, "Divide and Conquer in Reliability Analyses," *Quality Progress*, November 2009, pp. 46-48.

### Subpopulation identification

In segmentation analysis, the definition of subpopulations should be based on physical considerations. This requires an in-depth understanding of the design, manufacture and use conditions of the product—and an identification of the specific subpopulation to which a particular product unit belongs. In the sidebar example, the base at which each aircraft was stationed was used to segment the data.

In some applications, however, relevant product subpopulations cannot be identified on an individual unit basis. As a result, you cannot conduct a segmentation analysis. In this column, we describe a method to analyze life data in such situations.

### Semiconductor example

**Description.** A manufacturing process for semiconductor devices requires hundreds of steps, often spanning weeks or even months. Problems missed in the early stages of manufacturing can lead to significant quality and reliability issues involving large quantities of product at the end of the process, possibly an appreciable time later.

In a particular operation, manufacturing staff discovered a faulty valve in an early production stage diffusion furnace. Investigations showed that the valve malfunctioned sporadically over time, potentially compromising dielectric insulation and leading to premature dielectric breakdown (that is, sudden loss of electrical insulation) and device failure of units processed during the affected periods of time.

After it was discovered, the malfunction was addressed vigorously and steps were taken to avoid its recurrence. In the meantime, however, an appreciable quantity of product had been built, some of which was shipped to customers. The rest remained in inventory. It is these untouched, potentially vulnerable devices that were the target of this study.

Preliminary evaluations suggested that, due to the sporadic nature of the malfunction, only an unknown fraction (*p*) of a defined population of product had been exposed to the faulty valve and was vulnerable. It was not possible, however, to identify individual affected devices. Yet for various reasons, as indicated in the following points, an estimate of the value of *p* was needed:

- First, a decision was required about the disposition of product already shipped to customers. In particular, if *p* exceeded 3%, a 100% recall was deemed necessary.
- A second reason pertained mainly to the units still in the producer's inventory, but also applied to units returned from the field. A burn-in life test at high voltage stress and high temperature was developed to weed out the vulnerable devices without damaging the healthy ones. This test, however, was expensive and would be economically worthwhile (versus scrapping all potentially vulnerable product) only if *p* was less than 20%.
- In addition, we wanted to determine the appropriate (minimum) time duration of the burn-in test that would weed out at least 98% of the defective units.

To respond to the preceding questions, a sample of 3,600 randomly selected devices from the population of affected devices was subjected to the burn-in test for 50 minutes each. The resulting time-to-failure data are summarized in Table 1.

**Initial analysis.** Table 1 shows that 188 units failed and the remaining 3,412 units survived the 50-minute test. Therefore, 188/3,600, or 5.2%, is an initial, but crude, estimate of *p*. This estimate could be too low because it excluded dielectric breakdown failures that presumably would occur shortly after 50 minutes. Or the estimate could be too large because it may have included early failures unrelated to the valve malfunction (that is, failures on healthy units). Also, it did not address the question concerning the duration of the burn-in test. Therefore, more sophisticated statistical analysis was needed.
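The crude estimate and its sampling uncertainty can be checked with a few lines of Python. The counts are from Table 1; the normal-approximation confidence interval is our illustrative choice, not part of the original analysis:

```python
# Crude point estimate of the vulnerable fraction p from the 50-minute
# burn-in test: 188 failures out of 3,600 devices (counts from Table 1).
# The normal-approximation interval is an illustrative addition.
import math

failures, n = 188, 3600
p_hat = failures / n                        # crude estimate of p (about 5.2%)
se = math.sqrt(p_hat * (1 - p_hat) / n)     # standard error of a binomial proportion
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"p_hat = {p_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Even this naive interval shows that the uncertainty in the crude estimate straddles values with very different practical consequences, which is one reason a more refined analysis was needed.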

**Single distribution Weibull analysis.** A Weibull distribution is commonly used as the underlying statistical model in the analysis of data on time to dielectric breakdown. Figure 1 is a Weibull distribution probability plot of the time to dielectric breakdown data. The straight line shown in Figure 1 is the maximum likelihood (ML) fit of a Weibull distribution to the data. The plotted points are for the 188 units that failed during the test. The unfailed units are taken as censored observations in this plot and in the subsequent analysis.^{3}
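For readers who want to reproduce this kind of fit outside a package such as JMP, here is a minimal sketch of censored Weibull ML estimation. The lifetimes are simulated for illustration; they are not the study's data, and all parameter choices are our assumptions:

```python
# Sketch: ML fit of a single Weibull distribution to right-censored life
# data, as underlies the straight line in Figure 1. The data below are
# simulated placeholders, not the article's data set.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
times = rng.weibull(1.5, size=500) * 30.0   # simulated lifetimes (shape 1.5, scale 30)
censored = times > 50.0                     # test stopped at 50 minutes
times = np.where(censored, 50.0, times)

def neg_loglik(params):
    """Negative Weibull log-likelihood with right censoring: failures
    contribute the log density; censored units the log survival probability."""
    shape, scale = np.exp(params)           # optimize on the log scale for positivity
    z = times / scale
    log_f = np.log(shape / scale) + (shape - 1.0) * np.log(z) - z**shape
    log_S = -(z**shape)
    return -np.sum(np.where(censored, log_S, log_f))

fit = minimize(neg_loglik, x0=np.log([1.0, np.median(times)]), method="Nelder-Mead")
shape_hat, scale_hat = np.exp(fit.x)
print(f"shape = {shape_hat:.2f}, scale = {scale_hat:.1f}")
```

With 500 simulated units the estimates land close to the generating values, illustrating that censored units still carry real information through their survival probabilities.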

The curvature of the plotted observations suggests that a simple Weibull distribution does not provide a good model. This is not surprising because a simple Weibull distribution assumes that all units are subject to a single common failure mode. In the current application, there is strong evidence suggesting that there are two distinct product populations: devices with compromised dielectric insulation due to exposure to the faulty valve and normal (healthy) devices that were not exposed. It was not known prior to the test, however, whether a particular device came from the defective (that is, exposed to faulty valve) or the healthy population. Therefore, a segmentation analysis could not be performed.

## Data and Technical Details

The data set used in this study, along with technical details of fitting the Weibull mixture model in JMP, are included in Online Figure 1 and Online Table 1.

### Mixture model

Suppose that the product population comprises two subpopulations, D (defective) and H (healthy) devices, each subject to a different failure-time distribution. Use *f*_{D}(*t*) and *f*_{H}(*t*) to designate the probability density functions (assumed to be Weibull distributions in this example) for the lifetimes of these subpopulations.

If, for example, it was known that 15% of the population belonged to the defective subpopulation and the remaining 85% belonged to the healthy subpopulation, the probability density function *f*(*t*) for product life of the entire (that is, combined) population is provided by the so-called "mixture model" for two subpopulations:

*f*(*t*) = 0.15 *f*_{D}(*t*) + 0.85 *f*_{H}(*t*).

In our application, the mixing proportion is unknown and also must be estimated from the data. Therefore, more generally, the mixture model with two subpopulations can be expressed as:

*f*(*t*) = *p* *f*_{D}(*t*) + (1 – *p*) *f*_{H}(*t*),

in which *p* (0 ≤ *p* ≤ 1) is the unknown mixing proportion.
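As a numerical illustration of the mixture formula, the following sketch builds the combined density and cumulative distribution function from two Weibull components. The mixing proportion and Weibull parameters are made-up values, not estimates from the study:

```python
# Illustration of the two-component mixture model
#   f(t) = p*f_D(t) + (1-p)*f_H(t)
# with Weibull components for the defective (D) and healthy (H)
# subpopulations. All parameter values are assumed for illustration.
from scipy.stats import weibull_min

p = 0.15                                   # assumed mixing proportion
f_D = weibull_min(c=0.8, scale=20.0)       # defective: early failures (shape < 1)
f_H = weibull_min(c=2.5, scale=5000.0)     # healthy: long-lived

def mixture_pdf(t):
    """Density of the combined population."""
    return p * f_D.pdf(t) + (1 - p) * f_H.pdf(t)

def mixture_cdf(t):
    """Cumulative failure probability of the combined population."""
    return p * f_D.cdf(t) + (1 - p) * f_H.cdf(t)

print(round(mixture_cdf(50.0), 3))
```

Note that the mixture CDF mixes the component CDFs with the same weights as the densities, so early failure probabilities are driven almost entirely by the defective component.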

### Semiconductor example: mixture model analysis

In our application, we used a mixture
model with a Weibull distribution for the two
subpopulations. Thus, there were five parameters to be estimated from the data:
the mixture proportion *p*, and the Weibull scale
and shape parameters for subpopulations D and H. The ML method was used to
estimate these parameters from the data.

The probability plot in Figure 2 is identical to that in Figure 1, but now shows the ML fit to the data using the preceding mixture model. This model appears to fit the data reasonably well.

The model parameter estimates are shown in Table 2. The estimated proportion *p* of the defective subpopulation is 0.045 (with a 95% confidence interval of 0.037 to 0.056). This meant that a recall of potentially vulnerable devices from the field was necessary because the lower 97.5% confidence bound on *p* of 3.7% exceeded the specified 3%.

Moreover, a burn-in of potentially vulnerable devices also is economically worthwhile because the upper 97.5% confidence bound on *p* of 5.6% falls well below 20%. In addition, the 98th percentile of the time-to-failure distribution, *f*_{D}(*t*), of the subpopulation of defective units was estimated to be 33.2 minutes, with an upper 97.5% confidence bound of 84.5 minutes, providing the needed information for setting the duration of the burn-in test.

Finally, we note that because we estimate 4.5% of the total units to be defective, we would estimate the total population percentage defective after burn-in to be 0.09%, that is, (0.045) x (1 – 0.98) x 100, with an upper 97.5% confidence bound of 0.112%, that is, (0.056) x (1 – 0.98) x 100.
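The burn-in arithmetic can be verified directly. The Weibull quantile formula below uses illustrative shape and scale values, since Table 2 is not reproduced here; the post-burn-in percentages use the article's estimates of *p*:

```python
# Arithmetic behind the burn-in conclusions.
import math

# 98th percentile of a Weibull distribution: t_q = scale * (-ln(1-q))**(1/shape).
# Shape and scale here are illustrative placeholders, not the Table 2 estimates.
shape, scale, q = 0.8, 10.0, 0.98
t_q = scale * (-math.log(1.0 - q)) ** (1.0 / shape)

# Percent of the total population still defective after a burn-in that
# removes 98% of defective units, using the article's estimates of p:
point = 0.045 * (1 - 0.98) * 100
upper = 0.056 * (1 - 0.98) * 100
print(f"t_98 = {t_q:.1f} min; {point:.2f}% (point), {upper:.3f}% (upper bound)")
```

Running this reproduces the 0.09% point estimate and the 0.112% upper bound quoted in the text.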

### Further discussion

**Software.** ML estimation of mixtures of distributions with censored data requires special software. Fortunately, modern software packages increasingly allow you to fit such models. We used the JMP software in our study. ReliaSoft Weibull++ also provides this capability.

**Mixture models, segmentation analyses and competing failure mode analyses.** When data come from two or more populations and when the subpopulation to which individual devices belong can be identified, we recommend using segmentation analysis, as described in "Segmentation Analysis of Bleed Systems" and an earlier column.^{4}

Another model arises when units come from the same population, but there are several causes of failure (several or competing failure modes). The most common assumption in this case (in part, because it makes the analysis simple) is that all units are susceptible to each of these failure modes, acting independently.

The actual
(observed) failure time on a particular device is the minimum of these
(hypothetical) failure times. When, in addition, the specific failure mode of
the failed devices can be identified, we recommend separate analyses for each
failure mode (and combining the results), as discussed in an earlier column.^{5}

In this column’s example, neither the subpopulation nor the specific failure mode was known, leading to the mixture model that we have described. We should emphasize, however, that physically meaningful segmentation analysis and individual failure mode analysis are generally more informative than mixture model analysis.

Thus, we urge practitioners to strive to obtain the necessary information to conduct such in-depth analyses and to consider mixture model analysis only as a last resort.

**Other models.** A less-complicated mixture is the "limited failure population model,"^{6} which assumes two subpopulations. The first subpopulation is made up of an unknown proportion *p* of the population, consisting of units that fail according to some statistical distribution, such as the Weibull. The second subpopulation consists of units that, for all practical purposes, will not fail.

This model typically requires estimation of only three parameters: *p* and the (usually two) parameters of the time-to-failure distribution of the first subpopulation. More advanced mixture models can be used to accommodate multiple subpopulations and multiple failure modes.^{7}
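A minimal sketch of the limited failure population model's survival function, with illustrative parameters: because only the fraction *p* of units can fail, the population survival probability levels off at 1 – *p* rather than decaying to zero.

```python
# Limited failure population (LFP) model: a fraction p of units fails
# according to a Weibull distribution; the rest never fail.
# All parameter values here are illustrative assumptions.
import math

def lfp_surv(t, p=0.1, shape=1.2, scale=100.0):
    """Population survival probability: only the fraction p can fail."""
    F_D = 1.0 - math.exp(-((t / scale) ** shape))   # Weibull CDF of the failing units
    return 1.0 - p * F_D

# Survival levels off at 1 - p for large t instead of going to zero:
print(round(lfp_surv(1e6), 3))   # -> 0.9
```

This plateau is what distinguishes the LFP model from the two-Weibull mixture used in the semiconductor example, in which both subpopulations eventually fail.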

### References

1. Andy Barnett, "Expert Answers," *Quality Progress*, May 2016, pp. 8-9.
2. Necip Doganaksoy, Gerald J. Hahn and William Q. Meeker, "Divide and Conquer in Reliability Analyses," *Quality Progress*, November 2009, pp. 46-48.
3. Necip Doganaksoy, Gerald J. Hahn and William Q. Meeker, "Product Life Data Analysis: A Case Study," *Quality Progress*, June 2000, pp. 115-121.
4. Doganaksoy, Hahn and Meeker, "Divide and Conquer in Reliability Analyses," see reference 2.
5. Necip Doganaksoy, William Q. Meeker and Gerald J. Hahn, "Reliability Analysis by Failure Mode," *Quality Progress*, June 2002, pp. 47-52.
6. William Q. Meeker and Luis A. Escobar, *Statistical Methods for Reliability Data*, John Wiley & Sons, 1998, section 11.5.
7. Victor Chan and William Q. Meeker, "A Failure-Time Model for Infant Mortality and Wearout Failure Modes," *IEEE Transactions on Reliability*, Vol. 48, No. 2, December 1999, pp. 377-387.

**Necip Doganaksoy** is an
associate professor at Siena College School of Business in Loudonville, NY,
following a 26-year career in the industry, mostly at General Electric (GE). He
has a doctorate in administrative and engineering systems from Union College in
Schenectady, NY. Doganaksoy is a fellow of ASQ and
the American Statistical Association.

**Gerald J. Hahn** is a
retired manager of statistics at the GE Global Research Center in Schenectady.
He has a doctorate in statistics and operations research from Rensselaer
Polytechnic Institute in Troy, NY. Hahn is a fellow of ASQ and the American
Statistical Association.

**William Q. Meeker** is professor of statistics, and
distinguished professor of liberal arts and sciences at Iowa State University
in Ames. He has a doctorate in administrative and engineering systems from
Union College. Meeker is a fellow of ASQ and the American Statistical
Association.
