Using Degradation Data
For Product Reliability Analysis
A case study shows how this type of data can yield
more precise results in assessing reliability
by William Q. Meeker, Necip Doganaksoy and Gerald J. Hahn
This is the third installment in a series of articles on reliability improvement and data analysis. The first article, "Reliability Improvement: Issues and Tools," ran in Quality Progress in May 1999. QP published the second installment, "Product Life Data Analysis: A Case Study," in its June 2000 issue.
High reliability systems require individual components to have extremely high reliability for a long time. Often, the time for product development is short, imposing severe constraints on reliability testing.
Traditionally, methods for the analysis of censored failure-time data are used to extrapolate mission reliability from the longest test times--even though there may be few observed failures.1 This significantly limits the accuracy and precision of the conclusions, motivating us to search for better methods.
Many failure mechanisms can be traced to an underlying degradation process. Degradation eventually leads to a reduction in strength or a change in physical state that causes failure. Degradation measurements, when available, often provide more information than failure-time data for assessing and improving product reliability.
This article provides a brief introduction on how one can, in some situations, leverage degradation data for reliability prediction and improvement, and it presents a simple method for analyzing such data. See Statistical Methods for Reliability Data2 and Applied Reliability3 for more details, examples and references.
In some studies (on tire wear, for example), degradation over time is measured directly. In other cases, degradation cannot be observed directly, but measures of product performance degradation (such as power output) are available. Sometimes, degradation is measured continuously. In other applications, measurements become available only at discrete inspection times. In any case, the advantages of using degradation data are considerable (see the sidebar "Advantages of Using Degradation Data" on p. 62).
In some applications, one deals with hard failures, resulting in a complete loss of functionality--when the filament in a light bulb fails, for example. At other times, failures are soft, occurring when a critical performance measurement reaches a predefined level. The component continues to function, but unsatisfactorily. As Wayne Nelson suggests, although the definition of failure is often arbitrary, it should be meaningful.4
Let's consider a study adapted from William Q. Meeker and Luis A. Escobar,5 which involves a gallium arsenide (GaAs) laser for telecommunications systems. As the device ages, more current is required to obtain the required light output. The device has a built-in feedback circuit to maintain constant light output. A unit is defined to have failed at the time at which a 10% current increase is first needed.
Fifteen lasers were run on life test at the accelerated temperature of 80° C ambient for 4,000 hours. By this time, three lasers had failed--at 3,374 hours, 3,521 hours and 3,781 hours--using the preceding failure definition.
The product needed to operate for at least 200,000 hours over 20 years at a temperature of 20° C. From experience, the engineers conservatively estimated that the 80° C test would provide an acceleration factor of approximately 40 in time to failure. To allow for the needed redundancy, an estimate of the probability of failure at 200,000/40 = 5,000 hours (equivalent to over 20 years in operation) was desired.
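The conversion from required field life to equivalent test time under the assumed acceleration factor is simple arithmetic; as a sketch (variable names are illustrative, the figures are from the article):

```python
# Map the required field life at 20 deg C to equivalent hours on the
# accelerated 80 deg C test, assuming the engineers' factor of 40
# applies multiplicatively to time to failure.
field_hours = 200_000   # required operating life over 20 years at 20 deg C
accel_factor = 40       # conservative acceleration estimate for 80 deg C

equivalent_test_hours = field_hours / accel_factor
print(equivalent_test_hours)   # -> 5000.0
```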
All analyses were conducted using the SLIDA collection of S-Plus functions.6 Other software packages are referenced later.
The lognormal distribution was felt, from experience, to be an appropriate time to failure model. Figure 1 is a lognormal probability plot of the data, showing the three failures. The 12 unfailed units at 4,000 hours are shown on the top of the plot. The straight line is the maximum likelihood (ML) estimate of F(t), the probability of failure by time t, using traditional methods (based on estimates of the distribution parameters µ and σ).
This fit takes into account the 12 unfailed units (explaining why the ML line does not seem to fit the plotted points and why maximum likelihood, rather than simple linear regression, was used). Also shown are 95% confidence limits on the fitted line. See our previous article "Product Life Data Analysis: A Case Study"7 for an introductory discussion of these methods.
The estimate of F(5,000) is 0.658 with an approximate 95% confidence interval of 0.126 to 0.962. This extremely wide interval reflects the fact that the analysis is based on a small number of failures.
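The censored-data ML fit just described can be reproduced with standard tools. The sketch below, in Python with NumPy and SciPy (the article's own analyses used the SLIDA S-Plus functions), fits a lognormal distribution by maximum likelihood to the three observed failures and 12 censored units and evaluates F(5,000); variable names are ours:

```python
# Maximum likelihood fit of a lognormal distribution to right-censored
# failure-time data, using the laser test data given in the article.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

failures = np.array([3374.0, 3521.0, 3781.0])   # observed failure times (hours)
n_censored = 12                                  # unfailed units
censor_time = 4000.0                             # end of test (hours)

def neg_log_lik(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                    # keeps sigma positive
    # Density contribution from each observed failure
    z_fail = (np.log(failures) - mu) / sigma
    ll = np.sum(norm.logpdf(z_fail) - np.log(sigma * failures))
    # Survival contribution from each right-censored unit
    z_cens = (np.log(censor_time) - mu) / sigma
    ll += n_censored * norm.logsf(z_cens)
    return -ll

res = minimize(neg_log_lik, x0=[np.log(4000.0), np.log(0.2)], method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# ML estimate of F(5,000), the probability of failure by 5,000 hours
F_5000 = norm.cdf((np.log(5000.0) - mu_hat) / sigma_hat)
print(round(F_5000, 3))
```

The log-sigma parameterization simply keeps the scale parameter positive during the unconstrained optimization.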
Estimation of 5,000-hour failure probability from 4,000-hour degradation data
The preceding analysis did not utilize the power output measurements, except in the go/no-go sense of calling a failure when the current increase exceeded 10%. Figure 2 (p. 60), a plot of the degradation data at 4,000 hours, seems to provide additional useful information. For example, the degradation path of one of the units suggests it was close to failing by 4,000 hours. This added information is leveraged in the following analysis.
The basic approach is to generate pseudofailure times for unfailed units by extrapolating their degradation paths and including these in the analysis. Figure 3 shows simple linear regression lines fitted to the degradation paths of the 12 unfailed devices extrapolated to 5,000 hours.
By this time, the extrapolated degradation for three added units exceeded a 10% increase in current, resulting in pseudofailure times of 4,194 hours, 4,721 hours and 4,995 hours, in addition to the three failures prior to 4,000 hours. The other nine units remained censored.
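The pseudofailure-time calculation is ordinary least squares followed by solving for the threshold crossing. The sketch below uses hypothetical degradation paths (the article's raw measurements are not reproduced here); only the method matches the text:

```python
# Generate pseudofailure times by fitting a straight line to each unit's
# degradation path and solving for the time at which it reaches the
# 10% current-increase threshold. Paths here are hypothetical.
import numpy as np

threshold = 10.0          # % current increase defining failure
censor_time = 5000.0      # extrapolation limit, as in the article

times = np.arange(250.0, 2001.0, 250.0)   # hypothetical inspection times (hours)
paths = [
    0.0030 * times,       # fastest-degrading hypothetical unit
    0.0024 * times,
    0.0017 * times,       # slow degrader; will remain censored
]

pseudofailures, censored = [], 0
for y in paths:
    slope, intercept = np.polyfit(times, y, 1)   # simple linear regression
    t_cross = (threshold - intercept) / slope    # fitted line reaches 10%
    if t_cross <= censor_time:
        pseudofailures.append(int(round(t_cross)))
    else:
        censored += 1                            # still censored at 5,000 hours

print(pseudofailures, censored)
```

The resulting pseudofailure times are then pooled with any observed failures and analyzed as ordinary censored failure-time data.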
Figure 4 is a lognormal probability plot of the six failure times (three observed failures and three pseudofailures), with nine censored units, at 5,000 hours.
The ML estimate of the probability of failing by 5,000 hours (based on a lognormal fit to the data) is F(5,000) = 0.410 with an approximate 95% confidence interval of 0.197 to 0.657. Although this interval is still quite wide, it is much narrower than that from the failure-time analysis. (This interval, however, does not incorporate the added variability due to uncertainty in extrapolating from the degradation data.)
The censoring time of 5,000 hours to generate pseudofailures was chosen to minimize extrapolation. A further analysis, not shown here, allowing all devices to "fail" yielded a similar estimate but with a shorter confidence interval.
Estimation of 5,000-hour failure probability from 2,000-hour degradation data
To illustrate an important advantage of degradation data analysis, let's analyze the data available after only 2,000 hours. This shorter test could allow an earlier release of a reliable product and speedier corrective action on an unreliable one. However, it requires more extrapolation and reliance on the assumed linear degradation model. Since there were no failures at 2,000 hours, standard failure-time analysis is not possible (although an upper confidence bound on the failure probability at 2,000 hours can be obtained using binomial distribution methods).
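That binomial bound is easy to compute: with zero failures among n units, the one-sided 95% (Clopper-Pearson) upper bound on the failure probability reduces to 1 - 0.05^(1/n). A sketch with the article's n = 15:

```python
# One-sided 95% upper confidence bound on the failure probability at the
# censoring time when 0 of 15 units have failed. With zero failures the
# Clopper-Pearson bound has the closed form 1 - alpha**(1/n).
n = 15          # units on test
alpha = 0.05    # one-sided significance level

upper_bound = 1.0 - alpha ** (1.0 / n)
print(round(upper_bound, 3))   # -> 0.181
```

This says only that the 2,000-hour failure probability is below about 18%; it provides no basis at all for extrapolating to 5,000 hours, which is what the degradation analysis supplies.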
Figure 5 (p. 64) shows the 2,000-hour degradation paths extrapolated to 5,000 hours. Seven paths now exceed a 10% current increase, resulting in pseudofailure times of 3,229, 3,514, 3,742, 4,047, 4,282, 4,781 and 4,969 hours. The eight remaining units continue to be censored at 5,000 hours.
Figure 6 (p. 64) is a lognormal probability plot of the pseudofailure times from the 2,000-hour degradation data. The ML estimate of the probability of failing by 5,000 hours is F(5,000) = 0.475 with an approximate 95% confidence interval of 0.248 to 0.712.
The results of this analysis did not differ much from those of the 4,000-hour analysis. The device did not meet reliability requirements (even if one, optimistically, used the lower confidence bound on the failure probability) and required redesign. However, this conclusion was reached 2,000 hours earlier than in the conventional failure-time analysis--an important practical advantage!
Limitations of degradation data analysis
Degradation data analyses need to be conducted cautiously, recognizing the underlying assumptions. In our example, the degradation paths were well-behaved, with little measurement error, allowing pseudofailure times to be reasonably extrapolated. Not all degradation processes are that simple. Other models and/or analysis methods are needed if:
1. The sample paths are not linear (and cannot be transformed to become linear) or cannot be expected to be reasonably linear in extrapolation (what is "reasonably linear" depends on the degree of extrapolation).
2. There is substantial measurement error, causing the pseudofailure times to differ appreciably from the actual (unrealized) failure times.
3. Failures occur suddenly (instantaneous increases in degradation) with little correlation to degradation. Such behavior frequently indicates catastrophic failure due to a mechanism different from the measured degradation, implying that the degradation data provide little information about time to failure.
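Limitation 2 can be demonstrated by simulation. The sketch below (all numbers hypothetical) repeatedly adds measurement noise to a linear degradation path whose true 10% crossing is at 4,000 hours, extrapolates from only the first 2,000 hours of observations, and measures how widely the resulting pseudofailure times scatter:

```python
# Simulation of limitation 2: substantial measurement error plus a short
# observation window makes extrapolated pseudofailure times scatter
# widely around the true crossing time. All numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
threshold = 10.0
true_slope = 0.0025                        # true % increase per hour
true_crossing = threshold / true_slope     # = 4,000 hours

times = np.arange(250.0, 2001.0, 250.0)    # observe only the first 2,000 hours
estimates = []
for _ in range(1000):
    y = true_slope * times + rng.normal(0.0, 0.5, times.size)  # noisy path
    slope, intercept = np.polyfit(times, y, 1)
    estimates.append((threshold - intercept) / slope)

# Width of the central 95% range of extrapolated crossing times
spread = np.percentile(estimates, 97.5) - np.percentile(estimates, 2.5)
print(round(spread))
```

Even though the pseudofailure times center near the true 4,000-hour crossing, their spread spans many hundreds of hours, illustrating why a confidence interval built from pseudofailure times alone understates the total uncertainty.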
In some applications (such as disassembling a motor to measure wear), the degradation measurement itself impacts future degradation or is destructive. This allows only a single degradation measurement, at a strategically selected time, on each unit.
In addition, pseudofailure times are not actual failure times. If the fitted lines do not provide a good extrapolation to the actual (unknown) failure times, the analysis could be badly biased. This danger is especially great when there is much extrapolation.
Some technical comments
Another way of expediting results is to conduct accelerated life tests, based on an appropriate physical model. One can also combine the two approaches--obtaining degradation data from accelerated tests.
Simple degradation analyses can be implemented by standard statistical methods (simple regression to estimate pseudofailure times and maximum likelihood fitting of the resulting censored failure-time data), using statistical packages such as SAS,8 Minitab9 and S-Plus.10 The SLIDA collection of S-Plus functions has built-in functions to facilitate fitting separate regression lines for each unit. Weibull++11 offers automated features to conduct similar analyses.
When degradation is well-behaved and measurement error is small, the simple method described here is often adequate. Meeker and Escobar12 describe a more sophisticated analysis method that accounts for measurement error (without explicitly predicting failure times for unfailed units).
Leveraging well-behaved degradation data
Well-behaved degradation data can provide more precise reliability estimates than times to failure alone and can permit extrapolations without failures. This allows one to draw tentative conclusions earlier--often an important practical advantage. This article has focused on statistical methods for leveraging well-behaved degradation data.
Our ability to perform such analyses makes the mainly nonstatistical task of identifying a suitable degradation measurement--one that is well-behaved, a true precursor of failure and readily obtainable--of paramount importance.
2. William Q. Meeker and Luis A. Escobar, Statistical Methods for Reliability Data (New York: John Wiley & Sons, 1998).
3. Paul A. Tobias and David C. Trindade, Applied Reliability, second edition (New York: Van Nostrand Reinhold, 1995).
4. Wayne Nelson, Accelerated Testing: Statistical Models, Test Plans and Data Analyses (New York: John Wiley & Sons, 1990).
5. Meeker and Escobar, Statistical Methods for Reliability Data (see reference 2).
8. SAS/STAT User's Guide: Release 6.03 Edition (Cary, NC: SAS Institute, 1988).
9. Minitab User's Guide 2: Data Analysis and Quality Tools, Release 12 (State College, PA: Minitab, 1997).
10. S-Plus User's Manual, Version 2000 (Seattle: Statistical Sciences, 1999).
11. Life Data Analysis Reference-Weibull++ (Tucson, AZ: ReliaSoft Publishing, 1997).
12. Meeker and Escobar, Statistical Methods for Reliability Data (see reference 2).
Hahn, Gerald J., Necip Doganaksoy and William Q. Meeker, "Reliability Improvement: Issues and Tools," Quality Progress, May 1999.
WILLIAM Q. MEEKER is professor of statistics and distinguished professor of liberal arts and sciences at Iowa State University, Ames. He obtained a doctorate in administrative and engineering systems from Union College in Schenectady, NY. He is an ASQ member.
NECIP DOGANAKSOY is a statistician at GE Corporate Research and Development in Schenectady, NY. He obtained a doctorate in administrative and engineering systems from Union College in Schenectady, NY. He is an ASQ member.
GERALD J. HAHN is recently retired manager of applied statistics at GE Corporate Research and Development in Schenectady, NY. He holds a doctorate in statistics and operation research from Rensselaer Polytechnic Institute in Troy, NY. He is an ASQ Fellow.
Some advantages of using relevant degradation data in reliability analyses over, or in addition to, traditional failure-time data are:
- More informative analyses--especially when there are few or no failures.
- Useful data often become available much earlier.
- Degradation, or some closely related surrogate, may allow direct modeling of the mechanism causing failure, provide more credible and precise reliability estimates and establish a firm basis for often needed extrapolations in time or stress.
In addition, degradation data may increase physical understanding, and, thereby, enable earlier rectification of reliability issues.