Under the Limit
Statistical methods to treat and analyze nondetectable data
by Robert L. Mason and Jerome P. Keating
Engineers often encounter a wide spectrum of issues pertaining to the data they collect from experiments. A particular area of concern arises when the quantity to be measured falls below the detection limit of the measuring apparatus.
For example, this often occurs in water treatment facilities when a plant operator is required to test for the presence of contaminants such as mercury, lead or arsenic. It also occurs in many other disciplines in which measurements are taken in areas such as air pollution, earthquake detection, radioactive material and soil contamination.
Whenever the amount of a contaminant is so small that it falls below a specified detection limit, the engineer may initially be pleased to know the quantity of the contaminant cannot be detected but later confounded about how to treat the observation using traditional statistical methods.
Such data are usually called nondetects. The nondetects described in the previous paragraphs fall into a category known as left-censored observations because the true value is known only to lie to the left of (below) the detection limit.1
Suppose the detection limit of a test is equal to a positive constant, d, and the engineer knows that the true amount, x, of the contaminant present is less than d. Engineers often treat such data using one of the following four substitution methods:
- Set the value equal to zero: x = 0.
- Set the value equal to the midpoint of the interval between 0 and d: x = d/2.
- Set the value equal to the detection threshold: x = d.
- Delete the value.
Analyzing censored data using the first method will tend to underestimate the true amount of the contaminant present in the population from which the sample was taken. The third method will tend to overestimate the true amount of the contaminant present in the population.
Analyzing censored data using the second method is an attempt to take the middle position between methods one and three. The fourth method simply ignores the undetected values, but the results may seriously overestimate the amount of the contaminant.
Better approaches than these substitution methods are often available for estimating summary statistics using censored data, and each has strengths and weaknesses. A method called regression on order statistics is useful with small data sets (n < 30).
Because all of the nondetects have contaminant levels that are less than those for the rest of the sample (the detects), you can work with the sample sorted from smallest to largest. The ordered observations are known as order statistics. When you evaluate the cumulative distribution function of a continuous population at one of its order statistics, the resulting random variable has a beta distribution.
This relation can be used to develop a probability plot for the comparison of the observed data with the quantiles of a potential probability distribution, such as the normal, lognormal, exponential, extreme value and Weibull.2 A quantile is a value that divides a data set into two groups so a specified proportion of the sample has values that are less than or equal to the value of the quantile.
A useful intermediate statistic for making this comparison is the median rank, which is the median of the beta distribution of the ith order statistic. You can approximate this value using m_i = (i – 0.375)/(n + 0.25), where i is the rank of the observation and n is the sample size, and then create a probability plot by plotting the observed data versus the quantile of the specified underlying distribution that corresponds to m_i.
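A rough way to see how well the approximation (i – 0.375)/(n + 0.25) tracks the median of the underlying beta distribution is to simulate it. The sketch below is an illustration only: it relies on the fact that the cumulative distribution function evaluated at a continuous random variable is uniform on (0, 1), so sorted uniform draws stand in for the order statistics. The function names, the seed and the number of replications are arbitrary choices made here.

```python
import random


def approx_median_rank(i, n):
    """Approximation quoted in the text: (i - 0.375) / (n + 0.25)."""
    return (i - 0.375) / (n + 0.25)


def simulated_median_ranks(n, reps=20000, seed=1):
    """Simulate the median of F(X_(i)) for each rank i = 1..n.

    Sorted uniform(0, 1) draws play the role of the CDF evaluated
    at the order statistics of a continuous population.
    """
    rng = random.Random(seed)
    draws = [[] for _ in range(n)]
    for _ in range(reps):
        sample = sorted(rng.random() for _ in range(n))
        for idx, u in enumerate(sample):
            draws[idx].append(u)
    medians = []
    for column in draws:
        column.sort()
        medians.append(column[reps // 2])  # empirical median for this rank
    return medians


n = 8
simulated = simulated_median_ranks(n)
for i in range(1, n + 1):
    print(i, round(approx_median_rank(i, n), 4), round(simulated[i - 1], 4))
```

The two columns should agree to roughly two decimal places, which is usually close enough for a probability plot.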
If the data follow the proposed distribution, the subsequent plot should produce a collection of points that follow a straight line. The intercept and slope of the straight line then can be used to estimate the mean and standard deviation of the candidate distribution.
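For the normal case, the reason the intercept and slope estimate the mean and standard deviation can be written in one line. The notation here is introduced for illustration: Φ is the standard normal cumulative distribution function, x_(i) is the ith order statistic, and z_i is the standard normal quantile at the median rank m_i.

```latex
F(x) = \Phi\!\left(\frac{x - \mu}{\sigma}\right), \qquad
F\bigl(x_{(i)}\bigr) \approx m_i
\;\Longrightarrow\;
x_{(i)} \approx \mu + \sigma\, z_i, \qquad z_i = \Phi^{-1}(m_i).
```

Plotting x_(i) against z_i therefore gives points scattered around a line with intercept μ and slope σ. For a lognormal candidate, the same relation holds with ln x_(i) in place of x_(i).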
Consider the following data,3 which consist of a sample of n = 8 observations of the concentration of a substance in air measured in parts per billion (ppb). The detection limit of the measuring apparatus is d = 0.8, and the ordered observations are:
< 0.8, < 0.8, < 0.8, < 0.8, < 0.8, 1, 2 and 5.
Notice there are five left-censored observations in this sample. Although you don’t know the exact values of these five nondetects, you do know the exact values of the sixth, seventh and eighth order statistics: 1, 2 and 5. Thus, you can use their corresponding plotting positions of i = 6, 7 and 8 when fitting a distribution to the data.
Table 1 shows the approximate median rank values for these three observations. Also included are the natural log of the measured concentration value and the corresponding quantile of the standard normal distribution. Data of this type are often analyzed on a logarithmic scale because the log values tend to be normally distributed.
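The three columns described for Table 1 can be recomputed from the formula for the median rank. The sketch below uses only the Python standard library; the printed values are calculated here, not copied from the published table, so small rounding differences are possible.

```python
import math
from statistics import NormalDist

n = 8                                # sample size, including the five nondetects
detects = {6: 1.0, 7: 2.0, 8: 5.0}   # rank -> measured concentration (ppb)

rows = []
for i, conc in detects.items():
    m_i = (i - 0.375) / (n + 0.25)   # approximate median rank
    z_i = NormalDist().inv_cdf(m_i)  # standard normal quantile at m_i
    rows.append((i, conc, m_i, math.log(conc), z_i))

print("rank  conc  median_rank  ln_conc  normal_quantile")
for i, conc, m_i, ln_c, z_i in rows:
    print(f"{i:>4}  {conc:>4.1f}  {m_i:>11.4f}  {ln_c:>7.4f}  {z_i:>15.4f}")
```

Note that n = 8 is used in the denominator even though only three values are observed exactly. That is how the nondetects contribute to the plotting positions.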
A probability plot of the log of the concentration versus the normal quantiles is plotted in Figure 1. The natural log of concentration on the y-axis is plotted against the normal quantile on the x-axis. Notice we have not plotted numerical values for the five sample items that were below the detection limit, but we did use their ranks (positions 1 through 5) in obtaining the median ranks for the three largest sample values.
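The straight-line fit behind a plot like this can be sketched with ordinary least squares on the three detected points. This is a minimal illustration under the lognormal assumption, not a reproduction of the authors' calculation: the intercept is read as the estimated mean of the log concentrations and the slope as their estimated standard deviation. With only three points, the estimates are necessarily rough.

```python
import math
from statistics import NormalDist, mean

n = 8                    # total sample size, including nondetects
d = 0.8                  # detection limit (ppb)
ranks = [6, 7, 8]        # ranks of the detected values
log_conc = [math.log(c) for c in (1.0, 2.0, 5.0)]
z = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in ranks]

# Ordinary least squares of ln(concentration) on the normal quantile.
z_bar, y_bar = mean(z), mean(log_conc)
sxy = sum((a - z_bar) * (b - y_bar) for a, b in zip(z, log_conc))
sxx = sum((a - z_bar) ** 2 for a in z)
slope = sxy / sxx                    # estimate of the log-scale std. deviation
intercept = y_bar - slope * z_bar    # estimate of the log-scale mean

# Consistency check: fitted share of the population below the detection limit.
fraction_below = NormalDist(intercept, slope).cdf(math.log(d))

print(f"estimated log-scale mean: {intercept:.3f}")
print(f"estimated log-scale sd:   {slope:.3f}")
print(f"fitted P(X < d):          {fraction_below:.3f}  (observed 5/8 = 0.625)")
```

Comparing the fitted probability of falling below d with the observed fraction of nondetects gives a quick sanity check on the fitted line.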
A simulated data set4 of size 15 was obtained from a normal population with a mean of 1.33 and standard deviation of 0.2. A lower detection limit of d = 1 was adopted for the simulated measurement, denoted as SM, which produced two censored observations. The ordered data are given as:
< 1, < 1, 1.1342, 1.1560, 1.1568, 1.1612, 1.2883, 1.2884, 1.2914, 1.3251, 1.3253, 1.3641, 1.4581, 1.4688 and 1.5638.
Figure 2 contains a probability plot of the SM data, based on the set of 13 detect values using the fourth method, versus the associated normal quantiles.
We also plot the two nondetect SM values three times, once for each of the first three methods. You can see the substituted values for the two nondetect observations are in best agreement with the detect values when the third method is used and the two values are set equal to d = 1. Of course, this choice will vary from case to case.
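To see how much the choice of substitution method moves a summary statistic, you can compute the sample mean of the SM data under each of the four methods. The sketch below uses only the values listed above; the helper name substituted_mean is made up for this illustration.

```python
from statistics import mean

d = 1.0            # detection limit for the simulated measurement (SM)
n_nondetects = 2   # observations reported only as "< 1"
detects = [1.1342, 1.1560, 1.1568, 1.1612, 1.2883, 1.2884, 1.2914,
           1.3251, 1.3253, 1.3641, 1.4581, 1.4688, 1.5638]


def substituted_mean(fill):
    """Sample mean after replacing each nondetect with the value `fill`."""
    return mean(detects + [fill] * n_nondetects)


means = {
    "method 1 (x = 0)": substituted_mean(0.0),
    "method 2 (x = d/2)": substituted_mean(d / 2),
    "method 3 (x = d)": substituted_mean(d),
    "method 4 (delete)": mean(detects),
}

for label, value in means.items():
    print(f"{label:<20} mean = {value:.4f}")
```

The four means increase from method one to method four, which matches the direction of the biases described earlier. How any of them compares with the population mean of 1.33 also depends on ordinary sampling variation in this particular sample of 15.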
Another way in which nondetects affect statistical estimates occurs when sampling two populations: one containing contaminants and one containing no contaminants, or simply background values.
In some cases, the background values are so low as to be undetectable or barely detectable. Thus, the observed population is a mixture of two populations and it has a probability density function defined as a weighted sum of two different probability density functions. In this case, there are two possibilities:
- The mixture may be such that the contaminant levels for both populations are very low and thus quite close to the threshold value. You may be able to observe this phenomenon by estimating the values for the nondetects. But this will not guarantee you will be able to detect a mixture of two populations in all cases.
- The mixture may be such that the mean value from the population with only background values may be much smaller than the mean value from the population of values taken from contaminated sites. To highlight this point, here’s an example in which you can clearly see the presence of two populations.
The following are concentrations of triphenyltin (TPT) in μg/kg dry weight measured in a sediment core.5 The 18 ordered concentrations are:
< 2, < 2, < 2, < 2, < 2, 2, 4, 10, 26, 29, 34,
35, 51, 56, 69, 71, 83 and 107.
Although you cannot determine by a simple inspection of these data that this is a mixture, it should be noted these data are bimodal and represent the type of mixture phenomenon described earlier.
The probability plot in Figure 3 plots the natural log of TPT versus the quantiles of the log-Weibull distribution. If the points follow a straight line on this probability graph, the log-Weibull distribution is a reasonable approximation for the data.
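The coordinates for a plot of this kind can be sketched as follows. Two assumptions are made here that the article does not spell out: the median ranks use the same approximation (i – 0.375)/(n + 0.25) as before, and the log-Weibull quantile is taken to be the standard smallest-extreme-value quantile ln(–ln(1 – p)). The published figure may have used different conventions.

```python
import math

n = 18                                   # total sample size, including nondetects
detects = [2, 4, 10, 26, 29, 34, 35, 51, 56, 69, 71, 83, 107]  # TPT, ug/kg dry weight
first_rank = n - len(detects) + 1        # the five nondetects occupy ranks 1-5

points = []
for offset, conc in enumerate(detects):
    i = first_rank + offset
    m_i = (i - 0.375) / (n + 0.25)       # approximate median rank
    q_i = math.log(-math.log(1.0 - m_i)) # assumed log-Weibull (extreme value) quantile
    points.append((q_i, math.log(conc)))

print("quantile  ln_TPT")
for q_i, ln_tpt in points:
    print(f"{q_i:>8.3f}  {ln_tpt:>6.3f}")
```

Plotting the second column against the first gives the probability plot. A change in slope partway along the points is the visual cue for a mixture.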
The plot clearly depicts two different straight lines: one from sites with little or no contamination (called F1), and one from those with significantly more contamination (called F2).
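In symbols, the weighted sum of densities described earlier can be written as below. The notation is assumed for illustration: f_1 and f_2 denote the densities of the populations labeled F1 and F2, and w is the unknown proportion of observations drawn from F1.

```latex
f(x) = w\, f_1(x) + (1 - w)\, f_2(x), \qquad 0 \le w \le 1 .
```

Each straight-line segment on the probability plot corresponds to the range of values in which one of the two components dominates.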
There are several other statistical procedures for analyzing data with nondetects, including parametric methods, such as maximum likelihood estimation, and nonparametric methods, such as the Kaplan-Meier procedure.6
References and Notes
- For a comprehensive treatment of nondetects, see Dennis R. Helsel, Nondetects and Data Analysis: Statistics for Censored Environmental Data, John Wiley & Sons, 2005.
- Alan Gleit, "Estimation for Small Normal Data Sets With Detection Limits," Environmental Science and Technology, 1985, Vol. 19, pp. 1201-1206.
- Anita Singh and John M. Nocerino, "Robust Estimation of Mean and Variance Using Environmental Data Sets With Below Detection Limit Observations," proceedings from the Conference on Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing, H. Niederreiter and P.J.S. Shiue, eds., Lecture Notes in Statistics, Vol. 106, Springer, 1995.
- K. Fent and J. Hunn, "Phenyltins in Water, Sediment and Biota of Freshwater Marinas," Environmental Science and Technology, 1991, Vol. 25, pp. 956-963.
- Helsel, Nondetects and Data Analysis: Statistics for Censored Environmental Data, see reference 1.
Robert L. Mason is an institute analyst at Southwest Research Institute in San Antonio. He received a doctorate in statistics from Southern Methodist University in Dallas and is a fellow of ASQ and the American Statistical Association.
Jerome P. Keating is a professor of statistics at the University of Texas at San Antonio. He received a doctorate in mathematics from the University of Texas at Arlington and is a fellow of the American Statistical Association.