When are there really differences in overlapping confidence intervals?
by Connie M. Borror
While teaching a workshop for a small manufacturing firm, two of the participants approached me to discuss what seemed to be a simple problem they had encountered at work. Recently, management noticed a decrease in the number of products coming off the two assembly lines in their manufacturing plant.
Specific steps in the product’s assembly were done by hand. The company was interested in determining whether the perceived decrease in production was real. Several studies had been planned. However, management first wanted to determine whether there was a significant difference between the two assembly lines with respect to average time to complete the task.
Management randomly selected 20 people from assembly line 1 (AL1) and 20 people from assembly line 2 (AL2) to participate in a designed study in which workers completed a particular task. The time to complete the task was recorded in seconds.
Together, the employees carried out the experiment using all the usual recommendations, such as randomization and controlling factors that were not of interest in the study. Having some understanding of statistics, the participants realized the groups of interest were independent and wanted to test the hypothesis that mean1 = mean2 or mean1 - mean2 = 0, in which mean1 was the true average time to complete the task for all AL1 workers, and mean2 was the true average time to complete the task for all AL2 workers.
They really wanted to know whether there was a significant difference in the average time to complete the task by the two groups (that is, mean1 ≠ mean2 or mean1 – mean2 ≠ 0). The experiment was carried out, and results were collected. Table 1 shows the summary statistics. They decided to analyze the results separately but agreed to use confidence intervals with a 95% level of confidence to reach their conclusions.
They came to me with their results and the problem: Using 95% confidence intervals and the same data, they reached two different conclusions. One reported there was no difference between the two groups, while the second reported there was a difference.
After looking at both sets of results, it was obvious they carried out their individual analyses carefully and without error. So what went wrong? How could two different conclusions be reached using the same information?
First, let’s examine what they did in more detail. Both employees assumed time to complete a task for the assembly lines to be normally distributed, and they did not assume anything about the population variances (what they were or whether they were equal). In addition, both employees constructed 95% two-sided confidence intervals on the individual population means, mean1 and mean2:
18.69 ≤ mean1 ≤ 23.17 and
15.15 ≤ mean2 ≤ 19.76.
Figure 1 shows the individual 95% confidence intervals.
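As a sketch of the computation behind intervals like these, the Python snippet below builds each 95% interval as x̄ ± t(s/√n). The summary statistics are illustrative stand-ins for Table 1, chosen so the endpoints land near the reported intervals; they are assumptions, not the article's actual data.

```python
import math

# Illustrative summary statistics standing in for Table 1; these are
# assumptions chosen to land near the article's reported intervals.
n1, xbar1, s1 = 20, 20.93, 4.790   # assembly line 1 (AL1)
n2, xbar2, s2 = 20, 17.455, 4.925  # assembly line 2 (AL2)

t_crit = 2.093  # t_{0.025, 19} from a Student's t table

def ci(xbar, s, n, t):
    """Two-sided confidence interval on a single population mean."""
    half = t * s / math.sqrt(n)
    return xbar - half, xbar + half

lo1, hi1 = ci(xbar1, s1, n1, t_crit)
lo2, hi2 = ci(xbar2, s2, n2, t_crit)
print(f"AL1: {lo1:.2f} <= mean1 <= {hi1:.2f}")
print(f"AL2: {lo2:.2f} <= mean2 <= {hi2:.2f}")
print("intervals overlap:", lo1 <= hi2 and lo2 <= hi1)
```

Note that the two intervals overlap, which is exactly the situation that led the two employees to different conclusions.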
That’s where the similarity between the two analyses ended. Examining the confidence intervals and realizing they overlapped, the first employee concluded there was no statistically significant difference in the average time to complete the task for the two groups. In fact, he noted that because the confidence interval for mean2 overlapped roughly 24% of the interval on mean1, there was even stronger evidence supporting his conclusion.
The second employee, however, recalled seeing a method for constructing a confidence interval on the difference in two population means and used it to obtain a single 95% two-sided confidence interval
0.37 ≤ mean1 – mean2 ≤ 6.59,
and concluded because 0 was not contained in this interval (although it was just barely outside), there was a statistically significant difference between the two groups. At this point, they were not sure who was correct. One of them also had performed a two-sided hypothesis test and found a p-value = 0.030. Still, they were at a loss. Let’s examine the two approaches they used.
The first participant used a two-interval method—examining the two confidence intervals on the individual means and seeing whether they overlapped. Because both populations were assumed to be normally distributed, the sample sizes were fairly small (n1 = n2 = 20), and nothing was known about the population variances, the intervals were based on Student's t-distribution. The 100(1 – α)% confidence intervals on the two population means, mean1 and mean2, were (using notation from Table 1):

x̄1 – t1(s1/√n1) ≤ mean1 ≤ x̄1 + t1(s1/√n1), (Equation 1)

x̄2 – t2(s2/√n2) ≤ mean2 ≤ x̄2 + t2(s2/√n2), (Equation 2)

in which x̄i, si and ni are the sample mean, sample standard deviation and sample size for group i, and the values of t1 and t2 are found using Student's t-distribution with n1 – 1 and n2 – 1 degrees of freedom, respectively. Often, the interpretation of the intervals is either:
- If Equations 1 and 2 do not overlap, there is a statistically significant difference between the two populations.
- If Equations 1 and 2 overlap, there is no statistically significant difference between the two population means.
The first interpretation is always true.1 However, the second interpretation is not entirely correct. In fact, if the two confidence intervals overlap, a statistically significant difference may or may not exist between the two population means.
The second employee used a 100(1 – α)% two-sided confidence interval on the difference between two population means, mean1 – mean2, for independent samples:

(x̄1 – x̄2) – t*√(s1²/n1 + s2²/n2) ≤ mean1 – mean2 ≤ (x̄1 – x̄2) + t*√(s1²/n1 + s2²/n2). (Equation 3)

Again, t* is found using Student's t-distribution, with degrees of freedom approximated from the sample variances because the population variances were not assumed equal. Refer to this as the single-interval method. For the second employee's analysis, Equation 3 was used to construct the 95% confidence interval (0.37 ≤ mean1 – mean2 ≤ 6.59) from earlier.
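A sketch of the single-interval calculation follows, using the same illustrative summary statistics as stand-ins for Table 1 (so the resulting interval only approximately matches the 0.37 ≤ mean1 – mean2 ≤ 6.59 reported above). The degrees of freedom come from the Welch-Satterthwaite approximation, the usual choice when the population variances are not assumed equal.

```python
import math

# Illustrative summary statistics (assumed stand-ins for Table 1,
# chosen to land near the article's reported intervals).
n1, xbar1, s1 = 20, 20.93, 4.790
n2, xbar2, s2 = 20, 17.455, 4.925

# Standard error of the difference in sample means.
se = math.sqrt(s1**2 / n1 + s2**2 / n2)

# Welch-Satterthwaite degrees of freedom (variances not assumed equal).
v1, v2 = s1**2 / n1, s2**2 / n2
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

t_star = 2.024  # t_{0.025, 38} from a t table; df above rounds to 38

diff = xbar1 - xbar2
lo, hi = diff - t_star * se, diff + t_star * se
print(f"Welch df ~ {df:.1f}")
print(f"95% CI on mean1 - mean2: {lo:.2f} to {hi:.2f}")
print("0 inside interval:", lo <= 0 <= hi)
```

Because 0 falls outside the interval, this rule rejects mean1 – mean2 = 0 even though the two individual intervals overlap.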
So what about my participants’ problem? I explained that the conclusion from the two-interval method was incorrect. The appropriate method for their problem was the single-interval method for independent samples, constructing a confidence interval on the difference in the two population means (Equation 3).
This was the method used by the second participant, who found the 95% confidence interval did not contain the value 0 and concluded there was a statistically significant difference between the two groups with respect to average time to complete the task.
Recommendations and cautions
If the two-interval method can sometimes lead to the wrong conclusion, can it be useful at all? The answer is yes, but with caution. When comparing two confidence intervals, be mindful of:
Decision making. If the two individual confidence intervals do not overlap—leading to the rejection of the claim mean1 – mean2 = 0—the single-interval method will also lead to rejection of this claim. If the two individual confidence intervals do overlap, then the single-interval method may lead to rejection of the claim mean1 – mean2 = 0. More information is needed.
Power. The two-interval method fails to reject a false null hypothesis more often than the single-interval method.2-4 As a result, the two-interval method is less powerful than the single-interval method.
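The power gap is easy to see in a small simulation. The sketch below uses assumed population values (not the article's data): it draws many pairs of samples from populations whose means truly differ and counts how often each rule rejects mean1 – mean2 = 0.

```python
import math
import random

random.seed(1)

# Simulation settings: a true difference exists (mu1 != mu2).
# All numbers here are illustrative assumptions, not the article's data.
n, mu1, mu2, sigma, reps = 20, 21.0, 17.5, 4.9, 2000
t_ind = 2.093   # t_{0.025, 19}: for the individual 95% CIs
t_diff = 2.024  # t_{0.025, 38}: for the CI on the difference

def summary(data):
    """Return the sample mean and sample standard deviation."""
    m = sum(data) / len(data)
    s2 = sum((x - m) ** 2 for x in data) / (len(data) - 1)
    return m, math.sqrt(s2)

reject_two = reject_single = 0
for _ in range(reps):
    x = [random.gauss(mu1, sigma) for _ in range(n)]
    y = [random.gauss(mu2, sigma) for _ in range(n)]
    m1, s1 = summary(x)
    m2, s2 = summary(y)
    # Two-interval rule: reject only if the individual CIs do not overlap.
    h1, h2 = t_ind * s1 / math.sqrt(n), t_ind * s2 / math.sqrt(n)
    if m1 - h1 > m2 + h2 or m2 - h2 > m1 + h1:
        reject_two += 1
    # Single-interval rule: reject if the CI on mean1 - mean2 excludes 0.
    se = math.sqrt(s1**2 / n + s2**2 / n)
    if abs(m1 - m2) > t_diff * se:
        reject_single += 1

print(f"power, two-interval rule:    {reject_two / reps:.2f}")
print(f"power, single-interval rule: {reject_single / reps:.2f}")
```

The two-interval rule detects the true difference far less often because non-overlap demands a gap wider than the sum of the two half-widths, while the single-interval rule only needs the difference to exceed one combined half-width.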
Statistical significance. Whether using hypothesis tests or confidence intervals, statistical significance does not imply practical significance.
Paired data. If two groups you’re comparing are dependent, the two-interval method is inappropriate. A single-interval method for paired data should be used.
Are those confidence intervals? There are many different types of intervals, such as confidence, tolerance and prediction intervals.5 It may not always be clear from a graph or discussion what an interval represents. Standard error bars, for example, look similar to confidence intervals, but they are typically intervals such as x̄ ± s/√n (the sample mean plus or minus one standard error). If not clearly stated or understood, these intervals can be misinterpreted as confidence intervals.
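To see how easily the two can be confused, the short sketch below (again with assumed summary statistics) compares the half-width of a ±1 standard error bar with that of a 95% confidence interval on the same mean.

```python
import math

# Illustrative summary statistics (assumed, not from the article).
xbar, s, n = 20.93, 4.79, 20
t_crit = 2.093  # t_{0.025, 19} from a Student's t table

se_half = s / math.sqrt(n)           # +/- one standard error bar
ci_half = t_crit * s / math.sqrt(n)  # 95% CI half-width

print(f"standard error bar half-width: {se_half:.2f}")
print(f"95% CI half-width:             {ci_half:.2f}")
print(f"the CI is {ci_half / se_half:.2f} times wider")
```

With n = 20, the 95% interval is more than twice as wide as a ±1 standard error bar, so reading error bars as confidence intervals badly overstates the precision of the estimate.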
If confidence intervals on individual parameters do not overlap, we know for sure a statistically significant difference exists. It’s when the confidence intervals do overlap that the conclusions are unclear. We must rely on additional exploratory analysis to determine statistical significance and expert knowledge to determine practical significance.
- Donald J. Barr, "Using Confidence Intervals to Test Hypotheses," Journal of Quality Technology, Vol. 1, No. 4, October 1969, pp. 256-258.
- Lloyd S. Nelson, "Evaluating Overlapping Confidence Intervals," Journal of Quality Technology, Vol. 21, No. 2, April 1989, pp. 140-141.
- Nathaniel Schenker and Jane Gentleman, "On Judging the Significance of Differences by Examining the Overlap Between Confidence Intervals," The American Statistician, Vol. 55, No. 3, April 2001, pp. 182-186.
- Christine M. Anderson-Cook, "Interval Training," Quality Progress, October 2009, pp. 58-60.
Connie M. Borror is a professor in the division of mathematical and natural sciences at Arizona State University West in Glendale. She earned her doctorate in industrial engineering from Arizona State University in Tempe. She is a fellow of ASQ and the American Statistical Association. Borror is also editor of Quality Engineering.