 ## 2020

STATISTICS SPOTLIGHT

# Fallacies of Statistical Significance

## The case for focusing your data analyses beyond significance tests

by Necip Doganaksoy, Gerald J. Hahn and William Q. Meeker

Statistical significance (or hypothesis) tests, and the related concept of p-values, are popular tools in statistical data analysis. Unfortunately, the practical implications of statistical significance often turn out to be limited and are frequently misinterpreted and overstated, or even stated incorrectly.

This column—in the context of product reliability data analysis—will address the important distinction between statistical significance and practical significance, and urge practitioners to carry their analyses beyond significance tests—often by constructing appropriate confidence intervals.

### Statistical significance and p-values in reliability data analysis

In applications dealing with product reliability and life data analysis, significance tests and p-values indicate whether the given data are consistent with a particular statistical hypothesis. Some typical statistical hypotheses are:

• "These two products have the same mean life."
• "The median of the product life distribution exceeds five years."
• "Ten-year reliability of this product is 80%."

For further elaboration, see the sidebar "More on Significance Testing and p-values."

## More on Significance Testing and p-values

Statistical significance tests are designed so there is a small probability (such as 1% or 5%) of incorrectly rejecting a so-called "null hypothesis"—illustrated by the three statements in the text—when, in actuality, this hypothesis is true. This probability is known as the "significance level" or Type I error probability, and is frequently denoted by α.

A slight generalization of the standard significance test is—instead of specifying a significance level—to calculate a so-called "p-value" associated with the null hypothesis, based on the data.

For example, for the null hypothesis "These two products have the same mean life," the p-value calculated from the data provides the probability of obtaining, by chance alone, a difference between the two sample means that is as large as or larger than that observed, if, indeed, the two product populations have identical means.

If the p-value is small—for example, less than 0.05—you typically reject the null hypothesis. In this example, this rejection would result in the conclusion that there is a statistically significant difference between the two population means.

On the other hand, if the p-value is not small, you would conclude that the data do not provide sufficient evidence to reject the null hypothesis. In this example, this would mean concluding that the data do not provide sufficient evidence to conclude that the two sampled populations have different means.

However, a p-value of 0.06 comes closer to providing such evidence than a p-value of 0.60, even though neither allows you to formally reject the null hypothesis at a 5% significance level.

All of this, of course, is far different from proving that the two population means are the same.
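
As a rough illustration of the decision rule described in this sidebar, the following R sketch computes a two-sided p-value for a one-sample t-test by hand. The summary statistics (`n`, `xbar`, `s`) are hypothetical values chosen for illustration and do not come from the article.

```r
# Hypothetical summary statistics on the log scale (not from the article)
n    <- 25       # sample size
xbar <- 3.50     # sample mean of log tensile strengths
s    <- 0.25     # sample standard deviation of log tensile strengths
mu0  <- log(35)  # null-hypothesis value for the mean log tensile strength

# One-sample t statistic and its two-sided p-value
t_stat  <- (xbar - mu0) / (s / sqrt(n))
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# Reject the null hypothesis only if the p-value falls below alpha
alpha  <- 0.05
reject <- p_value < alpha

c(t_stat = t_stat, p_value = p_value)
reject
```

With these made-up numbers the p-value is well above 0.05, so the null hypothesis is not rejected. That is not the same as showing that the null hypothesis is true.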

—N.D., G.J.H. and W.Q.M.

### Key limitation of significance tests and p-values

As we will demonstrate with two examples from product reliability data analysis, a key limitation of the concept of statistical significance (and p-values) is that statistical significance may frequently differ from practical significance as defined by the problem context. Specifically:

• If the sample size for the given data is large, a statistically significant result is likely to be obtained even when the actual magnitude of the effect is relatively small and of little or no practical importance.
• When the sample size for the given data is small, it is quite likely that insufficient evidence for a statistically significant result will be obtained—even though there is, indeed, an effect of practical importance.

Essentially, practical significance depends on real-world considerations, dealing with such matters as customer satisfaction, management expectations and cost. Statistical significance and p-values, on the other hand, are heavily affected not only by the magnitude of the real-world effects (as they should be), but also the size of the sample upon which the analysis is based—typically an extraneous factor in considering practical significance.
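
The dependence on sample size can be made concrete with a short R sketch. For a fixed true shift, the noncentrality parameter of a one-sample t-test (roughly, the expected size of the t statistic) grows with the square root of the sample size. The shift used here, a median of 34 MPa versus 35 MPa with a log-scale standard deviation of 0.25, is taken from the glass strength example later in this column; the sample sizes are arbitrary.

```r
# Fixed true shift in the mean log tensile strength (median 35 MPa vs. 34 MPa)
delta <- log(35) - log(34)
sigma <- 0.25                 # standard deviation of log tensile strength

# Arbitrary illustrative sample sizes
n <- c(10, 100, 1000, 10000)

# Noncentrality parameter: grows like sqrt(n) although delta never changes
ncp <- delta * sqrt(n) / sigma

data.frame(n = n, noncentrality = round(ncp, 2))
```

The effect itself is identical in every row. Only the amount of data changes, and with it the chance of declaring statistical significance.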

### Glass strength example

The following example deals with tensile strength testing of glass specimens. This test characterizes the maximum stress that the specimens can withstand under a predefined load before breaking. Glass strength applications lend themselves especially well to this discussion because this business presents frequent situations with large and small data sets.

A glass manufacturer has extensive experience with a current manufacturing process. In particular, it has been established that the tensile strength distribution of specimens built in an existing furnace is stable over time and follows a lognormal distribution (as is typical in such applications) with a median value of 35 megapascals (MPa) and a shape parameter (standard deviation of log tensile strength) of 0.25. This distribution is shown by the red dashed curves in Figures 1 and 2. The manufacturer plans to replace the old furnace with a new furnace. Before making this important process change, however, it wants to compare the tensile strength distributions of the two furnaces and, in particular, the medians of the two distributions.
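
For readers who want to explore the old-furnace distribution numerically, here is a minimal R sketch using the built-in lognormal functions. The median and shape parameter are those stated above; the chosen percentiles and the 25 MPa threshold are hypothetical values for illustration.

```r
# Old furnace: lognormal with median 35 MPa and shape parameter 0.25
meanlog_old <- log(35)   # mean of log tensile strength
sdlog_old   <- 0.25      # standard deviation of log tensile strength

# Selected percentiles in MPa; the 50th percentile is the median
percentiles <- qlnorm(c(0.01, 0.10, 0.50, 0.90),
                      meanlog = meanlog_old, sdlog = sdlog_old)
round(percentiles, 1)

# Fraction of specimens expected to break below a hypothetical 25 MPa threshold
frac_below_25 <- plnorm(25, meanlog = meanlog_old, sdlog = sdlog_old)
frac_below_25
```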

With this in mind, a random sample of specimens built using the new furnace is to be tested to compare, among other things, the median of the estimated distribution of tensile strength for the new furnace with the assumed known median tensile strength of 35 MPa for the old furnace.

You can argue that in this example, like many other similar applications, there is no reason for product from two different furnaces to have identical tensile strength distributions or, for that matter, identical distribution medians. It was felt, however, that the difference between furnaces might not be sufficiently large to be considered practically important.

We will now describe two scenarios, resulting in:

• Statistical significance when—as a consequence of a large sample size—the effect under consideration is, in fact, without practical significance.
• Failure to achieve statistical significance—as a consequence of a small sample size—when, in fact, the effect is of practical significance.

### Scenario one: Statistical significance without practical significance

Assumed tensile strength distribution for new furnace. Under this scenario, assume that the tensile strength distribution for specimens from the new furnace, like that for the old furnace, is lognormal with a shape parameter of 0.25. However, now assume that the median of this distribution has shifted (slightly down) from 35 MPa to 34 MPa.

This distribution is shown by the blue solid curve in Figure 1. The small deterioration in tensile strength for the new furnace, as compared with the old furnace, can be readily compensated for by taking other measures and—if it were known—would not be regarded as sufficiently large to disqualify the new furnace.

Assumed tensile strength data. The preceding information about the new furnace, unlike that for the old furnace, is unknown to the manufacturer. All that is available is tensile strength data from a sample of 1,000 specimens, randomly selected from the new furnace. For purposes of illustration, such a sample is created by randomly selecting 1,000 observations via computer simulation from the assumed tensile strength distribution for the new furnace, shown in Figure 1.

Statistical significance analysis. We conducted a statistical comparison of the median tensile strength for the new furnace, as estimated from the 1,000 (simulated) specimens from this furnace, versus the known median tensile strength of 35 MPa for the old furnace. In particular, the null hypothesis for the significance test stated that the median tensile strength for the new furnace is also 35 MPa.

This analysis resulted in evidence of a statistically significant difference between the two distribution medians at a 5% significance level (that is, rejection of the stated null hypothesis) and a p-value of 0.0015. This finding contradicts the fact that, unknown to the manufacturer, the true tensile strength distribution for the new furnace does not differ from the distribution for the old furnace to a degree that is of practical significance.

Is the preceding result typical? You might think that the sample results obtained in the preceding study were a fluke, that is, an idiosyncrasy of the data generated by our simulation.

However, further analysis (see the online sidebar, "Technical Details") showed that, given the underlying tensile strength distribution for the new furnace shown in Figure 1, the probability of getting a p-value of 0.05 or less (and declaring a statistically significant difference) in comparing the estimated median from a random sample of 1,000 specimens from the new furnace with the known median of 35 MPa for the old furnace, is 0.956 (obtained from Table 1). Moreover, in randomly selecting 1,000 specimens from the new furnace and comparing the median of this sample with the known median of 35 MPa for the old furnace, the probability of getting a p-value of 0.0015 or less is 0.69. Thus, these results are typical of what you might expect in such simulations.
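
The two probabilities quoted above can be approximated with `power.t.test()`, the same base R function used in the "Technical Details" sidebar. The sketch below is one way to do this, not necessarily the authors' exact calculation. It uses `log(35)` and `log(34)` directly rather than the sidebar's hard-coded constants, which agree with them to roughly six significant figures.

```r
# True shift in mean log tensile strength under scenario one
delta <- log(35) - log(34)

# Probability that a sample of 1,000 yields a p-value of 0.05 or less
prob_p_le_05 <- power.t.test(n = 1000, delta = delta, sd = 0.25,
                             sig.level = 0.05, type = "one.sample",
                             alternative = "two.sided")$power

# Probability that a sample of 1,000 yields a p-value of 0.0015 or less
prob_p_le_0015 <- power.t.test(n = 1000, delta = delta, sd = 0.25,
                               sig.level = 0.0015, type = "one.sample",
                               alternative = "two.sided")$power

# Compare with the 0.956 and 0.69 reported in the text
c(prob_p_le_05 = prob_p_le_05, prob_p_le_0015 = prob_p_le_0015)
```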

More on impact of sample size. Table 1 shows the probability of establishing a statistically significant difference between the medians of the tensile strength distributions for the two furnaces for different sample sizes for the new furnace and different significance levels, assuming the tensile strength distributions for the two furnaces, shown in Figure 1.

In particular—despite the small true differences between the medians of the tensile strength distributions for the two furnaces shown in Figure 1—evidence at a 5% significance level of a statistically significant difference in medians seems likely for sample sizes of 500 specimens or more for the new furnace and seems almost assured for samples of 1,500 or more.

On the other hand, for much smaller sample sizes from the new furnace, such as 100 specimens, it is unlikely that the analysis of the data will lead to a statistically significant difference.
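
A Monte Carlo simulation offers an independent check on probabilities of this kind. The following sketch repeatedly draws samples of 1,000 log tensile strengths from the scenario one distribution and records how often the t-test rejects at the 5% level. The seed and the number of replications are arbitrary choices, so the result will only approximate the analytical value.

```r
set.seed(1)      # arbitrary seed, chosen only for reproducibility
n_rep <- 2000    # number of simulated studies (arbitrary)

# p-value from each simulated study of 1,000 specimens from the new furnace
p_values <- replicate(n_rep, {
  x <- rnorm(1000, mean = log(34), sd = 0.25)
  t.test(x, mu = log(35))$p.value
})

# Fraction of simulated studies declaring statistical significance at 5%
reject_rate <- mean(p_values <= 0.05)
reject_rate
```

The rejection rate should land close to the 0.956 cited in the text for a sample of 1,000, up to simulation noise.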

## Technical Details

This sidebar provides technical details about the statistical analyses (construction of statistical significance tests and calculation of confidence intervals) for the two scenarios and the generation of Online Tables 1 and 2.

### Scenario one: Statistical significance without practical significance

Simulation of sample for new furnace: We assumed that the tensile strength distribution for specimens from the new furnace is lognormal with a shape parameter of 0.25 and a median of 34 MPa. In other words, the logarithms of the tensile strengths are normally distributed with mean μ = 3.5264 and standard deviation σ = 0.25. Note that the median of the lognormal distribution for tensile strengths is exp(μ) = exp(3.5264) = 34 MPa.

We used the following R commands to generate a sample of 1,000 log tensile strengths from the preceding distribution:

set.seed(123)
n <- 1000
mean <- 3.526359623; sd <- 0.25
glassDataLarge <- rnorm(n, mean=mean, sd = sd)
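
The simulated values above are on the log scale. If tensile strengths in MPa are wanted, they can be recovered by exponentiating, as in this small sketch. The variable names are hypothetical, and `log(34)` stands in for the hard-coded mean.

```r
set.seed(123)
# Simulated log tensile strengths for 1,000 specimens from the new furnace
log_strength <- rnorm(1000, mean = log(34), sd = 0.25)

# Back-transform to tensile strengths in MPa
strength_mpa <- exp(log_strength)

# Two estimates of the lognormal median: the sample median, and the
# exponentiated mean of the logs (the quantity the t-test works with)
sample_median   <- median(strength_mpa)
exp_mean_of_log <- exp(mean(log_strength))

c(sample_median = sample_median, exp_mean_of_log = exp_mean_of_log)
```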

Statistical significance tests: We conducted a statistical comparison of the median tensile strength for the new furnace, as estimated from the 1,000 (simulated) specimens from this furnace, versus the known median tensile strength of 35 MPa for the old furnace by testing the following hypothesis on the mean of the new furnace log tensile strengths:

• Null hypothesis: Mean of log tensile strengths for the new furnace = 3.5553.
• Alternative hypothesis: Mean of log tensile strengths for the new furnace ≠ 3.5553.

Here, the null hypothesis value of 3.5553 corresponds to the known median value of exp(3.5553) = 35 MPa of the lognormal tensile strength distribution for the old furnace. This hypothesis test can be performed using the one-sample t-test procedure. The R commands are:

# One-sample, two-sided t-test; t.test() has no "type" argument, so it is omitted
result <- t.test(glassDataLarge, mu = 3.555347134, alternative = "two.sided", conf.level = 0.95)
result

Confidence interval for the median of the lognormal distribution: The t-test procedure yields the confidence interval for the mean of the logs of the tensile strengths for the new furnace. The resulting interval endpoints can be transformed to confidence limits for the median tensile strengths in R through a simple transformation:

exp(result$conf.int)

### Scenario two: Failure to achieve statistical significance when there is practical significance

Simulation of sample for new furnace: We assumed that the tensile strength distribution for specimens from the new furnace is lognormal with a shape parameter of 0.25 and a median of 31 MPa. In other words, the logarithms of the tensile strengths are normally distributed with mean μ = 3.434 and standard deviation σ = 0.25.

We used the following R commands to generate a sample of 10 log tensile strengths from the preceding distribution:

set.seed(123)
n <- 10
mean <- 3.433986632; sd <- 0.25
glassDataSmall <- rnorm(n, mean = mean, sd = sd)

Statistical significance analyses and confidence interval estimation were performed using the R commands provided for scenario one.

### R commands used to generate Table 1

This table displays the probability of establishing a statistically significant difference in median tensile strength between the old and new tensile strength distributions assumed in scenario one (see Figure 1) for different sample sizes from the new furnace.

It was obtained using the following R commands:

mean <- 3.526359623 # Mean of the log of tensile strengths for the new furnace
meanbase <- 3.555347134 # Mean of the log of tensile strengths for the old furnace
sd <- 0.25 # Std dev of the log of tensile strengths for the new furnace
cl <- 0.95 # Confidence level

power.t.test(n = c(100, 200, 500, 1000, 1500), delta = meanbase - mean, sd = sd, sig.level = 1 - cl, type = "one.sample", alternative = "two.sided")$power

These commands yield the results for the 0.05 significance level. Results for the 0.01 and 0.10 significance levels can be obtained by changing the value of cl to 0.99 and 0.90, respectively.
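
Instead of editing `cl` by hand, all three significance levels can be tabulated in one pass. The sketch below is one possible arrangement, with hypothetical variable names. It calls `power.t.test()` once per combination of sample size and significance level and collects the results in a matrix.

```r
delta  <- log(35) - log(34)             # shift in mean log tensile strength
sizes  <- c(100, 200, 500, 1000, 1500)  # sample sizes for the new furnace
alphas <- c(0.01, 0.05, 0.10)           # significance levels

# Rows correspond to sample sizes, columns to significance levels
power_table <- sapply(alphas, function(a) {
  sapply(sizes, function(n) {
    power.t.test(n = n, delta = delta, sd = 0.25, sig.level = a,
                 type = "one.sample", alternative = "two.sided")$power
  })
})
dimnames(power_table) <- list(n = sizes, sig.level = alphas)

round(power_table, 3)
```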

### R commands used to generate Table 2

This table displays the probability of failing to establish a statistically significant difference in median tensile strength between the old and new tensile strength distributions assumed in scenario two (see Figure 2) for different sample sizes from the new furnace.

It was obtained using the following R commands:

mean <- 3.433986632 # Mean of the log of tensile strengths for the new furnace
meanbase <- 3.555347134 # Mean of the log of tensile strengths for the old furnace
sd <- 0.25 # Std dev of the log of tensile strengths for the new furnace
cl <- 0.95 # Confidence level

1 - power.t.test(n = c(10, 20, 50, 75, 100), delta = meanbase - mean, sd = sd, sig.level = 1 - cl, type = "one.sample", alternative = "two.sided")$power

These commands yield the results for the 0.05 significance level. Results for the 0.01 and 0.10 significance levels can be obtained by changing the value of cl to 0.99 and 0.90, respectively.

—N.D., G.J.H. and W.Q.M.

### Scenario two: Failure to achieve statistical significance when there is practical significance

Assumed tensile strength distribution for the new furnace. Now consider the scenario in which the difference in tensile strength distributions for specimens from the two furnaces is regarded to be large enough to be of practical significance. In particular, let’s again assume that the tensile strength distribution for specimens from the new furnace, like that from the old furnace, is lognormal with a shape parameter of 0.25.

Now assume that the median of the distribution has shifted from 35 MPa to 31 MPa. This distribution is shown by the blue solid curve in Figure 2, where it is contrasted with the known distribution for the old furnace. This 4 MPa difference in the medians between the two furnaces is of practical significance and, if known, would, in fact, be sufficient reason to reject the new furnace.

Assumed tensile strength data. The preceding information about the new furnace, unlike that for the old furnace, is again unknown to the manufacturer. All that is available under this scenario is tensile strength data from a scant sample of 10 (simulated) specimens, randomly selected from this furnace. This small sample is in strong contrast to the large sample (1,000 specimens) for scenario one. For purposes of illustration, we created such a sample by randomly selecting 10 observations via computer simulation from the assumed tensile strength distribution for the new furnace shown in Figure 2.

Statistical significance analysis. We conducted a statistical comparison of the median tensile strength for the new furnace, as estimated from the 10 (simulated) specimens from this furnace, versus the known median tensile strength of 35 MPa for the old furnace. In particular, the null hypothesis for the significance test again stated that the median tensile strength for the new furnace is 35 MPa. This analysis resulted in insufficient evidence of a statistically significant difference between the two distribution medians at the 5% significance level (that is, failure to reject the null hypothesis) and a p-value of 0.21.

The preceding results might suggest to some that it makes no difference, with regard to tensile strength, which furnace is used. This, however, would be an incorrect conclusion in light of the underlying—but unknown to the manufacturer—true distribution for the new furnace, shown in Figure 2, and the actual difference in distribution medians between the two furnaces.

Is the preceding result typical? Further analysis shows that, given the underlying tensile strength distribution for the new furnace shown in Figure 2, the probability of getting a p-value of 0.05 or more (and failing to declare statistical significance), in comparing the estimated median from a random sample of 10 specimens from the new furnace with the known median of 35 MPa for the old furnace, is 0.72 (obtained from Table 2). In addition, we found that in randomly selecting 10 specimens from the new furnace and comparing the median of this sample with the known median of 35 MPa for the old furnace, the probability of getting a p-value of 0.21 or greater is 0.42. Thus, the results again are typical of what you might expect in such simulations.

More on impact of sample size. Table 2 shows the probability of failing to establish a statistically significant difference between the medians of the tensile strength distributions for the two furnaces for different sample sizes for the new furnace and different significance levels, assuming the tensile strength distributions for the two furnaces, shown in Figure 2.
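
The probabilities of 0.72 and 0.42 quoted for scenario two can be approximated with `power.t.test()`, as in the sketch below. This is a plausible route to those numbers rather than a confirmed record of the authors' calculation. By default the function ignores the small chance of rejecting in the wrong direction, so the values are slight approximations.

```r
# True shift in mean log tensile strength under scenario two
delta <- log(35) - log(31)

# Probability that 10 specimens give a p-value above 0.05
prob_fail_05 <- 1 - power.t.test(n = 10, delta = delta, sd = 0.25,
                                 sig.level = 0.05, type = "one.sample",
                                 alternative = "two.sided")$power

# Probability that 10 specimens give a p-value of 0.21 or greater
prob_p_ge_21 <- 1 - power.t.test(n = 10, delta = delta, sd = 0.25,
                                 sig.level = 0.21, type = "one.sample",
                                 alternative = "two.sided")$power

c(prob_fail_05 = prob_fail_05, prob_p_ge_21 = prob_p_ge_21)
```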

In particular, despite the relatively large (in terms of practical significance) true differences between the tensile strength distributions for the two furnaces shown in Figure 2, failure to establish a statistically significant difference at a 5% significance level has a probability slightly less than 0.50 for samples of 20 specimens for the new furnace and seems likely for samples of 10 or fewer. On the other hand, for larger sample sizes, such as 75 or more, the analysis of the data will likely lead to a statistically significant difference.

The preceding discussion has hopefully convinced you to focus your data analyses beyond significance tests. But what do we recommend?

First, use an incisive plot of the data (for example, sample data displayed on a lognormal probability plot [1]). Then, to assess the practical significance of the results, construct an appropriate statistical interval. A variety of statistical intervals exist to address specific applications [2].

The most frequently used among these is a confidence interval. In the current application, for example, it would be appropriate to compute a confidence interval for the median tensile strength for the new furnace and compare it with the known median tensile strength of the old furnace.

As noted later, there is a close relationship between confidence intervals and significance tests. Confidence intervals, however, generally give much more information than do significance tests. This is because confidence intervals provide quantitative bounds on the statistical uncertainty, providing direct information about practical significance rather than just an accept-or-reject hypothesis decision.

Such intervals also shrink in length, as they should, with an increase in sample size and the associated reduction in uncertainty. In addition, confidence intervals generally are easier to explain to management and customers than hypothesis tests. To illustrate this, let’s return to the earlier scenarios.

Scenario one. A 95% confidence interval on the median of the tensile strength distribution for specimens from the new furnace is calculated from the 1,000 (simulated) specimens to be (33.6, 34.7) MPa. Roughly speaking, this means that you can be 95% sure that the median tensile strength for the new furnace is between 33.6 and 34.7 MPa. More precisely, you can assert that if there were many such intervals calculated from different sets of data, about 95% of such intervals would, in fact, contain the true median.
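
To show where such an interval comes from, the following sketch builds the 95% confidence interval for the median by hand: a t-based interval for the mean of the log tensile strengths, exponentiated back to MPa. It regenerates the scenario one sample with the seed given in the "Technical Details" sidebar, using `log(34)` in place of the hard-coded mean, so the result should be close to, though not guaranteed identical to, the interval quoted above.

```r
set.seed(123)
# Simulated log tensile strengths for 1,000 specimens from the new furnace
x <- rnorm(1000, mean = log(34), sd = 0.25)
n <- length(x)

# t-based 95% confidence interval for the mean of the logs
half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
ci_log <- mean(x) + c(-1, 1) * half_width

# Exponentiate the endpoints to get limits for the median in MPa
ci_median <- exp(ci_log)
round(ci_median, 1)

# The test rejects "median = 35 MPa" at the 5% level exactly when
# 35 lies outside this interval
rejects_35 <- ci_median[1] > 35 || ci_median[2] < 35
rejects_35
```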

The deviation of the median tensile strength for the new furnace from 35 MPa is statistically significant at the 5% significance level—that is, the null hypothesis is rejected. This is because the 95% confidence interval for the new furnace median does not include the null hypothesis value of 35 MPa.

Moreover, the deviation of the median tensile strength from 35 MPa could be as small as 0.3 MPa (35-34.7) and is unlikely to exceed 1.4 MPa (35-33.6). Even the latter deviation, however, would not be regarded as being of practical significance—despite its statistical significance.

Scenario two. A 95% confidence interval on the median of the tensile strength distribution for specimens from the new furnace is calculated from the 10 (simulated) specimens to be (26.6, 37.5) MPa. Because this interval includes 35 MPa, the data do not provide evidence at the 5% significance level to reject the null hypothesis that the median of the tensile strength distribution for the new furnace is 35 MPa.

Moreover, the limited data for the new furnace suggest that it is possible that the median of the tensile strength distribution for the new furnace is as much as 8.4 MPa (35–26.6) smaller than that for the old furnace. But it also appears possible that the median tensile strength for the new furnace is 2.5 MPa (37.5–35) greater than that of the old furnace.

Both conclusions would be regarded as being of practical significance. Thus, further analysis of the new furnace data suggests that there could be a difference of practical importance in favor of either the new or the old furnace.

What this really means is that the specimen-to-specimen variability is too large, and/or the sample size from the new furnace is too small to permit you to draw definitive conclusions, and additional samples from the new furnace are needed to obtain more conclusive results.
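
One way to judge how many additional specimens might be needed is to ask `power.t.test()` for the sample size that achieves a target probability of detecting the scenario two shift. The 90% target below is an assumption made for illustration; the article does not specify one.

```r
# Shift in mean log tensile strength that is of practical importance
delta <- log(35) - log(31)

# Solve for n, given a hypothetical target power of 0.90 at the 5% level
needed <- power.t.test(delta = delta, sd = 0.25, sig.level = 0.05,
                       power = 0.90, type = "one.sample",
                       alternative = "two.sided")

# Round up to a whole number of specimens
n_required <- ceiling(needed$n)
n_required
```

In practice, planning around the desired width of the confidence interval may be more in keeping with this column's recommendations, and the true shift is of course unknown when the study is planned.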

### Technical notes

• Details of the statistical analyses for the two scenarios—including the construction of the statistical significance tests and the calculation of the confidence intervals, as well as the data files for the 1,000 and 10 simulated observations for the two scenarios—can be found on this article’s webpage at www.qualityprogress.com.
• The calculation of significance tests and statistical intervals assumes that the given data can be considered to be a random sample from the population(s) of interest. They, therefore, quantify only the uncertainty due to sampling variability. In our examples, the data are from past production, but, in practice, you’re typically interested in performance for future production, which makes this, borrowing from W. Edwards Deming’s terminology, an "analytic study." See W. Edwards Deming, "On Probability as a Basis for Action," The American Statistician, Vol. 29, No. 4, 1975, pp. 146-152. Thus, a basic assumption underlying any future projections of our analyses is that the statistical distributions of tensile strengths of specimens from the two furnaces do not change from the past to the future. When this assumption is questionable, it might be appropriate to refrain from doing any statistical analysis or, as a minimum, to make clear the limitations of such analyses.
• The two scenarios in this column are focused on the comparison of the medians of the tensile strength distributions for the two furnaces because this was of greatest interest in this application. The analyses can be readily modified, however, to apply for other population properties of interest, such as different tensile strength distribution percentiles or the probability of tensile strength falling below a specified value.
• The two scenarios involved complete samples in that a quantitative tensile strength reading was obtained for all specimens. In many reliability applications, the data need to be analyzed before all units have failed, resulting in so-called "censored" observations. The general results presented in this column also extend to such situations.
• The controversy surrounding significance testing, and the related concept of p-values, has been in the spotlight recently and is the subject of a statement by the American Statistical Association. See Ronald L. Wasserstein and Nicole A. Lazar, "The ASA’s Statement on p-values: Context, Process, and Purpose," The American Statistician, Vol. 70, No. 2, 2016, pp. 129-133.
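
As a rough illustration of the third note above, the sketch below computes point estimates of two other properties of the new-furnace distribution from a simulated sample: the 10th percentile and the probability of falling below a threshold. The 25 MPa threshold and the choice of percentile are hypothetical. Only point estimates are shown; confidence intervals for these quantities require methods beyond this sketch, such as those covered in reference 2.

```r
set.seed(123)
# Simulated log tensile strengths for 1,000 specimens (scenario one)
x <- rnorm(1000, mean = log(34), sd = 0.25)

# Estimated lognormal parameters on the log scale
mu_hat    <- mean(x)
sigma_hat <- sd(x)

# Estimated 10th percentile of tensile strength in MPa
p10_hat <- qlnorm(0.10, meanlog = mu_hat, sdlog = sigma_hat)

# Estimated probability that tensile strength falls below 25 MPa
# (hypothetical threshold, not from the article)
prob_below_25 <- plnorm(25, meanlog = mu_hat, sdlog = sigma_hat)

c(p10_hat = p10_hat, prob_below_25 = prob_below_25)
```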

### References

1. Necip Doganaksoy, Gerald J. Hahn and William Q. Meeker, "Reliability (or Product Life) Data Analysis: A Case Study," Quality Progress, June 2000, pp. 115-121.
2. William Q. Meeker, Gerald J. Hahn and Luis A. Escobar, Statistical Intervals: A Guide for Practitioners and Researchers, second edition, Wiley, 2017.

Necip Doganaksoy is associate professor at Siena College School of Business in Loudonville, NY, following a 26-year career in industry, mainly at General Electric (GE). He has a doctorate in administrative and engineering systems from Union College in Schenectady, NY. Doganaksoy is a fellow of ASQ and the American Statistical Association.

Gerald J. Hahn is a retired manager of statistics at the GE Global Research Center in Schenectady. He has a doctorate in statistics and operations research from Rensselaer Polytechnic Institute in Troy, NY. Hahn is a fellow of ASQ and the American Statistical Association.

William Q. Meeker is professor of statistics and distinguished professor of liberal arts and sciences at Iowa State University in Ames. He has a doctorate in administrative and engineering systems from Union College. Meeker is a fellow of ASQ and the American Statistical Association.

This is an outstanding and easy-to-understand article with an excellent explanation of the p-value and "significance level"--a term that can be confusing in the absence of an explanation like the one given in "More on Significance Testing and p-values."
--Bill Levinson, 02-10-2018

In the discussion on confidence intervals, the authors state that a 95% confidence interval on the median roughly means that you can be 95% sure that the median falls within the stated confidence limits. While the authors go on to more precisely state the correct meaning of a confidence interval, I think their initial, "rough" statement is misleading. The step from the authors' "rough" statement to someone stating that the median is between the stated confidence limits with 95% probability is made all too frequently. Authors of the stature of these need to emphasize the more precise meaning and avoid the more "rough" version.

Thanks for the great article.
--Mark Fiedeldey, 11-19-2017
