## 2020

STATISTICS SPOTLIGHT

# Fallacies of Statistical Significance

## The case for focusing your data analyses beyond significance tests

by Necip Doganaksoy, Gerald J. Hahn and William Q. Meeker

Statistical significance (or hypothesis) tests, and
the related concept of *p*-values, are popular
tools in statistical data analysis. Unfortunately, the practical implications
of statistical significance often turn out to be limited and are frequently
misinterpreted and overstated, or even stated incorrectly.

This column—in the context of product reliability data analysis—will address the important distinction between statistical significance and practical significance, and urge practitioners to carry their analyses beyond significance tests—often by constructing appropriate confidence intervals.

### Statistical significance and *p*-values in reliability data analysis

In applications dealing with product reliability
and life data analysis, significance tests and *p*-values
tell whether a particular statistical hypothesis is reasonable based upon
analysis of the given data. Some typical statistical hypotheses are:

- "These two products have the same mean life."
- "The median of the product life distribution exceeds five years."
- "Ten-year reliability of this product is 80%."

For further elaboration, see the sidebar "More on
Significance Testing and *p*-values."

## More on Significance Testing and *p*-values

Statistical significance tests are designed so there is a small probability (such as 1% or 5%) of incorrectly rejecting a so-called "null hypothesis"—illustrated by the three statements in the text—when, in actuality, this hypothesis is true. This probability is known as the "significance level" or Type I error probability, and is frequently denoted by α.

A slight generalization of the standard significance test
is—instead of specifying a significance level—to calculate a
so-called "*p*-value" associated
with the null hypothesis, based on the data.

For example, for the null hypothesis "These two products
have the same mean life," the *p*-value calculated
from the data provides the probability of obtaining, by chance alone, a
difference between the two sample means that is as large or larger than that
observed, if, indeed, the two product populations have identical means.
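As a minimal sketch of this definition, the R snippet below computes a two-sided *p*-value for a two-sample comparison by hand and checks it against `t.test`. The life data are made up for illustration, and the equal-variance (pooled) form of the test is an assumption, not something specified in the column.

```r
# Hypothetical life data (in years) for two products; the values are invented
# purely to illustrate the calculation.
life_a <- c(4.1, 5.3, 6.0, 4.8, 5.5, 5.9, 4.4, 5.1)
life_b <- c(5.2, 6.1, 6.8, 5.7, 6.4, 6.9, 5.0, 6.3)

n_a <- length(life_a)
n_b <- length(life_b)

# Pooled-variance two-sample t statistic (assumes equal population variances).
pooled_var <- ((n_a - 1) * var(life_a) + (n_b - 1) * var(life_b)) / (n_a + n_b - 2)
t_stat <- (mean(life_a) - mean(life_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value: the probability, if the two population means are equal,
# of a t statistic at least as extreme as the one observed.
p_manual <- 2 * pt(abs(t_stat), df = n_a + n_b - 2, lower.tail = FALSE)

# The built-in test should agree with the manual calculation.
p_builtin <- t.test(life_a, life_b, var.equal = TRUE)$p.value

print(c(manual = p_manual, builtin = p_builtin))
```

The manual and built-in values should coincide, which makes explicit that the *p*-value is a tail probability computed under the assumption that the null hypothesis is true.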

If the *p*-value is
small—for example, less than 0.05—you typically reject the null
hypothesis. In this example, this rejection would result in the conclusion that
there is a statistically significant difference between the two population
means.

On the other hand, if the *p*-value is not small,
you would conclude that the data do not provide sufficient evidence to reject
the null hypothesis. In this example, this would mean concluding that the data
do not provide sufficient evidence to conclude that the two sampled populations
have different means.

However, a *p*-value of 0.06, for
example, comes closer to providing such evidence than a *p*-value of 0.60, for
example, even though neither allows you to formally reject the null hypothesis
at a 5% significance level.

All of this, of course, is far different from proving that the two population means are the same.

*—N.D., G.J.H. and W.Q.M.*

### Key limitation of significance tests and *p*-values

As we will demonstrate with two examples from
product reliability data analysis, a key limitation of the concept of
statistical significance (and *p*-values) is that
statistical significance may frequently differ from practical significance as
defined by the problem context. Specifically:

- If the sample size for the given data is large, a statistically significant result is likely to be obtained even when the actual magnitude of the effect is relatively small and of little or no practical importance.
- When the sample size for the given data is small, it is quite likely that insufficient evidence for a statistically significant result will be obtained—even though there is, indeed, an effect of practical importance.

Essentially,
practical significance depends on real-world considerations, dealing with such
matters as customer satisfaction, management expectations and cost. Statistical
significance and *p*-values, on the other hand,
are heavily affected not only by the magnitude of the real-world effects (as
they should be), but also the size of the sample upon which the analysis is
based—typically an extraneous factor in considering practical
significance.
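The dependence on sample size can be sketched in a few lines of R. The snippet holds an observed difference and standard deviation fixed and recomputes the one-sample *p*-value at several sample sizes. The numerical inputs are illustrative assumptions (they match the log-scale values used in the glass example that follows), and treating the same difference as observed at every sample size is a simplification.

```r
# Illustrative inputs: a small fixed difference on the log scale and a fixed
# standard deviation. These are assumptions chosen for the sketch.
observed_diff <- log(35) - log(34)
sd_log <- 0.25
n <- c(10, 100, 1000, 10000)

# One-sample t statistic and two-sided p-value at each sample size, treating
# the same difference and standard deviation as if observed at every n.
t_stat <- observed_diff / (sd_log / sqrt(n))
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

print(data.frame(n = n, t_stat = round(t_stat, 2), p_value = signif(p_value, 3)))
```

The effect size never changes in this sketch, yet the *p*-value falls steadily as the sample size grows, which is exactly why statistical significance alone says little about practical importance.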

### Glass strength example

The following example deals with tensile strength testing of glass specimens. This test characterizes the maximum stress that the specimens can withstand under a predefined load before breaking. Glass strength applications lend themselves especially well to this discussion because this business presents frequent situations with large and small data sets.

A glass manufacturer has extensive experience with a current manufacturing process. In particular, it has been established that the tensile strength distribution of specimens, built in an existing furnace, is stable over time and follows a lognormal distribution (as is typical in such applications) with a median value of 35 megapascals (MPa) and a shape parameter (standard deviation of log tensile strength) of 0.25. This distribution is shown by the red dashed curves in Figures 1 and 2.
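To make the lognormal description concrete, the following R sketch evaluates the old-furnace distribution using base R's `qlnorm` and `plnorm`. The median and shape parameter come from the text. The percentiles shown and the 25 MPa limit are hypothetical choices made only for illustration.

```r
# Parameters of the old-furnace tensile strength distribution (from the text).
median_old <- 35                 # MPa
shape <- 0.25                    # standard deviation of log tensile strength
meanlog_old <- log(median_old)   # the lognormal median equals exp(meanlog)

# Illustrative percentiles of tensile strength (MPa).
probs <- c(0.01, 0.10, 0.50, 0.90, 0.99)
strength_quantiles <- qlnorm(probs, meanlog = meanlog_old, sdlog = shape)
print(round(setNames(strength_quantiles, paste0(100 * probs, "%")), 1))

# Fraction of specimens expected to fall below a hypothetical 25 MPa limit.
p_below_25 <- plnorm(25, meanlog = meanlog_old, sdlog = shape)
print(p_below_25)
```

Working on the log scale is what allows the comparisons of medians in this column to be carried out with ordinary normal-theory methods.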

It is desired to replace the old furnace with a new furnace. Before making this important process change, however, the manufacturer wants to compare the tensile strength distributions of the two furnaces and, in particular, the medians of the two distributions.

With this in mind, a random sample of specimens built using the new furnace is to be tested to compare, among other things, the median of the estimated distribution of tensile strength for the new furnace with the assumed known median tensile strength of 35 MPa for the old furnace.

You can argue that in this example, like many other similar applications, there is no reason for product from two different furnaces to have identical tensile strength distributions or, for that matter, identical distribution medians. It was felt, however, that the difference between furnaces might not be sufficiently large to be considered practically important.

We will now describe two scenarios, resulting in:

- Statistical significance when—as a consequence of a large sample size—the effect under consideration is, in fact, without practical significance.
- Failure to achieve statistical
significance—as a consequence of a small sample size—when, in fact,
the effect is of practical significance.

### Scenario one: Statistical significance without practical significance

**Assumed tensile strength distribution for new furnace.** Under this scenario, assume that the tensile strength distribution for specimens from the new furnace, like that for the old furnace, is lognormal with a shape parameter of 0.25. However, now assume that the median of this distribution has shifted (slightly down) from 35 MPa to 34 MPa.

This distribution is shown by the blue solid curve in Figure 1. The small deterioration in tensile strength for the new furnace, as compared with the old furnace, can be readily compensated for by taking other measures and—if it were known—would not be regarded as sufficiently large to disqualify the new furnace.

**Assumed tensile strength data.** The preceding information about the new furnace, unlike that for the old furnace, is unknown to the manufacturer. All that is available is tensile strength data from a sample of 1,000 specimens, randomly selected from the new furnace. For purposes of illustration, such a sample is created by randomly selecting 1,000 observations via computer simulation from the assumed tensile strength distribution for the new furnace, shown in Figure 1.

**Statistical significance analysis.** We conducted a statistical comparison of the median tensile strength for the new furnace, as estimated from the 1,000 (simulated) specimens from this furnace, versus the known median tensile strength of 35 MPa for the old furnace. In particular, the null hypothesis for the significance test stated that the median tensile strength for the new furnace is also 35 MPa.

This analysis resulted in evidence of a statistically significant difference between the two distribution medians at a 5% significance level (that is, rejection of the stated null hypothesis) and a p-value of 0.0015. This finding contradicts the fact that, unknown to the manufacturer, the true tensile strength distribution for the new furnace does not differ from the distribution for the old furnace to a degree that is of practical significance.

**Is the preceding result typical?** You might think that the sample results obtained in the preceding study were a fluke, that is, an idiosyncrasy of the data generated by our simulation.

However, further analysis (see the online sidebar, "Technical Details") showed that, given the underlying tensile strength distribution for the new furnace shown in Figure 1, the probability of getting a p-value of 0.05 or less (and declaring a statistically significant difference) in comparing the estimated median from a random sample of 1,000 specimens from the new furnace with the known median of 35 MPa for the old furnace, is 0.956 (obtained from Table 1).

Moreover, in randomly selecting 1,000 specimens from the new furnace and comparing the median of this sample with the known median of 35 MPa for the old furnace, the probability of getting a p-value of 0.0015 or less is 0.69. Thus, these results are typical of what you might expect in such simulations.
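One way to check probabilities like these independently is by brute-force simulation. The R sketch below repeats the scenario-one study many times and counts how often the *p*-value falls at or below 0.05 and at or below 0.0015. The number of replications and the random seed are arbitrary assumptions, so the estimates will only approximate the values quoted above (0.956 and 0.69).

```r
set.seed(2024)          # arbitrary seed, chosen only for reproducibility

n_reps <- 2000          # number of simulated studies (an arbitrary choice)
n <- 1000               # specimens per simulated study
mu_new <- log(34)       # mean of log tensile strength for the new furnace
mu_old <- log(35)       # null hypothesis value, from the old furnace
sd_log <- 0.25          # standard deviation of log tensile strength

# For each simulated study, draw a sample and record the t-test p-value.
p_values <- replicate(n_reps, {
  log_strength <- rnorm(n, mean = mu_new, sd = sd_log)
  t.test(log_strength, mu = mu_old)$p.value
})

# Proportion of studies declaring significance at the 5% level.
prop_significant <- mean(p_values <= 0.05)
# Proportion with a p-value at least as small as the 0.0015 reported above.
prop_below_observed <- mean(p_values <= 0.0015)

print(c(prop_significant = prop_significant,
        prop_below_observed = prop_below_observed))
```

Increasing `n_reps` tightens the Monte Carlo estimates at the cost of run time.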

**More on impact of sample size.** Table 1 shows the probability of establishing a statistically significant difference between the medians of the tensile strength distributions for the two furnaces for different sample sizes for the new furnace and different significance levels, assuming the tensile strength distributions for the two furnaces shown in Figure 1.

In particular—despite the small true differences between the medians of the tensile strength distributions for the two furnaces shown in Figure 1—evidence at a 5% significance level of a statistically significant difference in medians seems likely for sample sizes of 500 specimens or more for the new furnace and seems almost assured for samples of 1,500 or more.

On the other hand, for much smaller sample sizes from the new furnace, such as 100 specimens, it is unlikely that the analysis of the data will lead to a statistically significant difference.

## Technical Details

This sidebar provides technical details about the statistical analyses (construction of statistical significance tests and calculation of confidence intervals) for the two scenarios and the generation of Online Tables 1 and 2.

### Scenario one: Statistical significance without practical significance

**Simulation of sample for new furnace:** We assumed that the tensile strength distribution for specimens from the new furnace is lognormal with a shape parameter of 0.25 and a median of 34 MPa. In other words, the logarithms of the tensile strengths are normally distributed with mean *μ* = 3.5264 and standard deviation *σ* = 0.25. Note that the median of the lognormal distribution for tensile strengths is exp(*μ*) = exp(3.5264) = 34 MPa.

We used the following R commands to generate a sample of 1,000 log tensile strengths from the preceding distribution:

    set.seed(123)
    n <- 1000
    mean <- 3.526359623; sd <- 0.25
    glassDataLarge <- rnorm(n, mean = mean, sd = sd)

**Statistical significance tests:** We conducted a statistical comparison of the median tensile strength for the new furnace, as estimated from the 1,000 (simulated) specimens from this furnace, versus the known median tensile strength of 35 MPa for the old furnace by testing the following hypothesis on the mean of the new furnace log tensile strengths:

**Null hypothesis:** Mean of log tensile strengths for the new furnace = 3.5553.

**Alternative hypothesis:** Mean of log tensile strengths for the new furnace ≠ 3.5553.

Here, the null hypothesis value of 3.5553 corresponds to the known median value of exp(3.5553) = 35 MPa of the lognormal tensile strength distribution for the old furnace. This hypothesis test can be performed using the one-sample t-test procedure. The R commands are:

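    # One-sample, two-sided t-test on the log tensile strengths. t.test() has no
    # 'type' argument; it runs a one-sample test when no second sample is given.
    result <- t.test(glassDataLarge, mu = 3.555347134,
                     alternative = "two.sided", conf.level = 0.95)
    result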

**Confidence interval for the median of the lognormal distribution:** The t-test procedure yields the confidence interval for the mean of the logs of the tensile strengths for the new furnace. The resulting interval endpoints can be transformed to confidence limits for the median tensile strength in R through a simple transformation:

    exp(result$conf.int)

### Scenario two: Failure to achieve statistical significance when there is practical significance

**Simulation of sample for new furnace:** We assumed that the tensile strength distribution for specimens from the new furnace is lognormal with a shape parameter of 0.25 and a median of 31 MPa. In other words, the logarithms of the tensile strengths are normally distributed with mean *μ* = 3.434 and standard deviation *σ* = 0.25.

We used the following R commands to generate a sample of 10 log tensile strengths from the preceding distribution:

    set.seed(123)
    n <- 10
    mean <- 3.433986632; sd <- 0.25
    glassDataSmall <- rnorm(n, mean = mean, sd = sd)

Statistical significance analyses and confidence interval estimation were performed using the R commands provided for scenario one.

### R commands used to generate Table 1

This table displays the probability of establishing a statistically significant difference in median tensile strength between the old and new tensile strength distributions assumed in scenario one (see Figure 1) for different sample sizes from the new furnace.

It was obtained using the following R commands:

    mean <- 3.526359623      # Mean of the log of tensile strengths for the new furnace
    meanbase <- 3.555347134  # Mean of the log of tensile strengths for the old furnace
    sd <- 0.25               # Std dev of the log of tensile strengths for the new furnace
    cl <- 0.95               # Confidence level
    power.t.test(n = c(100, 200, 500, 1000, 1500), delta = meanbase - mean,
                 sd = sd, sig.level = 1 - cl, type = "one.sample",
                 alternative = "two.sided")$power

These commands yield the results for the 0.05 significance level. Results for the 0.01 and 0.10 significance levels can be obtained by changing the value of cl to 0.99 and 0.90, respectively.

### R commands used to generate Table 2

This table displays the probability of failing to establish a statistically significant difference in median tensile strength between the old and new tensile strength distributions assumed in scenario two (see Figure 2) for different sample sizes from the new furnace.

It was obtained using the following R commands:

    mean <- 3.433986632      # Mean of the log of tensile strengths for the new furnace
    meanbase <- 3.555347134  # Mean of the log of tensile strengths for the old furnace
    sd <- 0.25               # Std dev of the log of tensile strengths for the new furnace
    cl <- 0.95               # Confidence level
    1 - power.t.test(n = c(10, 20, 50, 75, 100), delta = meanbase - mean,
                     sd = sd, sig.level = 1 - cl, type = "one.sample",
                     alternative = "two.sided")$power

These commands yield the results for the 0.05 significance level. Results for the 0.01 and 0.10 significance levels can be obtained by changing the value of cl to 0.99 and 0.90, respectively.

*—N.D., G.J.H. and W.Q.M.*

### Scenario two: Failure to achieve statistical significance when there is practical significance

**Assumed tensile strength distribution for the new furnace.** Now consider the scenario in which the difference in tensile strength distributions for specimens from the two furnaces is regarded to be large enough to be of practical significance. In particular, let’s again assume that the tensile strength distribution for specimens from the new furnace, like that from the old furnace, is lognormal with a shape parameter of 0.25.

Now assume that the median of the distribution has shifted from 35 MPa to 31 MPa. This distribution is shown by the blue solid curve in Figure 2, where it is contrasted with the known distribution for the old furnace. This 4 MPa difference in the medians between the two furnaces is of practical significance and, if known, would be sufficient reason to reject the new furnace.

**Assumed tensile strength data.** The preceding information about the new furnace, unlike that for the old furnace, is again unknown to the manufacturer. All that is available under this scenario is tensile strength data from a scant sample of 10 (simulated) specimens, randomly selected from this furnace. This small sample is in strong contrast to the large sample (1,000 specimens) for scenario one. For purposes of illustration, we created such a sample by randomly selecting 10 observations via computer simulation from the assumed tensile strength distribution for the new furnace shown in Figure 2.

**Statistical significance analysis.** We conducted a statistical comparison of the median tensile strength for the new furnace, as estimated from the 10 (simulated) specimens from this furnace, versus the known median tensile strength of 35 MPa for the old furnace. In particular, the null hypothesis for the significance test again stated that the median tensile strength for the new furnace is 35 MPa. This analysis resulted in insufficient evidence of a statistically significant difference between the two distribution medians at the 5% significance level (that is, failure to reject the null hypothesis) and a p-value of 0.21.

The preceding results might suggest to some that it makes no difference, with regard to tensile strength, which furnace is used. This, however, would be an incorrect conclusion in light of the underlying—but unknown to the manufacturer—true distribution for the new furnace, shown in Figure 2, and the actual difference in distribution medians between the two furnaces.

**Is the preceding result typical?** Further analysis shows that, given the underlying tensile strength distribution for the new furnace shown in Figure 2, the probability of getting a p-value of 0.05 or more (and failing to declare statistical significance), in comparing the estimated median from a random sample of 10 specimens from the new furnace with the known median of 35 MPa for the old furnace, is 0.72 (obtained from Table 2). In addition, we found that in randomly selecting 10 specimens from the new furnace and comparing the median of this sample with the known median of 35 MPa for the old furnace, the probability of getting a p-value of 0.21 or greater is 0.42. Thus, the results again are typical of what you might expect in such simulations.

**More on impact of sample size.** Table 2 shows the probability of failing to establish a statistically significant difference between the medians of the tensile strength distributions for the two furnaces for different sample sizes for the new furnace and different significance levels, assuming the tensile strength distributions for the two furnaces shown in Figure 2.

In particular, despite the relatively large (in terms of practical significance) true differences between the tensile strength distributions for the two furnaces shown in Figure 2, failure to establish a statistically significant difference at a 5% significance level has a probability slightly less than 0.50 for samples of 20 specimens for the new furnace and seems likely for samples of 10 or fewer. On the other hand, for larger sample sizes, such as 75 or more, the analysis of the data will likely lead to a statistically significant difference.

### A confidence interval is generally more informative

The preceding discussion has hopefully convinced you to focus your data analyses beyond significance tests. But what do we recommend?

First, use an incisive plot of the data (for
example, sample data displayed on a lognormal probability plot^{1}).
Then, to assess the practical significance of the results, construct an
appropriate statistical interval. A variety of statistical intervals exist to
address specific applications.^{2}
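A rough version of such a plot can be sketched in base R by drawing a normal quantile-quantile plot of the log tensile strengths. This is a simplified stand-in for a formal lognormal probability plot (which labels the axis with probabilities rather than normal quantiles), and the simulated data, sample size and seed are assumptions made for illustration.

```r
set.seed(123)  # arbitrary seed for reproducibility

# Simulated tensile strengths (MPa); the parameters are illustrative.
strength <- rlnorm(50, meanlog = log(34), sdlog = 0.25)

# Normal quantile-quantile coordinates of the log strengths. If the data are
# lognormal, these points should fall close to a straight line.
qq <- qqnorm(log(strength), plot.it = FALSE)

plot(qq$x, qq$y,
     xlab = "Standard normal quantile",
     ylab = "Log tensile strength",
     main = "Lognormal probability plot (sketch)")
qqline(log(strength))

# Correlation of the plotted points as a rough summary of straightness.
qq_correlation <- cor(qq$x, qq$y)
print(qq_correlation)
```

Marked curvature in the plotted points would suggest that the lognormal assumption behind the subsequent interval calculations deserves a closer look.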

The most frequently used among these is a confidence interval. In the current application, for example, it would be appropriate to compute a confidence interval for the median tensile strength for the new furnace and compare it with the known median tensile strength of the old furnace.

As noted later, there is a close relationship between confidence intervals and significance tests. Confidence intervals, however, generally give much more information than do significance tests. This is because confidence intervals provide quantitative bounds on the statistical uncertainty, providing direct information about practical significance rather than just an accept-or-reject hypothesis decision.

Such intervals also shrink in length—as they should—with an increase in sample size and the associated reduction in uncertainty. In addition, confidence intervals generally are easier to explain to management and customers than hypothesis tests. To illustrate this, let’s return to the earlier scenarios.

**Scenario one**. A 95%
confidence interval on the median of the tensile strength distribution for
specimens from the new furnace is calculated from the 1,000 (simulated)
specimens to be (33.6, 34.7) MPa. Roughly speaking, this means that you can be
95% sure that the median tensile strength for the new furnace is between 33.6
and 34.7 MPa. More precisely, you can assert that if there were many such
intervals calculated from different sets of data, about 95% of such intervals
would, in fact, contain the true median.
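The repeated-sampling interpretation can be illustrated by simulation. The R sketch below draws many samples from the scenario-one distribution, computes a 95% confidence interval for the median from each, and reports the fraction of intervals that contain the true median. The number of replications and the seed are arbitrary assumptions.

```r
set.seed(2024)           # arbitrary seed for reproducibility

n_reps <- 2000           # number of repeated samples (an arbitrary choice)
n <- 1000                # specimens per sample
true_median <- 34        # MPa, the assumed new-furnace median in scenario one
sd_log <- 0.25           # standard deviation of log tensile strength

# For each repeated sample, compute the 95% interval for the median and
# record whether it contains the true median.
covered <- replicate(n_reps, {
  log_strength <- rnorm(n, mean = log(true_median), sd = sd_log)
  ci_median <- exp(t.test(log_strength, conf.level = 0.95)$conf.int)
  ci_median[1] <= true_median && true_median <= ci_median[2]
})

coverage <- mean(covered)
print(coverage)
```

The estimated coverage should be close to 0.95. The 95% describes the long-run behavior of the procedure, not the probability that any one computed interval contains the median.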

The deviation of the median tensile strength for the new furnace from 35 MPa is statistically significant at the 5% significance level—that is, the null hypothesis is rejected. This is because the 95% confidence interval for the new furnace median does not include the null hypothesis value of 35 MPa.

Moreover, the deviation of the median tensile strength from 35 MPa could be as small as 0.3 MPa (35-34.7) and is unlikely to exceed 1.4 MPa (35-33.6). Even the latter deviation, however, would not be regarded as being of practical significance—despite its statistical significance.

**Scenario two**. A 95% confidence interval on the median of the
tensile strength distribution for specimens from the new furnace is calculated
from the 10 (simulated) specimens to be (26.6, 37.5) MPa. Because this interval
includes 35 MPa, the data do not provide evidence at the 5% significance level
to reject the null hypothesis that the median of the tensile strength
distribution for the new furnace is 35 MPa.

Moreover, the limited data for the new furnace suggest that it is possible that the median of the tensile strength distribution for the new furnace is as much as 8.4 MPa (35–26.6) smaller than that for the old furnace. But it also appears possible that the median tensile strength for the new furnace is 2.5 MPa (37.5–35) greater than that of the old furnace.

Both conclusions would be regarded as being of practical significance. Thus, further analysis of the new furnace data suggests that there could be a difference of practical importance in favor of either the new or the old furnace.
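The reasoning applied to the two intervals can be summarized as a small decision rule. The R sketch below is one possible formalization, not the authors' procedure: the function name `assess_interval` and the 2 MPa practical-importance margin are hypothetical, and in practice the margin would have to come from engineering and business judgment.

```r
# Classify a confidence interval for the median against a reference value and
# a practical-importance margin. The margin is a hypothetical input; it is not
# given in the column.
assess_interval <- function(ci, reference, margin) {
  deviation <- ci - reference   # interval endpoints for (median - reference)
  if (all(abs(deviation) < margin)) {
    # Every plausible deviation is smaller than the margin.
    "difference not of practical importance"
  } else if (all(abs(deviation) >= margin) && prod(sign(deviation)) > 0) {
    # Every plausible deviation is at least the margin, in one direction.
    "difference of practical importance"
  } else {
    "inconclusive: more data needed"
  }
}

# Intervals reported in the text, compared with the old-furnace median.
scenario_one <- assess_interval(c(33.6, 34.7), reference = 35, margin = 2)
scenario_two <- assess_interval(c(26.6, 37.5), reference = 35, margin = 2)
print(c(scenario_one = scenario_one, scenario_two = scenario_two))
```

With this assumed margin, the scenario-one interval lies entirely within the margin even though it excludes 35 MPa, while the scenario-two interval is too wide to support either conclusion.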

What this really means is that the specimen-to-specimen variability is too large, and/or the sample size from the new furnace is too small to permit you to draw definitive conclusions, and additional samples from the new furnace are needed to obtain more conclusive results.
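How many additional specimens might be needed can be approximated with `power.t.test` by leaving `n` unspecified and supplying a target power instead. The sketch below assumes the scenario-two shift (35 MPa to 31 MPa) and a log-scale standard deviation of 0.25. The 90% power target is an illustrative choice, not a recommendation from the column.

```r
# Assumed inputs from scenario two: a shift in median from 35 MPa to 31 MPa
# and a standard deviation of 0.25 on the log scale.
delta <- log(35) - log(31)
sd_log <- 0.25

# Solve for the sample size by leaving n unspecified. The 90% power target
# is an illustrative assumption.
plan <- power.t.test(delta = delta, sd = sd_log, sig.level = 0.05,
                     power = 0.90, type = "one.sample",
                     alternative = "two.sided")

n_required <- ceiling(plan$n)
print(n_required)
```

A calculation of this kind depends on the assumed shift and standard deviation, both of which are unknown in practice, so it is best treated as a planning aid rather than a guarantee.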

### Technical notes

- Details of the statistical analyses for the two scenarios—including the construction of the statistical significance tests and the calculation of the confidence intervals, as well as the data files for the 1,000 and 10 simulated observations for the two scenarios—can be found on this article’s webpage at www.qualityprogress.com.
- The calculation of significance tests and statistical intervals assumes that the given data can be considered to be a random sample from the population(s) of interest. They, therefore, quantify only the uncertainty due to sampling variability. In our examples, the data are from past production, but, in practice, you’re typically interested in performance for future production, which makes this, borrowing from W. Edwards Deming’s terminology, an "analytic study." See W. Edwards Deming, "On Probability as a Basis for Action," The American Statistician, Vol. 29, No. 4, 1975, pp. 146-152. Thus, a basic assumption underlying any future projections of our analyses is that the statistical distributions of tensile strengths of specimens from the two furnaces do not change from the past to the future. When this assumption is questionable, it might be appropriate to refrain from doing any statistical analysis or, as a minimum, to make clear the limitations of such analyses.
- The two scenarios in this column are focused on the comparison of the medians of the tensile strength distributions for the two furnaces because this was of greatest interest in this application. The analyses can be readily modified, however, to apply for other population properties of interest, such as different tensile strength distribution percentiles or the probability of tensile strength falling below a specified value.
- The two scenarios involved complete samples in that a quantitative tensile strength reading was obtained for all specimens. In many reliability applications, the data need to be analyzed before all units have failed, resulting in so-called "censored" observations. The general results presented in this column also extend to such situations.
- The controversy surrounding significance testing, and the related concept of *p*-values, has been in the spotlight recently and is the subject of a statement by the American Statistical Association. See Ronald L. Wasserstein and Nicole A. Lazar, "The ASA’s Statement on *p*-values: Context, Process, and Purpose," The American Statistician, Vol. 70, No. 2, 2016, pp. 129-133.

### References

- Necip Doganaksoy, Gerald J. Hahn and William Q. Meeker, "Reliability (or Product Life) Data Analysis: A Case Study," *Quality Progress*, June 2000, pp. 115-121.
- William Q. Meeker, Gerald J. Hahn and Luis A. Escobar, *Statistical Intervals: A Guide for Practitioners and Researchers*, second edition, Wiley, 2017.

**Necip Doganaksoy** is associate professor at Siena College School of Business in Loudonville, NY,
following a 26-year career in industry, mainly at General Electric (GE). He has
a doctorate in administrative and engineering systems from Union College in
Schenectady, NY. Doganaksoy is a fellow of ASQ and the American Statistical
Association.

**Gerald J. Hahn** is
a retired manager of statistics at the GE Global Research Center in
Schenectady. He has a doctorate in statistics and operations research from
Rensselaer Polytechnic Institute in Troy, NY. Hahn is a fellow of ASQ and the
American Statistical Association.

**William Q. Meeker** is professor of statistics and distinguished professor of liberal arts and
sciences at Iowa State University in Ames. He has a doctorate in administrative
and engineering systems from Union College. Meeker is a fellow of ASQ and the
American Statistical Association.

In the discussion on confidence intervals, the authors state that a 95% confidence interval on the median roughly means that you can be 95% sure that the median falls within the stated confidence limits. While the authors go on to more precisely state the correct meaning of a confidence interval, I think their initial, "rough" statement is misleading. The step from the authors' "rough" statement to someone stating that the median is between the stated confidence limits with 95% probability is made all too frequently. Authors of this stature need to emphasize the more precise meaning and avoid the more "rough" version.

Thanks for the great article.

--Mark Fiedeldey, 11-19-2017

This is an outstanding and easy to understand article with an excellent explanation of the p value and "significance level"--a term that can be confusing in the absence of an explanation like the one given in "More on Significance Testing and p-values."

--Bill Levinson, 02-10-2018