Fleshing Things Out

Combining p-values to test for significance in a hypothesis

by Connie M. Borror

A few months ago, a colleague asked me to review some data a student had gathered on cannibalization in spiders.

They had 10 families of spiders, and each family had 14 members. We’ll call the families A, B, C … J. The families were randomly paired. Within each pair, a member from one family was put into a container with one member of another family, as shown in Figure 1.

Figure 1

The original study was designed to determine whether spiders from different families cannibalized more quickly than spiders from the same family. Previous studies indicated that spiders in the same family tended to cannibalize more quickly than nonfamily members.1 The spider pairs were observed to see which spider ate, or cannibalized, the other one. Each outcome was coded as zero if the spider was eaten and one if it did the eating.

The null hypothesis was that, within each pair of families, a spider from either family had a 50/50 chance of being cannibalized. So for each pair, we can test that the proportion of successes for one family equals the proportion of successes for the second family, or we can test that the proportion of successes equals 0.50; the two hypotheses are equivalent, just tested differently.

It’s important to note that the families have nothing in common aside from being spiders. For example, we have no reason to combine the outcomes for family A with the outcomes for family C. As a result, we have five independent hypothesis tests.
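As a hedged illustration of what one of these five tests might look like, the sketch below runs an exact two-sided binomial test of the hypothesis that the proportion of successes equals 0.50. The column does not say which test the student actually used, so the choice of an exact binomial test is an assumption. The counts (11 successes in 14 contests) and the function name `binomial_two_sided_p` are also hypothetical.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probabilities of every outcome that is no more likely
    than the observed one under the null proportion p0.
    """
    pmf = [
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # A small relative tolerance guards against floating-point ties.
    tail = sum(prob for prob in pmf if prob <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical data: one family's spider did the eating in 11 of 14 contests.
p_value = binomial_two_sided_p(11, 14)
print(f"two-sided p-value = {p_value:.4f}")
```

With only 14 contests per pair, even a lopsided split gives a p-value that is not very small, which is the small-sample problem that motivates combining the five results.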

The student correctly tested each hypothesis and obtained five individual p-values. Three of the five p-values were less than 0.10, with one of them less than 0.05, giving my colleague and the student the impression that maybe something was going on here: Perhaps families do not have an equal chance of "eat or be eaten."

But my colleague and the student realized the sample sizes were small and asked whether they could somehow combine their results to obtain an overall test of significance to determine whether there was really something going on that merited further investigation. In addition, my colleague wanted a single p-value to report, not five.

At this point, you may be wondering what this has to do with quality or a Statistics Roundtable column. Cannibalizing spiders may have nothing to do with quality, but the method used to report a single p-value for several independent significance tests does.

After researching the problem, I came across Ronald A. Fisher’s method for combining p-values2, 3 and realized it has been used extensively for meta-analysis in areas such as genetics, ecology, biosurveillance and public health monitoring.4,5

Fisher’s method

Using this method, we first combine all of the p-values into a single statistic, $X^2 = -2 \sum_{i=1}^{k} \ln(p_i)$, in which $p_i$ represents the ith p-value, and k is the number of p-values to be combined. There are two degrees of freedom for each individual p-value, so when all of the null hypotheses are true, this statistic follows a chi-square distribution with 2k degrees of freedom. After this value is calculated, the combined p-value, which is the upper-tail probability of that distribution, can be obtained from Excel. I also have used a program called MetaP, which is free and can be found online, to carry out the calculations.6
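The two degrees of freedom per p-value follow from a standard argument. It is sketched here assuming that every null hypothesis is true, the tests are independent and each p-value is continuous, so that it is uniformly distributed on (0, 1). For discrete tests, the uniformity holds only approximately.

```latex
\begin{align*}
P\left(-2\ln p_i > x\right) &= P\left(p_i < e^{-x/2}\right) = e^{-x/2}, \quad x \ge 0
  && \Rightarrow\; -2\ln p_i \sim \chi^2_{2}, \\
X^2 = -2\sum_{i=1}^{k} \ln p_i &= \sum_{i=1}^{k} \left(-2\ln p_i\right)
  && \Rightarrow\; X^2 \sim \chi^2_{2k}.
\end{align*}
```

The first line uses the fact that the upper-tail probability of a chi-square distribution with two degrees of freedom is exactly $e^{-x/2}$. The second uses the fact that independent chi-square variables add, along with their degrees of freedom.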

To illustrate, suppose the p-values for the five independent spider tests were 0.095, 0.105, 0.999, 0.082 and 0.028. Given this information, we obtain a combined p-value.

Table 1 displays the Excel layout and calculations for the test. For this problem, the combined p-value is 0.019, which would lead to the rejection of the null hypothesis.

Table 1

This would be interpreted as saying the accumulation of information from the tests provides evidence the cannibalization rate may not be the same across families.

From a practical viewpoint, it would lead us to conduct larger experiments specifically designed to more fully investigate this phenomenon. It’s also interesting to note you don’t need to know any additional information about the study from which the p-value came.

For example, it is not necessary to know the sample sizes to carry out this test. The sample sizes have been accounted for in the original calculation of the individual p-value.
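The combined p-value for the five illustrative spider p-values can also be checked outside Excel. The following is a minimal sketch, not the layout used in Table 1. The function name `fisher_combined_p` is an assumption, and the chi-square tail probability is computed from the closed form that applies when the degrees of freedom are even, which is always the case here. Only the p-values themselves are passed in.

```python
from math import exp, factorial, log


def fisher_combined_p(p_values):
    """Return (statistic, combined p-value) for Fisher's method.

    Assumes independent tests and p-values strictly greater than zero.
    """
    k = len(p_values)
    statistic = -2.0 * sum(log(p) for p in p_values)
    # For an even number of degrees of freedom (2k), the chi-square
    # upper-tail probability has the closed form
    #   exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = statistic / 2.0
    combined = exp(-half) * sum(half**j / factorial(j) for j in range(k))
    return statistic, combined


spider_p_values = [0.095, 0.105, 0.999, 0.082, 0.028]
statistic, combined_p = fisher_combined_p(spider_p_values)
print(
    f"X^2 = {statistic:.2f} on {2 * len(spider_p_values)} df, "
    f"combined p = {combined_p:.3f}"
)
# prints: X^2 = 21.37 on 10 df, combined p = 0.019
```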

Proceed with care

The combined p-value approach is simple to use, but it should be used with caution. The approach was intended to indicate whether the results of the individual tests as a whole (in the aggregate) are significant when you are testing the same or similar hypotheses.

With a small combined p-value, we can reject the hypothesis that all of our individual null hypotheses are true and conclude that our combined data indicate the shared null hypothesis is false.

But Fisher’s method doesn’t require that all null hypotheses be the same, as they were in the spider example. In fact, many applications, such as microarray analysis, involve (in some cases) thousands of null hypotheses that aren’t the same.

In that case, the alternative hypothesis is that at least one of the null hypotheses is false. During the past decade, the method and its alternative forms have been integrated into surveillance and monitoring methods to aid in detecting unusual activity or anomalies in processes.7
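One consequence of this alternative hypothesis is that a single very small p-value can produce a significant combined result even when every other test is unremarkable. The sketch below illustrates this with two sets of made-up p-values (they are not from the spider study). The function name `fisher_combined_p` is hypothetical.

```python
from math import exp, factorial, log


def fisher_combined_p(p_values):
    """Combined p-value from Fisher's method (independent tests, p > 0)."""
    k = len(p_values)
    half = -sum(log(p) for p in p_values)  # half of the chi-square statistic
    # Closed-form chi-square upper tail for 2k (even) degrees of freedom.
    return exp(-half) * sum(half**j / factorial(j) for j in range(k))


# Made-up values: five unremarkable tests ...
all_moderate = [0.5, 0.6, 0.7, 0.4, 0.3]
# ... versus four unremarkable tests and one very small p-value.
one_extreme = [0.5, 0.6, 0.7, 0.4, 0.0001]

p_moderate = fisher_combined_p(all_moderate)
p_extreme = fisher_combined_p(one_extreme)
print(f"all moderate: combined p = {p_moderate:.3f}")
print(f"one extreme:  combined p = {p_extreme:.4f}")
```

A small combined p-value in the second case says only that at least one null hypothesis appears to be false. It does not say which one, or that all of them are.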

There has been much debate about the power and appropriateness of this simple test. As a result, there are many alternative methods that have been developed to combine p-values from various independent sources.8

I encourage interested readers to seek out more information about the uses and drawbacks of these methods for combining information, and to determine whether such methods can be useful in other areas of the quality field.


  1. J. Chadwick Johnson, Kathryn Kitchen and Maydianne C.B. Andrade, "Family Affects Sibling Cannibalism in the Black Widow Spider, Latrodectus hesperus," Ethology, 2010, Vol. 116, pp. 770–777.
  2. Ronald A. Fisher, Statistical Methods for Research Workers, Oliver and Boyd, 1925.
  3. Charles Frederick Mosteller and Ronald A. Fisher, "Questions and Answers: Combining Independent Tests of Significance," American Statistician, 1948, Vol. 2, No. 5, pp. 30-31.
  4. Howard S. Burkom, Sean Murphy, Jacqueline Coberly and Kathy Hurt-Mullen, "Public Health Monitoring Tools for Multiple Data Streams," Morbidity and Mortality Weekly Report, Centers for Disease Control, 2005, www.cdc.gov/mmwr/preview/mmwrhtml/su5401a11.htm.
  5. Josep Roure, Artur Dubrawski and Jeff Schneider, "A Study Into Detection of Bio-Events in Multiple Streams of Surveillance Data," Intelligence and Security Informatics: Biosurveillance, 2007, pp. 124-133, www.cs.cmu.edu/~schneide/JosepBioWorkshop07.pdf.
  6. Dongliang Ge, "MetaP: A Program to Combine p-values," software, 2012, www.svaproject.org/metap.php.
  7. Roure, Dubrawski and Schneider, "A Study Into Detection of Bio-Events in Multiple Streams of Surveillance Data," see reference 5.
  8. Walter W. Piegorsch and A. John Bailer, "Combining Information," Wiley Interdisciplinary Reviews: Computational Statistics, 2009, Vol. 1, No. 3, pp. 354-360.

Connie M. Borror is a professor in the division of mathematical and natural sciences at Arizona State University West in Glendale. She earned her doctorate in industrial engineering from Arizona State University in Tempe. She is a fellow of ASQ and the American Statistical Association, and is the editor of Quality Engineering.
