A Different Take
Permutation tests and Monte Carlo simulations to demonstrate p-values
by Connie M. Borror
I still find p-values to be one of the more difficult topics to explain. A rough definition of a p-value is the probability of obtaining a value of the test statistic as large as or larger than the value you obtained from your experiment.
In other words, if we could repeat the experiment several times under identical conditions, how likely is it that we would see that same value of the test statistic or something larger than that value? Just introducing p-values in this way to my students—after hammering away at them with the idea of null and alternative hypotheses—results in glazed and faraway looks.
I have found implementing permutation tests and simple Monte Carlo simulations to demonstrate p-values can be helpful. Let’s look at this using an example.
Suppose we work with an organization that manufactures biodegradable trash bags for daily household use. There are several components (factors) needed to manufacture the bags. Let’s concentrate on just one particular raw material used in the production of these bags. The manufacturer currently purchases this raw material from supplier 1 (S1).
The organization is considering adding a second supplier (S2) of the raw material to improve the reliability of its supply chain. Initial studies have determined that the chemical properties of the raw material from the two suppliers appear to be the same. S2 will be added, however, only if it can be shown that bags produced using its raw material are at least as strong as the bags the organization currently manufactures.
The practical question is: Are bags made with raw material from the prospective supplier (S2) as strong or stronger than bags made with raw material from the current supplier (S1)?
Twenty bags were produced: 10 using S1’s raw material and 10 using S2’s raw material. Tensile strength (in pascals, Pa) of the bag is the measurement of interest. Column 2 in Table 1 shows the results of the study. Figure 1 displays a dot plot of tensile strength by supplier.
From Figure 1, it appears that the bags from S2 are somewhat stronger than those from S1, but it is not obvious whether this is a significant increase or whether the differences in tensile strength are due to chance. To determine whether this perceived difference is statistically significant, we want to carry out some significance tests.
The organization will only purchase raw material from S2 if the average tensile strength of its trash bags is the same or better than the average tensile strength of trash bags made with raw material from S1 (that is, average strength of S2 > average strength of S1, which can be rewritten as S2 – S1 > 0).
The common analysis of results from a two-sample test such as this involves comparing the average tensile strengths. We state the null hypothesis as H0: μ1 - μ2 = 0 (or μ2 - μ1 = 0) and the alternative hypothesis as H1: μ1 - μ2 < 0 (or μ2 - μ1 > 0), in which μ1 represents the true population mean tensile strength for S1 and μ2 represents the true population mean tensile strength for S2.
Note that the alternative hypothesis represents the case in which the mean tensile strength from S2 exceeds that of S1, which, if supported, represents a change in business: adding S2 to the supply chain.
After the purpose of the test has been identified, we would calculate a test statistic (often based on Student’s t-distribution), calculate a p-value and finally decide whether we can reject the null hypothesis in favor of the alternative (that is, if the p-value is small, reject the null).
The standard two-sample t-test as identified here is perfectly valid. When teaching statistics, however, I see at least a portion of the class misunderstands or misinterprets the p-value. I have found applying the ideas of randomization tests, permutation tests or Monte Carlo simulation often helps significantly in trying to unravel the mystery of the p-value.
Sometimes, just the idea behind these approaches will aid in the understanding and interpretation without having to rely on specific distributions or standard test statistics. Using only the difference of averages, random assignment of observations and repetition can significantly improve understanding.
Difference of averages as the test statistic
For this example, the average tensile strength for 10 bags from S1 is 24.549 Pa (sample average for S1) and the average tensile strength for 10 bags from S2 is 26.088 Pa (sample average for S2). The average difference is then 26.088 – 24.549 = 1.539 Pa (subtracting S1 from S2, which is appropriate if we write our null and alternative hypotheses using μ2 - μ1).
Is 1.539 a significant difference? Is this difference due to different suppliers or is this difference due to chance? We can use our raw data again in Table 1 to answer these questions.
The idea here is as follows: If there is truly no significant difference in suppliers, the 20 tensile strengths observed in column 2 of Table 1 could have come from either S1 or S2. That is, we could take all 20 observations, randomly assign 10 of them to S1 and the other 10 to S2, and the average difference in tensile strength should be approximately the same as our original difference: 1.539.
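This pooling-and-reassignment step can be sketched in a few lines of Python. The tensile-strength lists below are hypothetical stand-ins, because Table 1’s actual values are not reproduced here:

```python
import random

# Hypothetical tensile strengths (Pa); substitute the Table 1 values.
s1 = [24.1, 25.0, 23.8, 24.9, 25.2, 24.3, 24.6, 24.0, 25.1, 24.5]
s2 = [26.2, 25.8, 26.5, 25.9, 26.1, 26.4, 25.7, 26.3, 26.0, 25.9]

observed_diff = sum(s2) / 10 - sum(s1) / 10

# One random reassignment: pool all 20 strengths, shuffle,
# then split them back into two groups of 10.
pooled = s1 + s2
random.shuffle(pooled)
new_s1, new_s2 = pooled[:10], pooled[10:]
reassigned_diff = sum(new_s2) / 10 - sum(new_s1) / 10

print(observed_diff, reassigned_diff)
```

Running this repeatedly mimics the by-hand reassignments in Table 1: each shuffle produces a new difference of averages computed from the same 20 observations.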
For example, one possible arrangement of all 20 observations is given in column 3 of Table 1 (labeled assignment 1) where the tensile strengths in bold for S1 were originally tensile strengths from S2, and the tensile strengths in bold for S2 were originally from S1.
For this new arrangement, the average tensile strengths for S1 and S2 are now 25.031 Pa and 25.605 Pa, respectively, resulting in a new average difference of 25.605 – 25.031 = 0.574. Is an average difference of 0.574 close to the original difference of 1.539? What if we rearranged the observations again?
Suppose we randomly assigned the 20 original observations to S1 and S2 repeatedly, and calculated the average difference each time. For example, assignments 2-5 in Table 1 are, again, all 20 original observations, but now randomly assigned to the two suppliers. The average tensile strengths for each supplier are provided in Table 1, along with the difference in the average tensile strengths. The differences in average tensile strength displayed in Table 1 for the random assignments are quite different from the actual difference of 1.539.
How likely is a difference of 1.539 if there really is no difference between the two suppliers? So far, with a limited number of random assignments, such a difference does not appear highly likely if the two suppliers really do produce trash bags with similar tensile strengths. The dot diagram in Figure 2 shows the average differences for the actual experiment and for all five reassignments noted in Table 1.
What about the p-value for this problem? We are going to represent the p-value as the proportion of differences that are equal to or greater than our 1.539. That is, we count the number of times a reassigned difference is greater than or equal to 1.539 and divide that total by the number of times we reassigned the 20 observations. In this case, the p-value would be zero because zero of the five differences from random assignments were greater than or equal to our actual difference of 1.539. Of course, we would not want to base our decision on just five differences. We would want many more.
The tensile strengths could be randomly assigned to either S1 or S2 in many different ways. In fact, the total possible number of combinations of these 20 observations being assigned to either S1 or S2 in groups of 10 is 184,756. If we randomly assigned the 20 tensile strengths to the suppliers (10 each) repeatedly, plotting the average differences each time, we would get a histogram similar to that in Figure 3.
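The count of 184,756 is the binomial coefficient C(20, 10): the number of ways to choose which 10 of the 20 observations are labeled S1 (the remaining 10 go to S2). It can be verified directly, for example in Python:

```python
import math

# Number of ways to split 20 observations into two labeled groups of 10:
# choose the 10 assigned to S1; the rest are assigned to S2.
print(math.comb(20, 10))  # → 184756
```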
Figure 3 actually displays the distribution of 100,000 average differences (S2 – S1) of randomly assigned tensile strengths to S1 and S2. The vertical red dotted line represents the actual average difference of 1.539 Pa found from the original data in Table 1. From Figure 3, we see that the proportion of differences that are greater than or equal to 1.539 is very small. In fact, of 100,000 simulated permutations, only 19 of them resulted in differences that were as large as or larger than 1.539.
The proportion of differences in the tail area is 19/100,000 = 0.00019, which represents the p-value for this test. This small p-value indicates that the observed average difference of 1.539 would be highly unusual if the two suppliers were equivalent, and that the tensile strengths for the two suppliers are significantly different.
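The whole simulation—repeatedly reassigning the observations at random and counting how often the reassigned difference meets or exceeds the observed one—is a short loop. Again, the data lists below are hypothetical placeholders for the Table 1 values:

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical tensile strengths (Pa); substitute the Table 1 values.
s1 = [24.1, 25.0, 23.8, 24.9, 25.2, 24.3, 24.6, 24.0, 25.1, 24.5]
s2 = [26.2, 25.8, 26.5, 25.9, 26.1, 26.4, 25.7, 26.3, 26.0, 25.9]

observed = sum(s2) / 10 - sum(s1) / 10
pooled = s1 + s2

n_reps = 100_000
extreme = 0
for _ in range(n_reps):
    random.shuffle(pooled)
    diff = sum(pooled[10:]) / 10 - sum(pooled[:10]) / 10
    if diff >= observed:
        extreme += 1

# Monte Carlo p-value: proportion of reassigned differences >= observed.
p_value = extreme / n_reps
print(p_value)
```

Because this samples reassignments at random rather than enumerating all 184,756 possible splits, the p-value it reports is an estimate that varies slightly from run to run.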
Furthermore, because the difference between the average for S2 and the average for S1 (S2 – S1) is positive (1.539), we could conclude the average tensile strength of bags from S2 is statistically significantly higher than the average tensile strength of bags from S1. Based on this test, the organization could add S2 to its supply chain and increase its chances of always having the much-needed raw material available without sacrificing the strength of the bag.
The illustration given here works really well in a classroom setting when students can participate in the randomization and reassignment of the original observations and obtain their own proportions (that is, estimates of p-values).
The activity is useful for understanding p-values and is not a demonstration on how to conduct a test of hypothesis by hand. Fortunately, there are statistical packages and code already written for randomization tests or permutation tests using a Monte Carlo simulation so that the resulting p-value is calculated automatically, but it’s still helpful to understand the basic ideas presented in this column.1
1. T. Lynn Eudey, Joshua D. Kerr and Bruce E. Trumbo, "Using R to Simulate Permutation Distributions for Some Elementary Experimental Designs," Journal of Statistics Education, Vol. 18, No. 1, 2010, pp. 1-30, www.amstat.org/publications/jse/v18n1/eudey.pdf.
Connie M. Borror is a professor in the school of mathematical and natural sciences at Arizona State University West in Glendale. She earned her doctorate in industrial engineering from Arizona State University in Phoenix. Borror is a fellow of both ASQ and the American Statistical Association and past editor of Quality Engineering.