Different, Equivalent or Both?
by I. Elaine Allen and Christopher A. Seaman
In conventional tests of statistical significance, the investigator tries to reject the null hypothesis of no difference between two groups. If the investigator fails, then the hypothesis that the groups are the same cannot be rejected.
In some cases we wish to do what conventional difference testing says is impossible: to prove the null hypothesis of no difference. The question then follows: “Are the two things alike in their effect, and if so, how much alike?”
One method, known as equivalence analysis, is widely used in psychological research and evaluation.1,2,3 Equivalence analysis is based on the method of bioequivalence testing, used by the Food and Drug Administration (FDA) and the pharmaceutical industry to determine whether to accept a new drug as an alternative to the drug previously approved. For this, equivalence analysis can estimate whether any difference between two groups is small enough to warrant considering the results as equivalent for clinical or policy purposes.
Equivalence Analysis Overview
Unlike traditional hypothesis testing, equivalence analysis reverses the specification of the null and alternative hypotheses.4 In most statistical testing for significant differences, the null hypothesis says differences among group means are zero. The alternative hypothesis says these differences are not zero.
In equivalence testing, the null hypothesis states that the difference among group means is greater than some minimal difference representing practical equivalence. The alternative hypothesis is that the difference is not greater than this specified minimum difference. In the analysis of differences among groups, this step allows the researcher to estimate whether identified significant differences are meaningful differences for decision makers or policy makers. This type of analysis is similar to analysis for designing noninferiority studies, but it makes use of two-sided, rather than one-sided, hypothesis testing.
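For two groups, the reversed hypotheses can be written out as follows. The symbols are notation assumed here for illustration and do not come from the article: \mu_T and \mu_C are the test and control group means, and \Delta > 0 is the chosen equivalence boundary.

```latex
% Conventional difference test
H_0:\ \mu_T - \mu_C = 0
\qquad \text{vs.} \qquad
H_1:\ \mu_T - \mu_C \neq 0

% Equivalence test (roles reversed)
H_0:\ |\mu_T - \mu_C| \ge \Delta
\qquad \text{vs.} \qquad
H_1:\ |\mu_T - \mu_C| < \Delta

% The equivalence null is commonly split into two one-sided nulls,
% each tested at level \alpha:
H_{01}:\ \mu_T - \mu_C \le -\Delta
\qquad \text{and} \qquad
H_{02}:\ \mu_T - \mu_C \ge \Delta
```

Rejecting both one-sided nulls supports the claim that the true difference lies inside (-\Delta, \Delta). This is the sense in which the test is two-sided.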
Equivalence analysis also makes it possible to determine whether nonstatistically significant differences may be the consequence of small sample sizes and large variability rather than actual equivalence between two programs or systems.
Equivalence analysis can be considered the complementary test to null hypothesis testing, but it requires additional information about the effect size of the outcome to describe the variation between the groups. Using this analysis, researchers establish equivalence boundaries for the effect size and determine equivalence or nonequivalence using confidence bounds.
The conventional standard for bioequivalence of drugs is that the group mean for the test drug on some outcome (for example, blood plasma uptake) is within a small enough range of the group mean for the previously established drug (the control group) that the difference is not considered substantively or clinically important.
The range used for the equivalence comparison varies by case. The best accepted method is to use a percentage of the mean of the control group: A 5% difference between group means might be a conservative standard for establishing equivalence, while a 20% difference between group means might be considered liberal.
Examining Difference And Equivalence
Difference testing and equivalence analysis are not mutually exclusive. Performed together, these activities yield four possibilities:
- Different and nonequivalent (D/NE): There is a difference sufficient to have substantive relevance.
- Different but equivalent (D/E): There is a difference, but it is trivial. For example, the study is overpowered.
- Not different and equivalent (ND/E): The two conditions are indistinguishable.
- Not different but also not equivalent (ND/NE): The variability is too great relative to the effect size to interpret. For example, the study is underpowered.
The four possibilities are illustrated in Figure 1 (p. 77). Confidence intervals of the test group are shown as vertical bars. The entire length of the vertical bar represents the 95% confidence interval used in the t-test to determine whether the groups are significantly different. If 0 is within this confidence interval, then the two groups are not statistically different (ND). If 0 is not contained, then the groups are statistically different (D).
To perform a 10% equivalence comparison, we use a 90% confidence interval—shown as hash marks on the 95% confidence intervals. If the entirety of the 90% confidence interval is bound within the critical values, then the two groups are statistically equivalent (E). If any part of the 90% confidence interval lies on or outside of that boundary, then the two groups are not statistically equivalent (NE).
It is important to note that the power of the test may greatly affect the outcome of the analysis. In Figure 1, differences in power are most easily pictured as an extremely large or extremely small sample size or, alternatively, as small or large variability in the groups. For example, for D/E, an extremely large sample size effectively shrinks the test group’s confidence interval. In ND/NE, an extremely small sample size expands the test group’s confidence interval. In this figure, the critical boundaries (+ and –) indicate equivalence levels of 5% of the mean.
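The decision rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' procedure. It assumes equal group sizes, a common standard deviation and a large-sample normal approximation in place of the t distribution. The function name compare_groups, its parameters and all input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def compare_groups(mean_t, mean_c, sd, n, margin_pct):
    """Classify test vs. control as D/NE, D/E, ND/E or ND/NE.

    Assumptions (not from the article): equal group sizes n, a common
    standard deviation sd, and normal rather than t critical values.
    """
    diff = mean_t - mean_c
    se = sd * sqrt(2.0 / n)                     # standard error of the difference
    margin = margin_pct / 100.0 * abs(mean_c)   # boundary as a % of the control mean
    z95 = NormalDist().inv_cdf(0.975)           # two-sided 95% critical value
    z90 = NormalDist().inv_cdf(0.95)            # two-sided 90% critical value
    ci95 = (diff - z95 * se, diff + z95 * se)
    ci90 = (diff - z90 * se, diff + z90 * se)
    # Different: the 95% interval does not contain zero.
    different = not (ci95[0] <= 0.0 <= ci95[1])
    # Equivalent: the whole 90% interval lies strictly inside the boundaries.
    equivalent = -margin < ci90[0] and ci90[1] < margin
    return ("D" if different else "ND") + "/" + ("E" if equivalent else "NE")


# Hypothetical inputs: control mean 100, sd 10, 5% equivalence boundary.
print(compare_groups(110.0, 100.0, 10.0, 200, 5))    # D/NE
print(compare_groups(102.0, 100.0, 10.0, 1000, 5))   # D/E  (very large sample)
print(compare_groups(100.5, 100.0, 10.0, 200, 5))    # ND/E
print(compare_groups(101.0, 100.0, 10.0, 20, 5))     # ND/NE (very small sample)
```

In the second and fourth calls the observed difference is small, and only the sample size moves the result between D/E and ND/NE. This is the effect of power described above.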
When using equivalence analysis for comparing program or system outcomes, you must address these two issues:
- Much greater power (or sample size or reduced variance) is required to determine equivalence than to determine a significant difference.
- The determination of equivalence boundaries should not be done subjectively. These can be set by predetermined specification guidelines for a process or system or can involve data from past analyses of the process.
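To give a rough sense of the first issue, the sketch below compares approximate per-group sample sizes for the two kinds of test. The formulas are standard normal-approximation results brought in for illustration and do not come from the article. They assume equal group sizes, a known common standard deviation and, for the equivalence case, a true difference of zero. The function names and input values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_difference(sd, delta, alpha=0.05, power=0.80):
    """Approximate per-group n to detect a true difference delta (two-sided test)."""
    z = NormalDist().inv_cdf
    return ceil(2 * (sd * (z(1 - alpha / 2) + z(power)) / delta) ** 2)


def n_equivalence(sd, margin, alpha=0.05, power=0.80):
    """Approximate per-group n to show equivalence within +/- margin,
    assuming the true difference is exactly zero."""
    z = NormalDist().inv_cdf
    beta = 1 - power
    return ceil(2 * (sd * (z(1 - alpha) + z(1 - beta / 2)) / margin) ** 2)


# Hypothetical inputs: sd = 10 on an outcome with a control mean near 100.
print(n_difference(10.0, 10.0))   # detect a difference of 10 units
print(n_equivalence(10.0, 5.0))   # show equivalence within 5 units (5% of the mean)
print(n_equivalence(10.0, 2.0))   # a tighter boundary needs far more data
```

Under these assumptions the required sample size grows with the inverse square of the boundary, so a boundary narrower than the differences a study was designed to detect quickly drives up the sample size.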
With these nontrivial issues resolved, and using a format such as Figure 1, you may find equivalence analysis allows you to compare processes or systems and present the results in a way audiences would find easy to grasp, intuitively meaningful and useful in practice.
An Example in Health Policy
A five-site study assessed the use and quality of services, satisfaction, symptoms and the functioning of adults with serious mental illness in Medicaid managed behavioral health programs compared to fee-for-service systems.5
The reviewers used hypothesis tests and equivalence analysis methods to examine the overall consistency of quality of services among the sites. Prior to comparing site outcomes, the groups were standardized by adjusting for case mix differences, such as gender and diagnosis, in a multiple regression.
Next, the difference between the effect of managed care and fee-for-service systems was synthesized over sites using meta-analytic techniques. This allowed the reviewers to examine whether the differences between the managed care group and the fee-for-service group were statistically significant in quality and satisfaction outcomes.
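The article does not say which meta-analytic technique was used. As one common possibility, the sketch below pools per-site differences with fixed-effect inverse-variance weighting. The function name and the five site values are hypothetical and are not results from the study.

```python
from math import sqrt


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted mean of per-site differences and its standard error.

    A fixed-effect model is assumed here; the study may have used another method.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, estimates)) / total
    return pooled, sqrt(1.0 / total)


# Hypothetical managed care minus fee-for-service differences at five sites.
site_diffs = [0.8, 1.5, -0.2, 1.1, 0.6]
site_ses = [0.5, 0.7, 0.6, 0.4, 0.9]
pooled_diff, pooled_se = pool_fixed_effect(site_diffs, site_ses)
print(pooled_diff, pooled_se)
```

The pooled difference and its standard error can then feed both the significance test and the equivalence comparison. A random-effects model would be the usual alternative if the sites were expected to differ in their true effects.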
Finally, reviewers applied equivalence analysis to determine whether the differences between groups (in addition to being statistically significant or not) also reached a threshold for policy relevance. Because no predetermined guideline existed for relevance, equivalences of 5%, 10% and 20% were used for this analysis.
The results presented in Table 1 show which differences were significant and which differences were equivalent based on varying thresholds. Among the findings, all four combinations of significant difference and equivalence are represented. The ND/E and D/NE findings are unsurprising: in each, the difference test and the equivalence analysis point to a consistent conclusion. The D/E finding implies that while the two groups are significantly different, this is not an important or relevant difference. The ND/NE finding may indicate the data are too variable or the sample size is too small.
References
1. J.L. Rogers, K.I. Howard and J.T. Vessey, “Using Significance Tests To Evaluate Equivalence Between Two Experimental Groups,” Psychological Bulletin, Vol. 113, 1993, pp. 553-565.
2. B.L. Stegner, A.G. Bostrom, T.X. Greenfield and C.J. Secombes, “Equivalence Testing for Use in Psychosocial and Services Research: An Introduction With Examples,” Evaluation and Program Planning, Vol. 19, No. 3, 1996, pp. 193-198, and Vol. 19, No. 4, 1996, pp. 533-540.
3. William A. Hargreaves, Martha Shumway, Teh-Wei Hu and Brian Cuffel, Cost-Outcome Methods for Mental Health, Academic Press, 1998.
4. Hargreaves, Cost-Outcome Methods for Mental Health, see reference 3.
5. H.S. Leff, D.A. Wieman, B.H. McFarland, J.P. Morrissey, A.B. Rothbard, D.L. Shern, A.M. Wylie, R.A. Boothroyd, T.S. Stroup and I. Elaine Allen, “Assessment of Medicaid Managed Care Behavioral Health for Persons With Serious Mental Illness,” Psychiatric Services, Vol. 56, No. 10, 2005, pp. 1,245-1,254.
I. ELAINE ALLEN is professor of statistics and entrepreneurship at Babson College in Wellesley, MA. She earned a doctorate in statistics from Cornell University in Ithaca, NY, and is a member of ASQ.
CHRISTOPHER A. SEAMAN is a doctoral student in mathematics at the Graduate Center of City University of New York.