Multiple Comparisons and ANOVA
Researchers use comparisons to test hypotheses that are not addressed by a standard, omnibus analysis of variance. Often, there are multiple hypotheses. As a result, the analysis of data from a single experiment can involve statistical tests of multiple comparisons.
This lesson describes two ways that attributes of multiple comparisons affect the analysis of experimental data:
- They affect the probability of making a Type I error.
- They influence the methods used to test hypotheses.
Prerequisites: This lesson assumes familiarity with comparisons and orthogonal comparisons. You should be able to distinguish an ordinary comparison from an orthogonal comparison. You should know the difference between a planned comparison and a post hoc comparison. And you should know how to represent a statistical hypothesis mathematically by a comparison. If these things are new to you, you may want to review the lessons on those topics first.
Error Rate
Error rate refers to the probability of making a Type I error, that is, rejecting the null hypothesis when it is true. When an experiment tests multiple comparisons, researchers need to be aware of two types of error rates:
- Error rate per comparison. When a null hypothesis is represented by a single comparison, error rate per comparison is the probability of making a Type I error when testing the comparison.
- Error rate familywise. A family of comparisons refers to all of the comparisons defined for a single treatment in an experiment. Error rate familywise is the probability of making at least one Type I error when testing the family of comparisons.
Both error rates are controlled by the experimenter. Error rate per comparison is determined by the significance level (α) specified for individual hypothesis tests. And error rate familywise is determined by the significance level and the number of comparisons in the family.
When comparisons in the family are orthogonal, the probability of incorrectly rejecting at least one null hypothesis is easily calculated as:
$$\mathrm{ERF} = 1 - (1 - \alpha)^{C}$$
where ERF is the probability of making at least one Type I error (i.e., the error rate familywise), α is the significance level for a single hypothesis test, and C is the number of orthogonal comparisons being tested.
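As a quick worked instance of the formula, with C read as an exponent, and assuming α = 0.05 and C = 3 orthogonal comparisons (values chosen here only for illustration):

$$\mathrm{ERF} = 1 - (1 - 0.05)^{3} = 1 - 0.857375 = 0.142625 \approx 0.143$$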
The table below shows how the probability of making at least one Type I error grows as the number of orthogonal comparisons increases, assuming the significance level for each hypothesis test is 0.05.
| Comparisons | ERF | 
|---|---|
| 2 | 0.098 | 
| 3 | 0.143 | 
| 4 | 0.185 | 
| 5 | 0.226 | 
| 6 | 0.265 | 
| 7 | 0.302 | 
| 8 | 0.337 | 
| 9 | 0.370 | 
| 10 | 0.401 | 
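The values in a table like this can be recomputed directly from the formula for error rate familywise. The following is a minimal Python sketch; the function name `familywise_error_rate` and the dictionary `erf_table` are hypothetical names chosen for illustration, and the calculation applies only to orthogonal comparisons.

```python
def familywise_error_rate(alpha: float, comparisons: int) -> float:
    """Probability of at least one Type I error across `comparisons`
    orthogonal comparisons, each tested at significance level `alpha`."""
    return 1 - (1 - alpha) ** comparisons


# Recompute ERF for 2 through 10 orthogonal comparisons at alpha = 0.05,
# rounded to three decimal places.
erf_table = {c: round(familywise_error_rate(0.05, c), 3) for c in range(2, 11)}

for c, erf in erf_table.items():
    print(f"{c:>2} comparisons: ERF = {erf:.3f}")
```

Changing the first argument shows how a stricter or looser per-comparison significance level shifts the whole column.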
When the hypotheses being tested are represented by nonorthogonal comparisons, the probability of making a Type I error is hard to compute. But the trend for nonorthogonal comparisons is the same as the trend for orthogonal comparisons. The more hypotheses you test, the more likely it is that you will reject at least one hypothesis that should not be rejected.
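Although there is no simple formula for nonorthogonal comparisons, the error rate familywise can be estimated by simulation. The sketch below is a simplified illustration, not the F ratio procedure used in this lesson: it assumes four groups whose population means are truly equal, treats the standard error of each sample mean as known (so a z test can stand in for the F test), and tests all six pairwise differences, which cannot all be mutually orthogonal. The function name, the group count, the number of trials, and the seed are all hypothetical choices.

```python
import itertools
import random
from statistics import NormalDist


def simulate_familywise_rate(groups=4, alpha=0.05, trials=20000, seed=1):
    """Estimate the chance of at least one false rejection across all
    pairwise comparisons when every population mean is truly equal."""
    rng = random.Random(seed)
    # Two-sided critical value for a z test at level alpha.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    pairs = list(itertools.combinations(range(groups), 2))
    false_alarm_trials = 0
    for _ in range(trials):
        # Each sample mean is drawn with standard error 1, so the
        # difference between two means has standard error sqrt(2).
        means = [rng.gauss(0.0, 1.0) for _ in range(groups)]
        if any(abs(means[i] - means[j]) / 2 ** 0.5 > z_crit for i, j in pairs):
            false_alarm_trials += 1
    return false_alarm_trials / trials


estimated = simulate_familywise_rate()
print(f"Estimated familywise error rate: {estimated:.3f}")
```

Under these assumptions the estimate should land well above the per-comparison level of 0.05, but below the 0.265 that the orthogonal formula gives for six comparisons, which is consistent with the trend described above.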
Note: The best way to control Type I errors in your experiment is to limit the number of hypotheses you test. The fewer hypotheses you test, the less likely you will incur a Type I error. Plan to test only hypotheses relevant to your most important research question(s). Resist the urge to test every possible comparison between mean scores.
Choose the Right Technique
Regarding analytical techniques, there's good news and bad news. First, the good news. There are many ways to control Type I errors when the analysis involves multiple comparisons. Now, the bad news. Statisticians do not agree on which option is best, so you will see different recommendations in different textbooks.
In the next three lessons, we will describe three analytical techniques that are commonly used in different situations to test the statistical significance of multiple comparisons:
- F ratio tests
- Bonferroni's correction
- Scheffé's test
When it comes to controlling error rate, each technique has advantages and disadvantages. F ratio tests are designed for planned comparisons. Bonferroni's correction and Scheffé's test can be used for planned comparisons and for post hoc testing. At the level of the individual comparison, the F ratio is most sensitive (most likely to detect significant differences between mean scores); but when the experiment includes many comparisons, error rate familywise can be unacceptably high. Bonferroni's correction does a better job of controlling error rate familywise, but when the experiment includes many comparisons, the probability of Type II errors can be unacceptably high. To lower the risk of Type II errors when there are many comparisons, researchers sometimes turn to Scheffé's test.
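To make the trade-off concrete, the sketch below compares error rate familywise with and without Bonferroni's correction. It assumes the standard form of the correction, in which each comparison is tested at the familywise significance level divided by the number of comparisons, and it uses the orthogonal-comparison formula from earlier in this lesson. The function names are hypothetical.

```python
def bonferroni_alpha(alpha_family: float, comparisons: int) -> float:
    """Per-comparison significance level under Bonferroni's correction."""
    return alpha_family / comparisons


def familywise_error_rate(alpha: float, comparisons: int) -> float:
    """ERF for orthogonal comparisons, each tested at level `alpha`."""
    return 1 - (1 - alpha) ** comparisons


for c in (2, 5, 10):
    unadjusted = familywise_error_rate(0.05, c)
    adjusted = familywise_error_rate(bonferroni_alpha(0.05, c), c)
    print(f"C={c:>2}: unadjusted ERF={unadjusted:.3f}, Bonferroni ERF={adjusted:.3f}")
```

The adjusted error rate familywise stays at or below 0.05, but only because each individual test uses a much stricter significance level. That stricter level is what raises the risk of Type II errors as the number of comparisons grows.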
Which option should you choose? It depends on the situation, but your choice should take into account answers to the following three questions:
- Are all of the comparisons planned (rather than post hoc)?
- Are all of the comparisons orthogonal?
- How many comparisons are you testing?
In the end, the analysis method you choose will reflect which you care more about controlling: error rate per comparison or error rate familywise.
Error Rate Per Comparison
Suppose an experimenter were most concerned with controlling error rate per comparison. Here is a flowchart that illustrates how such an experimenter might choose among the various options, based on answers to the three questions.
 
         
In this decision tree, F ratio tests are used when the experiment calls for up to five planned orthogonal comparisons. This allows the experimenter to capitalize on the increased sensitivity provided by unadjusted F tests. Bonferroni tests are used when the experiment calls for six to ten comparisons, regardless of whether the comparisons are planned or unplanned. This allows the experimenter to control the familywise error rate, even when the experiment calls for as many as ten hypothesis tests. Scheffé's test is used when the experiment calls for more than ten comparisons, regardless of whether the comparisons are planned or unplanned. This allows the experimenter to control for Type I errors, yet keep the risk of Type II errors at a manageable level.
Error Rate Familywise
Suppose an experimenter were most concerned with controlling error rate familywise. Here is a flowchart that illustrates how such an experimenter might choose among the various options.
 
         
In this decision tree, Bonferroni tests are used when the experiment calls for testing up to ten comparisons; Scheffé's test, when the experiment calls for testing more than ten planned comparisons. Unadjusted F ratio tests, which provide no control for error rate familywise, are not used at all.
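The two decision trees described in this section can be summarized as a small function. This is only a sketch of the rules of thumb as they are worded in the text: the function name `choose_technique` and the `priority` labels are hypothetical, the Bonferroni and Scheffé branches are assumed to depend only on the number of comparisons, and the function returns `None` for a case the text does not spell out (five or fewer comparisons that are post hoc or nonorthogonal, when error rate per comparison is the priority).

```python
from typing import Optional


def choose_technique(priority: str, n_comparisons: int,
                     planned: bool = True,
                     orthogonal: bool = True) -> Optional[str]:
    """Pick a technique using the rules of thumb described above.

    `priority` is "per_comparison" or "familywise" (hypothetical labels).
    Returns None where the description does not name a technique.
    """
    if priority not in ("per_comparison", "familywise"):
        raise ValueError("priority must be 'per_comparison' or 'familywise'")
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")

    # Both trees agree once the family is larger than five comparisons.
    if n_comparisons > 10:
        return "Scheffé's test"
    if n_comparisons >= 6 or priority == "familywise":
        return "Bonferroni's correction"

    # Up to five comparisons, with error rate per comparison as the priority.
    if planned and orthogonal:
        return "F ratio test"
    return None  # Branch not covered by the description above.


print(choose_technique("per_comparison", 4))
print(choose_technique("familywise", 4))
print(choose_technique("per_comparison", 12, planned=False))
```

Writing the rules this way makes it easy to see that the two trees differ only for small families of planned orthogonal comparisons.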
Note: The decision rules depicted in each flowchart represent reasonable rules of thumb, based on the type of error rate that the experimenter wants to control; but don't be surprised if you see different guidance elsewhere. As we mentioned earlier, statisticians often disagree about which analytical technique is best.