7.3 Multiple comparisons

On this page, we discuss what happens after a significant ANOVA result, look at the multiple comparisons problem in more detail, and cover how it can be accounted for.

7.3.1 Post-hoc tests

The ANOVA table on the previous page allows us to calculate by hand whether there is a significant effect of our IV. However, it doesn’t tell us where that difference lies. Remember, we’re comparing at least three groups with each other - so we need some way of figuring out where the actual significant differences between means are.

Enter post-hoc comparisons (post-hoc = after the fact). These are essentially a series of t-tests where we compare each group with each other. So, if we have groups A, B and C, and we get a significant omnibus ANOVA that tells us there is a significant difference somewhere in those means, we would run post-hoc comparisons by running individual t-tests on A vs B, then B vs C, then A vs C.

However, we can’t just do this blindly, and there’s a good reason why…

7.3.2 The multiple comparisons problem

In Module 6, we talked a fair bit about Type I and II error rates - i.e., the rates of making a false positive and a false negative judgement, respectively. Here, we’ll focus on Type I error.

Recall that whenever we do a hypothesis test, we set a value for alpha, which forms our significance level. Alpha is essentially the probability of a Type I error that we are willing to accept in a given test. In other words, when we use the conventional alpha = .05 as our criterion for significance, we are saying that we’re willing to accept a Type I error occurring 5% of the time - or 1 in 20.

This becomes problematic when we have to conduct multiple tests at once, as we do when following up an ANOVA. For example, let’s say a variable has 5 levels - A, B, C, D and E. If we ran an ANOVA on this data and found a significant omnibus effect, we would want to find out where that effect is. That means comparing A, B, C, D and E against each other in a round-robin tournament way (A vs B, then A vs C…), which works out to 10 pairwise comparisons. The issue is that each comparison carries its own 5% Type I error rate (assuming we use alpha = .05). So in this scenario, the Type I error rates compound with each new comparison, meaning that our overall Type I error rate climbs higher and higher. The chance of finding at least one false positive therefore increases with every extra comparison (see graph below).

This overall rate of Type I errors is called the family-wise error rate (FWER). ‘Family’ in this context refers to a family of comparisons - like the set of pairwise comparisons among A to E that we saw above.
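To see how quickly the family-wise error rate grows, we can compute it directly. If we assume the comparisons are independent (a simplifying assumption), the FWER for m comparisons is \(1 - (1 - \alpha)^m\). The short R sketch below is purely illustrative:

# Number of pairwise comparisons among k = 5 groups
k <- 5
m <- choose(k, 2)   # 10 comparisons

alpha <- 0.05

# Family-wise error rate, assuming independent comparisons
1 - (1 - alpha)^m   # ~0.40: roughly a 40% chance of at least one false positive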

7.3.3 Correction methods

One way of dealing with this is to correct the p-values to set the family-wise error rate back to 5%. We’ll go through three main types (two of which are already covered in the seminar).

Tukey’s HSD

Tukey’s Honest Significant Difference test (Tukey HSD for short) is designed specifically for post-hoc tests after ANOVAs. It works by using the distribution of the largest possible difference between group means (the Studentized range) to calculate a single critical mean difference. Any observed difference between two group means that is greater than this critical threshold is significant.

How does the Tukey test work?

Tukey’s HSD works very similarly in principle to a t-test, although the exact maths involved are slightly different. The test statistic for a Tukey test is called q, and it follows the Studentized range distribution (a specific distribution). For a given comparison between two group means, the formula for q is:

\[ \Large q = \frac{|M_1 - M_2|}{SE} \]

Where \(M_1\) and \(M_2\) denote the means of groups 1 and 2, and SE denotes the pooled standard error, calculated from the ANOVA’s error mean square (with equal group sizes, \(SE = \sqrt{MS_{error}/n}\)). The q-statistic is then compared against the Studentized range distribution to estimate a p-value for that comparison. The shape of the Studentized range distribution is determined by k (the number of groups being compared) and df (in this case, the **error degrees of freedom**); the critical value of q also depends on alpha.
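To make this concrete, base R provides qtukey() and ptukey() for the Studentized range distribution. The sketch below uses made-up numbers (k = 3 groups of n = 20, plus a hypothetical error mean square and group means); none of them come from a real dataset:

k  <- 3                  # number of groups
n  <- 20                 # observations per group (equal group sizes assumed)
df_error <- k * (n - 1)  # error degrees of freedom
ms_error <- 2.5          # hypothetical error mean square from the ANOVA table

se <- sqrt(ms_error / n) # pooled standard error

# Critical mean difference (the "honest significant difference") at alpha = .05
qtukey(0.95, nmeans = k, df = df_error) * se

# p-value for an observed difference between two (hypothetical) group means
q_obs <- abs(10.2 - 11.4) / se
ptukey(q_obs, nmeans = k, df = df_error, lower.tail = FALSE)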

Bonferroni

The Bonferroni correction is a method that can be used both within ANOVAs and in more general contexts. The Bonferroni correction works by multiplying each p-value in a set of comparisons by the number of comparisons (i.e. the number of p-values).

For example, if a set of comparisons gives the following p-values: 0.05, 0.25 and 0.10, we would apply a Bonferroni correction by multiplying each p-value by 3 (as there are 3 p-values) to get 0.15, 0.75 and 0.30 respectively.

(Note: in reality, Bonferroni works by dividing alpha by the number of comparisons, i.e. 0.05/3 - but for the purpose of significance testing, the maths works out to be the same).
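In R, the same correction is available through the built-in p.adjust() function (adjusted p-values are capped at 1). Using the p-values from the example above:

p_vals <- c(0.05, 0.25, 0.10)
p.adjust(p_vals, method = "bonferroni")
## [1] 0.15 0.75 0.30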

Holm

The Holm correction is another correction method. In this method, the p-values are first sorted from smallest to largest, and then ranked in descending order, so the smallest p-value receives the largest rank. Using the same p-values as above, sorting gives 0.05, 0.10 and 0.25. The p = .05 would be ranked 3, the p = .10 would be ranked 2 and the p = .25 would be ranked 1.

You then multiply the p-value by its rank, like the table below, to get your adjusted p-values:

Original p-value   Sorted p-value   Rank   p-value x rank
0.05               0.05             3      0.15
0.25               0.10             2      0.20
0.10               0.25             1      0.25
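We can check this with R’s built-in p.adjust() function, which returns the adjusted values in the original (unsorted) order:

p_vals <- c(0.05, 0.25, 0.10)
p.adjust(p_vals, method = "holm")
## [1] 0.15 0.25 0.20

(Strictly speaking, p.adjust() also ensures that adjusted p-values never decrease as the original p-values increase, but that step makes no difference in this example.)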

The Bonferroni procedure is very conservative, in that it may overcorrect and cause us to miss genuine effects (i.e. inflate the Type II error rate); the Holm correction still controls the overall FWER without increasing the Type II error rate as much.

7.3.4 Post-hocs in R

To just get adjusted p-values, we can use the function pairwise.t.test(). This needs three arguments:

  1. x refers to the column of data for the outcome, expressed as data$x.
  2. g refers to the group/IV, again as data$g.
  3. p.adjust.method refers to the adjustment method. Note this argument does not allow Tukey adjustments; of the methods covered above, only Bonferroni and Holm are available.
pairwise.t.test(x = w9_slices$rating, g = w9_slices$group, p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  w9_slices$rating and w9_slices$group 
## 
##         caramel lemon 
## lemon   0.6866  -     
## vanilla 0.0541  0.0017
## 
## P value adjustment method: bonferroni

A more powerful and flexible approach is to use the emmeans package. emmeans stands for estimated marginal means, which are model-derived estimates of each group’s mean. Simple effects tests are generally run on the estimated marginal means, and so this package gives us a handy way to run these comparisons.

There are two steps to using emmeans:

  1. Define the estimated marginal means for the post-hocs.

This is simple to do. We call the emmeans() function, which will set up all comparisons for the post-hoc. This function needs two arguments at minimum:

  • The aov model name, and
  • The variable whose levels you want to compare (i.e. the group/IV). This is given as a one-sided formula, ~ group, where group denotes our IV.

Imagine an ANOVA model named model_aov, with variable A that defines the groups in our ANOVA. Our first step would be to set up the comparisons using emmeans() like below:

model_em <- emmeans(model_aov, ~ A)
  2. Generate pairwise comparisons for the post-hoc.

Continuing on from the example above, we can use the pairs() function on our EM object, model_em, to run the post-hocs. infer = TRUE will generate confidence intervals for the difference in EM means. By default this will be set at 95% confidence, which is typically what we want.

pairs(model_em, infer = TRUE)

If we want to adjust the p-values at this stage, we can give pairs() an additional argument adjust, with the name of the adjustment method as a string. pairs() will let you use Tukey, Bonferroni and Holm adjustments.

pairs(model_em, infer = TRUE, adjust = "holm")

Optionally, to do this all in one go we could simply pipe from emmeans() to pairs() as follows:

emmeans(model_aov, ~ A) %>%
  pairs(infer = TRUE, adjust = "holm")
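Putting the two steps together, here is a minimal end-to-end sketch. The data frame dat, its columns A and outcome, and the simulated values are made up purely for illustration and are not part of the course data:

library(emmeans)
library(magrittr)   # provides %>% (also loaded via dplyr/tidyverse)

# Hypothetical data: three groups measured on a continuous outcome
set.seed(1)
dat <- data.frame(
  A       = rep(c("a1", "a2", "a3"), each = 20),
  outcome = c(rnorm(20, 10), rnorm(20, 12), rnorm(20, 12.5))
)

model_aov <- aov(outcome ~ A, data = dat)   # one-way ANOVA
summary(model_aov)                          # omnibus test

emmeans(model_aov, ~ A) %>%                 # step 1: estimated marginal means
  pairs(infer = TRUE, adjust = "holm")      # step 2: Holm-adjusted post-hoc comparisons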