7.5 One-way ANOVA

In the previous section, we looked at how to compare differences between either two groups or two points in time. But how do we compare more than two? The answer is the ANOVA (analysis of variance) - one of the most common general statistical models for hypothesis testing. If you do some statistical testing as part of your thesis, the ANOVA is likely to be one of your most useful tools.

7.5.1 The basic one-way ANOVA

The between-groups one-way ANOVA is the most basic form of ANOVA, and it tests for differences between two or more groups. (Typically, ANOVAs are used when there are three or more groups, but they can just as easily be used when there are only two.)

We’ve briefly mentioned the general hypotheses for all ANOVAs, but here they are again:

  • \(H_0\): The means between groups are not significantly different. (i.e. \(\mu_1 = \mu_2 = ... = \mu_k\))
  • \(H_1\): At least one group mean is significantly different from the others. (i.e. \(\mu_i \neq \mu_j\) for at least one pair of groups)

7.5.2 Example data

In the seminar, we talk about an example from Watts et al. (2003). Below is another simple example, comparing taste ratings across three different types of slices: caramel slices, vanilla slices and lemon slices. Participants were randomly allocated to taste one of the three slices blindfolded, and were then asked to verbally rate its taste on a scale from 1-10 (10 being super tasty).

w9_slices <- read_csv(here("data", "week_9", "w9_slices.csv"))
## Rows: 63 Columns: 2
## ── Column specification ────────────────────────
## Delimiter: ","
## chr (1): group
## dbl (1): rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Convert group to a factor so R treats it as a categorical grouping variable
w9_slices$group <- factor(w9_slices$group)

head(w9_slices)
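
Before running any tests, it's worth glancing at the group means and spread. Here's a quick sketch using dplyr (loaded as part of the tidyverse); the variable names match the dataset above:

# Descriptive statistics for each group
w9_slices %>%
  group_by(group) %>%
  summarise(n = n(),
            mean_rating = mean(rating),
            sd_rating = sd(rating))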

7.5.3 Assumption checks

In R, to test assumptions we need to run the ANOVA first. This is because some assumptions rely on values that are calculated when the ANOVA is run. Other programs, such as Jamovi and SPSS, will hide this manual work and present the information to you automatically. This is not quite the case in R, but that’s ok!

The basic function for running an ANOVA in R is the aov() function, which takes a formula input (much like t.test()). Anything created using aov() should be assigned to a new variable so that we can use it later. This creates an aov object that will a) test our main effects and b) let us check some of our assumptions.

w9_slices_aov <- aov(rating ~ group, data = w9_slices)

There are three main assumptions to check for a basic ANOVA.

  • Data must be independent of each other - in other words, one person’s response should not be influenced by another. This should come as a feature of good experimental design.
  • The residuals should be normally distributed. We can assess this using the same methods as for t-tests: either a QQ plot or a normality test (e.g. Shapiro-Wilk); both are shown below. You can access the residuals from the aov object directly like this:
w9_slices_aov$residuals
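
For the QQ plot option, base R's qqnorm() and qqline() can be applied straight to these residuals (a minimal sketch; ggpubr's ggqqplot() is another option if you prefer ggplot-style plots):

# QQ plot of the ANOVA residuals
qqnorm(w9_slices_aov$residuals)
qqline(w9_slices_aov$residuals)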

The same residuals also go into the SW test:

shapiro.test(w9_slices_aov$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  w9_slices_aov$residuals
## W = 0.95117, p-value = 0.01409

The SW test on this data suggests the assumption is violated (W = .951, p = .014); in other words, the residuals are not normally distributed.

This is a bit of a problem in our dataset because our sample is relatively small. However, the ANOVA is fairly robust to violations of the normality assumption, meaning that non-normal residuals aren't a major problem so long as a) you have a big enough sample size and b) the skew isn't huge (or driven by outliers).
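
To gauge how bad the skew is, a quick look at the residuals themselves is enough (a minimal sketch using base R graphics):

# Eyeball the residuals for skew and obvious outliers
hist(w9_slices_aov$residuals)
boxplot(w9_slices_aov$residuals)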

  • The equality of variance (homoscedasticity) assumption.

This means that the variances within each group are the same. We test this with Levene’s test, via the levene_test() function from rstatix. Helpfully, levene_test() is flexible enough to work with either a formula or an aov object we have already fitted:

# Either one of these will work
w9_slices %>%
  levene_test(rating ~ group, center = "mean")

levene_test(w9_slices_aov, center = "mean")

The assumption doesn’t appear to be violated (F(2, 60) = .721, p = .491). If it were, we might consider using a Welch’s ANOVA instead, which can be done with welch_anova_test() from rstatix.
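
For reference, a Welch’s ANOVA on this data would look something like this (a minimal sketch assuming rstatix is loaded; we don’t actually need it here because the assumption holds):

# Welch's ANOVA does not assume equal variances between groups
w9_slices %>%
  welch_anova_test(rating ~ group)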

For now though, we’ll press ahead.

7.5.4 Output

Here’s our main output from the ANOVA, generated using the summary() function. This is called an omnibus test, because it is a general test of the hypotheses:

summary(w9_slices_aov)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## group        2  22.22  11.111   6.897 0.00201 **
## Residuals   60  96.67   1.611                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our main results suggest that there is a significant difference between means (F(2, 60) = 6.90, p = .002).
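
The write-up at the end of this section also reports an effect size, \(\eta^2\). One simple way to get it is by hand from the ANOVA table above, as the group sum of squares divided by the total sum of squares (a quick sketch; packages such as effectsize or lsr also provide functions for this):

# eta squared = SS_group / (SS_group + SS_residuals)
22.22 / (22.22 + 96.67)   # roughly .187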

7.5.5 Post-hoc tests

The output above told us that we could reject our basic null hypothesis - that there was no difference between means. However, remember that it doesn’t tell us where those differences are. To pinpoint them, we need to follow up the ANOVA with post-hoc tests.

There are a couple of ways to generate post-hocs. Tukey tests can be calculated with TukeyHSD() (base R) or with emmeans() from the emmeans package:

TukeyHSD(w9_slices_aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = rating ~ group, data = w9_slices)
## 
## $group
##                       diff        lwr       upr     p adj
## lemon-caramel   -0.4761905 -1.4175618 0.4651809 0.4486738
## vanilla-caramel  0.9523810  0.0110096 1.8937523 0.0467882
## vanilla-lemon    1.4285714  0.4872001 2.3699428 0.0015928
emmeans(w9_slices_aov, ~ group) %>%
  pairs(infer = TRUE, adjust = "tukey")
##  contrast          estimate    SE df lower.CL upper.CL t.ratio p.value
##  caramel - lemon      0.476 0.392 60   -0.465    1.418   1.216  0.4487
##  caramel - vanilla   -0.952 0.392 60   -1.894   -0.011  -2.431  0.0468
##  lemon - vanilla     -1.429 0.392 60   -2.370   -0.487  -3.647  0.0016
## 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates 
## P value adjustment: tukey method for comparing a family of 3 estimates

Here, you can see that we run a test for caramel vs lemon, caramel vs vanilla and lemon vs vanilla. For post-hocs in one-way ANOVAs, it is ok to stick to the Tukey-adjusted p-values. We can see that:

  • There is no significant difference between ratings of caramel and lemon slices (mean difference, MD = .48; p = .449).
  • There is a (marginally) significant difference between caramel and vanilla slices; participants rated vanilla slices higher than caramel slices (MD = .95, p = .047).
  • There is a significant difference between lemon and vanilla slices, in that participants preferred vanilla slices (MD = 1.43, p = .002).

To generate Bonferroni- or Holm-corrected p-values, one option is the pairwise.t.test() function:

pairwise.t.test(x = w9_slices$rating, g = w9_slices$group, p.adjust.method = "holm")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  w9_slices$rating and w9_slices$group 
## 
##         caramel lemon 
## lemon   0.2289  -     
## vanilla 0.0361  0.0017
## 
## P value adjustment method: holm

Here, the correction is controlled by the p.adjust.method argument; if you leave it out, this function uses Holm corrections by default. Alternatively, just change the adjust argument in your emmeans() call:

emmeans(w9_slices_aov, ~ group) %>%
  pairs(infer = TRUE, adjust = "holm")
##  contrast          estimate    SE df lower.CL upper.CL t.ratio p.value
##  caramel - lemon      0.476 0.392 60   -0.489   1.4410   1.216  0.2289
##  caramel - vanilla   -0.952 0.392 60   -1.917   0.0124  -2.431  0.0361
##  lemon - vanilla     -1.429 0.392 60   -2.393  -0.4638  -3.647  0.0017
## 
## Confidence level used: 0.95 
## Conf-level adjustment: bonferroni method for 3 estimates 
## P value adjustment: holm method for 3 tests

Here is how a write-up might look:

A one-way ANOVA was conducted to examine whether there were differences in preferences for types of slices. There was a large, significant main effect of slice type (F(2, 60) = 6.90, p = .002, \(\eta^2\) = .187). Post-hoc tests with Tukey HSD corrections showed that participants rated vanilla slices higher than caramel slices (mean difference, MD = .95, p = .047), as well as significantly higher than lemon slices (MD = 1.43, p = .002). However, there was no significant difference between caramel and lemon slices (p = .449).