If you are going to use the Bonferroni-correction on the post-hoc tests
of your oneway ANOVA, you might as well skip the ANOVA. In fact, you
should skip it, as this two-step procedure reduces your ability to
reject false null hypotheses.
My motivation is that I have, in several reviews now, been puzzled by authors' insistence on doing ANOVAs followed by Bonferroni-correction, which in my eyes defeats the purpose of doing the ANOVA in the first place.
A standard tool for anyone looking to draw inferences about whether
means differ between groups of measurements is the analysis of variance
(ANOVA)[1].
The main selling point of the ANOVA is that it will tell you whether at
least one mean differs from at least one other mean, without you having
to worry about all the possible pairwise comparisons and the multiple
comparisons problem they entail (with \(k\) groups there are \(k(k-1)/2\)
possible pairwise comparisons - three in the example below).
So, for example, say you have a dataset with three treatments:
set.seed(7) # for replicability
n.obs <- 25 # a group of research subjects that each undergo the treatments below
treatment.A <- rnorm(n.obs, mean=2, sd=1) ## a "large" effect of 2 units
treatment.B <- rnorm(n.obs, mean=0.75, sd=1) ## a "small" effect of 0.75 units
treatment.C <- rnorm(n.obs, mean=0, sd=1) ## no effect
## building our data frame
measurement <- c(treatment.A, treatment.B, treatment.C)
treatment.level <- factor(rep(c('A', 'B', 'C'), each=n.obs))
data <- data.frame(measurement=measurement, treatment=treatment.level)
## Building a linear model
model <- lm(measurement ~ treatment, data=data)
## Present the ANOVA results
print(anova(model))
## Analysis of Variance Table
##
## Response: measurement
##           Df Sum Sq Mean Sq F value    Pr(>F)
## treatment  2 70.969  35.484  39.337 2.848e-12 ***
## Residuals 72 64.948   0.902
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Lo and behold! Our F-value is far higher than we would expect if our null model were true, i.e. if one single mean value described all the groups, so we reject the null model.
Now, you would like to know which treatment differs from which - this is
where the post-hoc tests come in. Will summary help?
print(summary(model))
##
## Call:
## lm(formula = measurement ~ treatment, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.85203 -0.64685  0.00364  0.57808  2.25272
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.4640     0.1900  12.972  < 2e-16 ***
## treatmentB   -1.7007     0.2686  -6.331 1.85e-08 ***
## treatmentC   -2.2957     0.2686  -8.546 1.47e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9498 on 72 degrees of freedom
## Multiple R-squared: 0.5221, Adjusted R-squared: 0.5089
## F-statistic: 39.34 on 2 and 72 DF, p-value: 2.848e-12
This gives us two-thirds of our story - the intercept line tells us
that treatment A differs from 0, the second line that the estimated
effect of treatment B is significantly smaller (-1.70 units) than that
of treatment A, and the third line that the effect of treatment C is
also significantly smaller (-2.30 units).
Some things to note:
- One comparison is missing, i.e. the comparison of treatments B and
C.
- Our output is structured as if we did linear regression, giving us an
estimated intercept and two slope estimates.
- In fact, we did do linear regression - ANOVA is a special case of
linear regression[2].
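Two quick sanity checks of those last two points (my addition - the values should match the summary output above):
## with dummy (treatment) coding, the intercept is the mean of treatment A and the
## other coefficients are differences from that mean
mean(treatment.A)                     # should match the intercept, ~2.46
mean(treatment.B) - mean(treatment.A) # should match the treatmentB coefficient, ~-1.70
mean(treatment.C) - mean(treatment.A) # should match the treatmentC coefficient, ~-2.30
## and aov(), R's dedicated ANOVA interface, is just a wrapper around lm(), so it
## reproduces the F-test from anova(lm(...)) above (F = 39.34 on 2 and 72 df)
print(summary(aov(measurement ~ treatment, data=data)))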
You could use pairwise.t.test, but I prefer multcomp::glht, since it
provides the t-values as well, and also allows for greater flexibility
in specifying your contrasts when you have more complicated designs.
library(multcomp)
## first we build a hypothesis matrix that has all the contrasts that we want to do
hypothesis.matrix <- matrix(data=0, nrow=3, ncol=3) # a 3x3 matrix, rows are n.tests we want to do and columns are our three treatments
hypothesis.matrix[1, c(1, 2)] <- c(1, -1) # H0: treatmentA minus treatmentB = 0
hypothesis.matrix[2, c(1, 3)] <- c(1, -1) # H0: treatmentA minus treatmentC = 0
hypothesis.matrix[3, c(2, 3)] <- c(1, -1) # H0: treatmentB minus treatmentC = 0
print(hypothesis.matrix)
##      [,1] [,2] [,3]
## [1,]    1   -1    0
## [2,]    1    0   -1
## [3,]    0    1   -1
These are our pairwise t-tests of differences; the null hypotheses are specified in the code comments above.
## now we apply it
model <- lm(measurement ~ treatment, data=data) # our model is the same as earlier, just being explicit
t.tests <- glht(model, hypothesis.matrix)
print(summary(t.tests, test=adjusted('none')))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = measurement ~ treatment, data = data)
##
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)
## 1 == 0   4.1647     0.4247   9.805 6.66e-15 ***
## 2 == 0   4.7597     0.4247  11.206  < 2e-16 ***
## 3 == 0   0.5950     0.2686   2.215   0.0299 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)
Hurray! All our tests are significant: the estimated differences between A and B (4.16), between A and C (4.76) and between B and C (0.60) are all significantly different from the 0 that our null hypotheses specified.
You set adjusted('none')
- everyone knows you need to
correct for multiple comparisons. Use the Bonferroni-correction, you
dimwit!
When conducting several tests within the same model, we need to
control the family-wise error rate (FWER)[3].
Doing each of the three tests at \(\alpha = 0.05\) will result in us falsely
rejecting at least one null hypothesis more than 5 % of the time (for three
independent tests, up to \(1 - 0.95^3 \approx 14\ \%\)).
The Bonferroni-correction handles this by dividing our \(\alpha\)-level by the number of tests
that we do, i.e.
\(\alpha_{BONF} = 0.05/3 = 0.0166...\)
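Equivalently, you can multiply the raw _p_-values by the number of tests, which is what multcomp does below. A minimal sketch of the arithmetic with base R's p.adjust, using the B-versus-C _p_-value reported above (0.0299):
## not part of the original analysis - just illustrating the Bonferroni arithmetic
p.adjust(0.0299, method='bonferroni', n=3) # 0.0897: the raw p-value multiplied by 3
0.0299 < 0.05/3                            # FALSE: equivalently, compare against alpha/3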
## Bonferroni-correction
print(summary(t.tests, test=adjusted('bonferroni')))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = measurement ~ treatment, data = data)
##
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)
## 1 == 0   4.1647     0.4247   9.805    2e-14 ***
## 2 == 0   4.7597     0.4247  11.206   <2e-16 ***
## 3 == 0   0.5950     0.2686   2.215   0.0898 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- bonferroni method)
All _p_-values have been multiplied by 3, and now that we have controlled the family-wise error rate, the estimated difference between treatments B and C is no longer significant.
I have asked myself before: if you are going to do the Bonferroni-correction anyway, why do the ANOVA at all? Why not go straight to the Bonferroni-corrected t-tests? (The answer to this rhetorical question is, of course, that you should go straight to the Bonferroni-corrected t-tests of your linear model.) Until recently, I have myself espoused the procedure that for oneway ANOVAs you should not correct the post-hoc tests, as long as you only perform them when the null model (no difference between any conditions) can be rejected based on the F-value. The reasoning was that this would control the error rate at \(\alpha = 0.05\), because you would only do the post-hoc testing when the null model could be rejected at \(\alpha = 0.05\).
However, I have recently learnt that this only amounts to weak control of the family-wise error rate (the error rate is only controlled in the situation where all null hypotheses are true), whereas what many strive for (as implicitly revealed by their Bonferroni-correction) is strong control of the family-wise error rate[4] (the error rate is controlled no matter which combination of null hypotheses happens to be true).
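In symbols (my restatement of the standard definitions), for a family of null hypotheses \(H_1, \ldots, H_m\):
- weak control: \(P(\text{reject at least one } H_i \mid \text{all } H_i \text{ true}) \leq \alpha\)
- strong control: \(P(\text{reject at least one true } H_i) \leq \alpha\) for every combination of true and false \(H_i\)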
What I will argue, using a simulation approach, is that doing the ANOVA first and the Bonferroni-correction afterwards actually harms your power, i.e. it controls the family-wise error rate too conservatively.
The idea of the function below is that, similarly to before, we run a oneway ANOVA with three levels, but this time we make sure there is no true effect (the measurements for the three treatments are all drawn from a normal distribution with \(\mu=0\) and \(\sigma=1\)). To obtain an estimate of the long-run error rate, we run the test "many" times and count the number of falsely rejected null hypotheses, global (F) and local (t).
# prints the Type-1 error rates for different levels of F.alpha and t.alpha
# three levels (A, B, C) are hardcoded at present
simulate.oneway <- function(n.tests, n.obs, F.alpha, t.alpha, correction,
                            do.ANOVA=TRUE)
{
    print(paste('Parameters: n.tests:', n.tests, '; n.obs:', n.obs,
                '; F.alpha:', F.alpha,
                '; t.alpha:', t.alpha,
                '; correction:', correction,
                '; do.ANOVA:', do.ANOVA))
    ## the same pairwise contrasts as above
    hypothesis.matrix <- matrix(data=0, nrow=3, ncol=3)
    hypothesis.matrix[1, c(1, 2)] <- c(1, -1)
    hypothesis.matrix[2, c(1, 3)] <- c(1, -1)
    hypothesis.matrix[3, c(2, 3)] <- c(1, -1)
    false.positives.F <- 0
    false.positives.t <- 0
    for(test.index in 1:n.tests)
    {
        if(test.index %% (n.tests / 10) == 0)
        {
            print(paste(test.index, 'of', n.tests,
                        'simulations run'))
        }
        ## all three groups are drawn from the same null distribution
        meas.A <- rnorm(n.obs) ## expected value: 0
        meas.B <- rnorm(n.obs) ## expected value: 0
        meas.C <- rnorm(n.obs) ## expected value: 0
        meas <- c(meas.A, meas.B, meas.C)
        condition <- factor(rep(c('A', 'B', 'C'), each=n.obs))
        data <- data.frame(meas=meas, condition=condition)
        model <- lm(meas ~ condition, data=data)
        results <- anova(model)
        ## post-hoc t-tests are only run when the omnibus F-test rejects (the gatekeeper procedure)
        if(results$`Pr(>F)`[1] < F.alpha & do.ANOVA)
        {
            false.positives.F <- false.positives.F + 1
            if(correction == 'Tukey')
            {
                t.p <- summary(glht(model,
                                    linfct = mcp(condition = 'Tukey')))$test$pvalues
            } else
            {
                t.p <- summary(glht(model, hypothesis.matrix),
                               test=adjusted(correction))$test$pvalues
            }
            # print(summary(glht(model, hypothesis.matrix)))
            # print(summary(glht(model, hypothesis.matrix), test=adjusted('none')))
            false.positives.t <- false.positives.t + sum(t.p < t.alpha)
        } else if(!do.ANOVA)
        {
            ## skip the ANOVA gatekeeper and go straight to the (corrected) t-tests
            t.p <- summary(glht(model, hypothesis.matrix),
                           test=adjusted(correction))$test$pvalues
            false.positives.t <- false.positives.t + sum(t.p < t.alpha)
        }
        # else print('Not a valid combination of "do.ANOVA" and "correction"')
    }
    print(paste('Proportion of falsely rejected F.tests',
                false.positives.F / n.tests))
    print(paste('Proportion of falsely rejected t.tests',
                false.positives.t / (3*n.tests)))
    cat('\n')
}
Now, we are going to run the actual simulations with 100000 (1e5) repetitions.
set.seed(7) ## 7 is my lucky number
## Bonferroni correction WITH ANOVA first
simulate.oneway(n.tests=1e5, n.obs=25, F.alpha=0.05, t.alpha=0.05,
correction='bonferroni', do.ANOVA=TRUE)
## [1] "Parameters: n.tests: 1e+05 ; n.obs: 25 ; F.alpha: 0.05 ; t.alpha: 0.05 ; correction: bonferroni ; do.ANOVA: TRUE"
## [1] "10000 of 1e+05 simulations run"
## [1] "20000 of 1e+05 simulations run"
## [1] "30000 of 1e+05 simulations run"
## [1] "40000 of 1e+05 simulations run"
## [1] "50000 of 1e+05 simulations run"
## [1] "60000 of 1e+05 simulations run"
## [1] "70000 of 1e+05 simulations run"
## [1] "80000 of 1e+05 simulations run"
## [1] "90000 of 1e+05 simulations run"
## [1] "100000 of 1e+05 simulations run"
## [1] "Proportion of falsely rejected F.tests 0.0505"
## [1] "Proportion of falsely rejected t.tests 0.0143933333333333"
## Bonferroni correction WITHOUT ANOVA first
simulate.oneway(n.tests=1e5, n.obs=25, F.alpha=0.05, t.alpha=0.05,
correction='bonferroni', do.ANOVA=FALSE)
## [1] "Parameters: n.tests: 1e+05 ; n.obs: 25 ; F.alpha: 0.05 ; t.alpha: 0.05 ; correction: bonferroni ; do.ANOVA: FALSE"
## [1] "10000 of 1e+05 simulations run"
## [1] "20000 of 1e+05 simulations run"
## [1] "30000 of 1e+05 simulations run"
## [1] "40000 of 1e+05 simulations run"
## [1] "50000 of 1e+05 simulations run"
## [1] "60000 of 1e+05 simulations run"
## [1] "70000 of 1e+05 simulations run"
## [1] "80000 of 1e+05 simulations run"
## [1] "90000 of 1e+05 simulations run"
## [1] "100000 of 1e+05 simulations run"
## [1] "Proportion of falsely rejected F.tests 0"
## [1] "Proportion of falsely rejected t.tests 0.0169233333333333"
We see that fitting the model and doing the Bonferroni-corrected tests straight away (without first deciding whether or not to reject the null model based on the F-value) results in an error rate (~\(0.0169\)) that is closer to the desired per-test error rate, namely \(0.05/3 = 0.0166...\), than the procedure where an ANOVA is done first (error rate \(= 0.0144\)).
When multiplying these estimated error rates by 3, we get back the
\(\alpha\)-values they correspond to:
- \(\alpha_{with\ ANOVA} = 0.043\)
- \(\alpha_{without\ ANOVA} = 0.051\)
Conclusion: the two-step procedure (first the ANOVA, then the Bonferroni-corrected tests) is too conservative, and as a consequence false null hypotheses may end up not being rejected.
So why this difference? Why should it matter whether you let the ANOVA act as a gatekeeper for the (Bonferroni-corrected) pairwise comparisons? The reason is that a pairwise null hypothesis is now only rejected when two hurdles are cleared: the F-test must be significant, and the Bonferroni-corrected t-test must be significant as well. The probability of both happening is necessarily smaller than the probability of the corrected t-test alone being significant, so the effective error rate - and with it the power - drops below the nominal level.
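To quantify that power cost directly, one could extend the simulation to include a true effect. Here is a minimal sketch (my addition, not part of the simulations above; simulate.power.BC is a hypothetical helper and no numbers are reported here): it adds one modest true difference, with treatment B shifted by 0.5 SD relative to A and C, and counts how often each procedure rejects the B-versus-C null hypothesis.
## sketch: estimated power for the B-vs-C comparison, with and without the ANOVA gatekeeper
simulate.power.BC <- function(n.sims=2000, n.obs=25, effect=0.5)
{
    ## the same contrasts as earlier; row 3 is the B-vs-C comparison
    K <- rbind(c(1, -1, 0), c(1, 0, -1), c(0, 1, -1))
    hits <- c(with.ANOVA=0, without.ANOVA=0)
    for(sim.index in 1:n.sims)
    {
        meas <- c(rnorm(n.obs, mean=0),      # A: no effect
                  rnorm(n.obs, mean=effect), # B: a true effect of 0.5 SD
                  rnorm(n.obs, mean=0))      # C: no effect
        condition <- factor(rep(c('A', 'B', 'C'), each=n.obs))
        model <- lm(meas ~ condition)
        ## Bonferroni-adjusted p-value for the B-vs-C comparison
        p.BC <- summary(glht(model, K),
                        test=adjusted('bonferroni'))$test$pvalues[3]
        F.significant <- anova(model)$`Pr(>F)`[1] < 0.05
        if(F.significant & p.BC < 0.05)
        {
            hits['with.ANOVA'] <- hits['with.ANOVA'] + 1
        }
        if(p.BC < 0.05)
        {
            hits['without.ANOVA'] <- hits['without.ANOVA'] + 1
        }
    }
    hits / n.sims # proportion of simulations in which the B-vs-C null was rejected
}
## e.g.: print(simulate.power.BC(n.sims=2000, n.obs=25, effect=0.5))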
What about the weak-control procedure that I have used before (first the ANOVA, then uncorrected t-tests) - how well does it fare when judged against the strong criterion?
set.seed(7)
simulate.oneway(n.tests=1e5, n.obs=25, F.alpha=0.05, t.alpha=0.05,
correction='none', do.ANOVA=TRUE)
## [1] "Parameters: n.tests: 1e+05 ; n.obs: 25 ; F.alpha: 0.05 ; t.alpha: 0.05 ; correction: none ; do.ANOVA: TRUE"
## [1] "10000 of 1e+05 simulations run"
## [1] "20000 of 1e+05 simulations run"
## [1] "30000 of 1e+05 simulations run"
## [1] "40000 of 1e+05 simulations run"
## [1] "50000 of 1e+05 simulations run"
## [1] "60000 of 1e+05 simulations run"
## [1] "70000 of 1e+05 simulations run"
## [1] "80000 of 1e+05 simulations run"
## [1] "90000 of 1e+05 simulations run"
## [1] "100000 of 1e+05 simulations run"
## [1] "Proportion of falsely rejected F.tests 0.0505"
## [1] "Proportion of falsely rejected t.tests 0.0244133333333333"
It is too liberal, controlling the family-wise error rate at an
estimated rate of:
- \(\alpha_{ANOVA\ no\ correction} = 0.073\)

Worse than 0.05, but not as bad as I had feared.
There are other approaches to correcting the post-hoc tests that get
closer to the desired level, such as Tukey's HSD and Scheffé's method.
Tukey's HSD is also implemented in my function (correction='Tukey') -
test for yourself.
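For example, a Tukey run could look like this (my suggested call, not run here - note that, as written, the function only routes correction='Tukey' through mcp() when do.ANOVA=TRUE, so it has to be used together with the ANOVA gatekeeper):
## Tukey HSD post-hoc tests, with the ANOVA as gatekeeper (slow - multcomp's adjustment
## is recomputed in every simulation where the F-test rejects)
simulate.oneway(n.tests=1e4, n.obs=25, F.alpha=0.05, t.alpha=0.05,
                correction='Tukey', do.ANOVA=TRUE)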
I have not systematically covered how this depends on n.obs, but my
testing indicates that it does not depend much on it.