If you are going to use the Bonferroni-correction on the post-hoc tests
of your oneway ANOVA, you might as well skip the ANOVA. In fact, you
should skip it, as this two-step procedure reduces your ability to
reject false null hypotheses.
My motivation is that I have, in several reviews now, been puzzled by authors' insistence on doing ANOVAs followed by Bonferroni-correction, which in my eyes defeats the purpose of doing the ANOVA in the first place.
A standard tool for anyone looking to draw inferences about whether
means differ between groups of measurements is the analysis of variance
(ANOVA)[1].
The main selling point of the ANOVA is that it will tell you whether at
least one mean differs from at least one other mean, without you having
to worry about all the possible pairwise comparisons and the multiple
comparisons problem they entail (with \(k\) groups there are \(k(k-1)/2\)
possible pairwise comparisons - three in the example below).
So, for example, say you have a dataset with three treatments:
set.seed(7) # for replicability
n.obs <- 25 # a group of research subjects that each undergo the treatments below
treatment.A <- rnorm(n.obs, mean=2, sd=1) ## a "large" effect of 2 units
treatment.B <- rnorm(n.obs, mean=0.75, sd=1) ## a "small" effect of 0.75 units
treatment.C <- rnorm(n.obs, mean=0, sd=1) ## no effect
## building our data frame
measurement <- c(treatment.A, treatment.B, treatment.C)
treatment.level <- factor(rep(c('A', 'B', 'C'), each=n.obs))
data <- data.frame(measurement=measurement, treatment=treatment.level)
## Building a linear model
model <- lm(measurement ~ treatment, data=data)
## Present the ANOVA results
print(anova(model))
## Analysis of Variance Table
##
## Response: measurement
##           Df Sum Sq Mean Sq F value    Pr(>F)
## treatment  2 70.969  35.484  39.337 2.848e-12 ***
## Residuals 72 64.948   0.902
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Lo and behold! Our F-value is far higher than we would expect if our null model were true, i.e. if one single mean value described all the groups, so we reject the null model.
Now, you would like to know which treatment differs from which - this is
where the post-hoc tests come in. Will summary help?
print(summary(model))
##
## Call:
## lm(formula = measurement ~ treatment, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.85203 -0.64685  0.00364  0.57808  2.25272
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.4640     0.1900  12.972  < 2e-16 ***
## treatmentB   -1.7007     0.2686  -6.331 1.85e-08 ***
## treatmentC   -2.2957     0.2686  -8.546 1.47e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9498 on 72 degrees of freedom
## Multiple R-squared: 0.5221, Adjusted R-squared: 0.5089
## F-statistic: 39.34 on 2 and 72 DF, p-value: 2.848e-12
This gives us two-thirds of our story - the intercept line tells us
that treatment A differs from 0, the second line that the estimated
effect of treatment B is significantly smaller (-1.70 units) than that
of treatment A, and the third line that the effect of treatment C is
also significantly smaller (-2.30 units).
Some things to note:
- One comparison is missing, i.e. the comparison of treatments B and
C.
- Our output is structured as if we did linear regression, giving us an
estimated intercept and two slope estimates.
- In fact, we did do linear regression - ANOVA is a special case of
linear regression[2].
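Two quick sanity checks of those last two points (my addition - the values should match the summary output above):
## with dummy (treatment) coding, the intercept is the mean of treatment A and the
## other coefficients are differences from that mean
mean(treatment.A)                     # should match the intercept, ~2.46
mean(treatment.B) - mean(treatment.A) # should match the treatmentB coefficient, ~-1.70
mean(treatment.C) - mean(treatment.A) # should match the treatmentC coefficient, ~-2.30
## and aov(), R's dedicated ANOVA interface, is just a wrapper around lm(), so it
## reproduces the F-test from anova(lm(...)) above (F = 39.34 on 2 and 72 df)
print(summary(aov(measurement ~ treatment, data=data)))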
You could use pairwise.t.test, but I prefer multcomp::glht, since it
provides the t-values as well, and also allows for greater flexibility
in specifying your contrasts when you have more complicated designs.
library(multcomp)
## first we build a hypothesis matrix that has all the contrasts that we want to do
hypothesis.matrix <- matrix(data=0, nrow=3, ncol=3) # a 3x3 matrix, rows are n.tests we want to do and columns are our three treatments
hypothesis.matrix[1, c(1, 2)] <- c(1, -1) # H0: treatmentA minus treatmentB = 0
hypothesis.matrix[2, c(1, 3)] <- c(1, -1) # H0: treatmentA minus treatmentC = 0
hypothesis.matrix[3, c(2, 3)] <- c(1, -1) # H0: treatmentB minus treatmentC = 0
print(hypothesis.matrix)
##      [,1] [,2] [,3]
## [1,]    1   -1    0
## [2,]    1    0   -1
## [3,]    0    1   -1
These are our pairwise t-tests of differences; the null hypotheses are specified in the code comments above.
## now we apply it
model <- lm(measurement ~ treatment, data=data) # our model is the same as earlier, just being explicit
t.tests <- glht(model, hypothesis.matrix)
print(summary(t.tests, test=adjusted('none')))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = measurement ~ treatment, data = data)
##
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)
## 1 == 0   4.1647     0.4247   9.805 6.66e-15 ***
## 2 == 0   4.7597     0.4247  11.206  < 2e-16 ***
## 3 == 0   0.5950     0.2686   2.215   0.0299 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)
Hurray! All our tests are significant: the estimated differences between A and B (4.16), between A and C (4.76) and between B and C (0.60) are all significantly different from the 0 that our null hypotheses specified.
You set adjusted('none')
- everyone knows you need to
correct for multiple comparisons. Use the Bonferroni-correction, you
dimwit!
When conducting several tests within the same model, we need to
control the family-wise error rate (FWER)[3].
Doing each of the three tests at \(\alpha = 0.05\) will result in us falsely
rejecting at least one null hypothesis more than 5 % of the time (for three
independent tests, up to \(1 - 0.95^3 \approx 14\ \%\)).
The Bonferroni-correction handles this by dividing our \(\alpha\)-level by the number of tests
that we do, i.e.
\(\alpha_{BONF} = 0.05/3 = 0.0166...\)
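Equivalently, you can multiply the raw _p_-values by the number of tests, which is what multcomp does below. A minimal sketch of the arithmetic with base R's p.adjust, using the B-versus-C _p_-value reported above (0.0299):
## not part of the original analysis - just illustrating the Bonferroni arithmetic
p.adjust(0.0299, method='bonferroni', n=3) # 0.0897: the raw p-value multiplied by 3
0.0299 < 0.05/3                            # FALSE: equivalently, compare against alpha/3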
## Bonferroni-correction
print(summary(t.tests, test=adjusted('bonferroni')))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = measurement ~ treatment, data = data)
##
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)
## 1 == 0   4.1647     0.4247   9.805    2e-14 ***
## 2 == 0   4.7597     0.4247  11.206   <2e-16 ***
## 3 == 0   0.5950     0.2686   2.215   0.0898 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- bonferroni method)
All _p_-values have been multiplied by 3, and now that we have controlled the family-wise error rate, the estimated difference between treatments B and C is no longer significant.
I have asked myself before: if you are going to do the Bonferroni-correction anyway, why do the ANOVA at all? Why not go straight to the Bonferroni-corrected t-tests? (The answer to this rhetorical question is, of course, that you should go straight to the Bonferroni-corrected t-tests of your linear model.) Until recently, I have myself espoused the procedure that for oneway ANOVAs you should not correct the post-hoc tests, as long as you only perform them when the null model (no difference between any conditions) can be rejected based on the F-value. The reasoning was that this would control the error rate at \(\alpha = 0.05\), because you would only do the post-hoc testing when the null model could be rejected at \(\alpha = 0.05\).
However, I have recently learnt that this only amounts to weak control of the family-wise error rate (the error rate is only controlled in the situation where all null hypotheses are true), whereas what many strive for (as implicitly revealed by their Bonferroni-correction) is strong control of the family-wise error rate[4] (the error rate is controlled no matter which combination of null hypotheses happens to be true).
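In symbols (my restatement of the standard definitions), for a family of null hypotheses \(H_1, \ldots, H_m\):
- weak control: \(P(\text{reject at least one } H_i \mid \text{all } H_i \text{ true}) \leq \alpha\)
- strong control: \(P(\text{reject at least one true } H_i) \leq \alpha\) for every combination of true and false \(H_i\)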
What I will argue, using a simulation approach, is that doing the ANOVA first and the Bonferroni-correction afterwards actually harms your power, i.e. it controls the family-wise error rate too conservatively.
The idea of the function below is that, similarly to before, we run a oneway ANOVA with three levels, but this time we make sure there is no true effect (the measurements for the three treatments are all drawn from a normal distribution with \(\mu=0\) and \(\sigma=1\)). To obtain an estimate of the long-run error rate, we run the test "many" times and count the number of falsely rejected null hypotheses, global (F) and local (t).
# prints the Type-1 error rates for different levels of F.alpha and t.alpha
# three levels (A, B, C) are hardcoded at present
simulate.oneway <- function(n.tests, n.obs, F.alpha, t.alpha, correction,
                            do.ANOVA=TRUE)
{
    print(paste('Parameters: n.tests:', n.tests, '; n.obs:', n.obs,
                '; F.alpha:', F.alpha,
                '; t.alpha:', t.alpha,
                '; correction:', correction,
                '; do.ANOVA:', do.ANOVA))
    ## the same pairwise contrasts as above
    hypothesis.matrix <- matrix(data=0, nrow=3, ncol=3)
    hypothesis.matrix[1, c(1, 2)] <- c(1, -1)
    hypothesis.matrix[2, c(1, 3)] <- c(1, -1)
    hypothesis.matrix[3, c(2, 3)] <- c(1, -1)
    false.positives.F <- 0
    false.positives.t <- 0
    for(test.index in 1:n.tests)
    {
        if(test.index %% (n.tests / 10) == 0)
        {
            print(paste(test.index, 'of', n.tests,
                        'simulations run'))
        }
        ## all three groups are drawn from the same null distribution
        meas.A <- rnorm(n.obs) ## expected value: 0
        meas.B <- rnorm(n.obs) ## expected value: 0
        meas.C <- rnorm(n.obs) ## expected value: 0
        meas <- c(meas.A, meas.B, meas.C)
        condition <- factor(rep(c('A', 'B', 'C'), each=n.obs))
        data <- data.frame(meas=meas, condition=condition)
        model <- lm(meas ~ condition, data=data)
        results <- anova(model)
        ## post-hoc t-tests are only run when the omnibus F-test rejects (the gatekeeper procedure)
        if(results$`Pr(>F)`[1] < F.alpha & do.ANOVA)
        {
            false.positives.F <- false.positives.F + 1
            if(correction == 'Tukey')
            {
                t.p <- summary(glht(model,
                                    linfct = mcp(condition = 'Tukey')))$test$pvalues
            } else
            {
                t.p <- summary(glht(model, hypothesis.matrix),
                               test=adjusted(correction))$test$pvalues
            }
            # print(summary(glht(model, hypothesis.matrix)))
            # print(summary(glht(model, hypothesis.matrix), test=adjusted('none')))
            false.positives.t <- false.positives.t + sum(t.p < t.alpha)
        } else if(!do.ANOVA)
        {
            ## skip the ANOVA gatekeeper and go straight to the (corrected) t-tests
            t.p <- summary(glht(model, hypothesis.matrix),
                           test=adjusted(correction))$test$pvalues
            false.positives.t <- false.positives.t + sum(t.p < t.alpha)
        }
        # else print('Not a valid combination of "do.ANOVA" and "correction"')
    }
    print(paste('Proportion of falsely rejected F.tests',
                false.positives.F / n.tests))
    print(paste('Proportion of falsely rejected t.tests',
                false.positives.t / (3*n.tests)))
    cat('\n')
}
Now, we are going to run the actual simulations with 100000 (1e5) repetitions.
set.seed(7) ## 7 is my lucky number
## Bonferroni correction WITH ANOVA first
simulate.oneway(n.tests=1e5, n.obs=25, F.alpha=0.05, t.alpha=0.05,
correction='bonferroni', do.ANOVA=TRUE)
## [1] "Parameters: n.tests: 1e+05 ; n.obs: 25 ; F.alpha: 0.05 ; t.alpha: 0.05 ; correction: bonferroni ; do.ANOVA: TRUE"
## [1] "10000 of 1e+05 simulations run"
## [1] "20000 of 1e+05 simulations run"
## [1] "30000 of 1e+05 simulations run"
## [1] "40000 of 1e+05 simulations run"
## [1] "50000 of 1e+05 simulations run"
## [1] "60000 of 1e+05 simulations run"
## [1] "70000 of 1e+05 simulations run"
## [1] "80000 of 1e+05 simulations run"
## [1] "90000 of 1e+05 simulations run"
## [1] "100000 of 1e+05 simulations run"
## [1] "Proportion of falsely rejected F.tests 0.0505"
## [1] "Proportion of falsely rejected t.tests 0.0143933333333333"
## Bonferroni correction WITHOUT ANOVA first
simulate.oneway(n.tests=1e5, n.obs=25, F.alpha=0.05, t.alpha=0.05,
correction='bonferroni', do.ANOVA=FALSE)
## [1] "Parameters: n.tests: 1e+05 ; n.obs: 25 ; F.alpha: 0.05 ; t.alpha: 0.05 ; correction: bonferroni ; do.ANOVA: FALSE"
## [1] "10000 of 1e+05 simulations run"
## [1] "20000 of 1e+05 simulations run"
## [1] "30000 of 1e+05 simulations run"
## [1] "40000 of 1e+05 simulations run"
## [1] "50000 of 1e+05 simulations run"
## [1] "60000 of 1e+05 simulations run"
## [1] "70000 of 1e+05 simulations run"
## [1] "80000 of 1e+05 simulations run"
## [1] "90000 of 1e+05 simulations run"
## [1] "100000 of 1e+05 simulations run"
## [1] "Proportion of falsely rejected F.tests 0"
## [1] "Proportion of falsely rejected t.tests 0.0169233333333333"
We see that fitting the model and doing the Bonferroni-corrected tests straight away (without first deciding whether or not to reject the null model based on the F-value) results in an error rate (~\(0.0169\)) that is closer to the desired per-test error rate, namely \(0.05/3 = 0.0166...\), than the procedure where an ANOVA is done first (error rate \(= 0.0144\)).
When multiplying these estimated error rates by 3, we get back the
\(\alpha\)-values they correspond to:
- \(\alpha_{with\ ANOVA} = 0.043\)
- \(\alpha_{without\ ANOVA} = 0.051\)
Conclusion: the two-step procedure (first the ANOVA, then the Bonferroni-corrected tests) is too conservative, and as a consequence false null hypotheses may end up not being rejected.
So why this difference? Why should it matter whether you let the ANOVA act as a gatekeeper for the (Bonferroni-corrected) pairwise comparisons? The reason is that a pairwise null hypothesis is now only rejected when two hurdles are cleared: the F-test must be significant, and the Bonferroni-corrected t-test must be significant as well. The probability of both happening is necessarily smaller than the probability of the corrected t-test alone being significant, so the effective error rate - and with it the power - drops below the nominal level.
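To quantify that power cost directly, one could extend the simulation to include a true effect. Here is a minimal sketch (my addition, not part of the simulations above; simulate.power.BC is a hypothetical helper and no numbers are reported here): it adds one modest true difference, with treatment B shifted by 0.5 SD relative to A and C, and counts how often each procedure rejects the B-versus-C null hypothesis.
## sketch: estimated power for the B-vs-C comparison, with and without the ANOVA gatekeeper
simulate.power.BC <- function(n.sims=2000, n.obs=25, effect=0.5)
{
    ## the same contrasts as earlier; row 3 is the B-vs-C comparison
    K <- rbind(c(1, -1, 0), c(1, 0, -1), c(0, 1, -1))
    hits <- c(with.ANOVA=0, without.ANOVA=0)
    for(sim.index in 1:n.sims)
    {
        meas <- c(rnorm(n.obs, mean=0),      # A: no effect
                  rnorm(n.obs, mean=effect), # B: a true effect of 0.5 SD
                  rnorm(n.obs, mean=0))      # C: no effect
        condition <- factor(rep(c('A', 'B', 'C'), each=n.obs))
        model <- lm(meas ~ condition)
        ## Bonferroni-adjusted p-value for the B-vs-C comparison
        p.BC <- summary(glht(model, K),
                        test=adjusted('bonferroni'))$test$pvalues[3]
        F.significant <- anova(model)$`Pr(>F)`[1] < 0.05
        if(F.significant & p.BC < 0.05)
        {
            hits['with.ANOVA'] <- hits['with.ANOVA'] + 1
        }
        if(p.BC < 0.05)
        {
            hits['without.ANOVA'] <- hits['without.ANOVA'] + 1
        }
    }
    hits / n.sims # proportion of simulations in which the B-vs-C null was rejected
}
## e.g.: print(simulate.power.BC(n.sims=2000, n.obs=25, effect=0.5))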
What about the weak-control procedure that I have used before (first the ANOVA, then uncorrected t-tests) - how well does it fare when judged against the strong criterion?
set.seed(7)
simulate.oneway(n.tests=1e5, n.obs=25, F.alpha=0.05, t.alpha=0.05,
correction='none', do.ANOVA=TRUE)
## [1] "Parameters: n.tests: 1e+05 ; n.obs: 25 ; F.alpha: 0.05 ; t.alpha: 0.05 ; correction: none ; do.ANOVA: TRUE"
## [1] "10000 of 1e+05 simulations run"
## [1] "20000 of 1e+05 simulations run"
## [1] "30000 of 1e+05 simulations run"
## [1] "40000 of 1e+05 simulations run"
## [1] "50000 of 1e+05 simulations run"
## [1] "60000 of 1e+05 simulations run"
## [1] "70000 of 1e+05 simulations run"
## [1] "80000 of 1e+05 simulations run"
## [1] "90000 of 1e+05 simulations run"
## [1] "100000 of 1e+05 simulations run"
## [1] "Proportion of falsely rejected F.tests 0.0505"
## [1] "Proportion of falsely rejected t.tests 0.0244133333333333"
It is too liberal, controlling the family-wise error rate at an
estimated rate of:
- \(\alpha_{ANOVA\ no\ correction} = 0.073\)

Worse than 0.05, but not as bad as I had feared.
There are other approaches to correcting the post-hoc tests that get
closer to the desired level, such as Tukey's HSD and Scheffé's method.
Tukey's HSD is also implemented in my function (correction='Tukey') -
test for yourself.
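For example, a Tukey run could look like this (my suggested call, not run here - note that, as written, the function only routes correction='Tukey' through mcp() when do.ANOVA=TRUE, so it has to be used together with the ANOVA gatekeeper):
## Tukey HSD post-hoc tests, with the ANOVA as gatekeeper (slow - multcomp's adjustment
## is recomputed in every simulation where the F-test rejects)
simulate.oneway(n.tests=1e4, n.obs=25, F.alpha=0.05, t.alpha=0.05,
                correction='Tukey', do.ANOVA=TRUE)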
I have not systematically covered how this depends on n.obs, but my
testing indicates that it does not depend much on it.