The one-way analysis of variance (One-Way ANOVA) is used to test the hypothesis about population means for an experimental setting of one factor (one explanatory variable) with 3 or more groups. An initial F-test is performed to assess if the population means of the groups are all equal. Once the F-test is performed, if the null hypothesis is not rejected, then the study did not provide sufficient evidence to conclude that there is a difference between the factor level means. On the other hand, if the null hypothesis is rejected, then the focus shifts to comparing combinations or pairs of means for the different groups through either single or multiple comparison methods.
Assumptions:
1) For each group, the response variable is normally distributed.
2) The observations are independent.
3) The variance is constant (the same in each group).
The analysis of variance method is robust to both deviations from the assumption of normality and of constant variance when the sample sizes between groups are equal (or near equal). For this reason, the use of a balanced experimental design with groups that have the same number of specimens is recommended.
The overall F-test for the one-way analysis of variance will test if the population means of k groups \((\mu_1, \mu_2, \cdots, \mu_k)\) are all equal.
Assumptions:
1) For each group, the response variable is normally distributed.
2) The observations are independent.
3) The variance is constant.
Process:
(1) In practice, the reader should expect to reject the null hypothesis, as it is unlikely that all population means will be equal.
(2) The reader should be careful in drawing the correct conclusion from the F-test. If the null hypothesis H0 is rejected, the reader should only infer that the k population means are not identical. This could occur if a single mean is significantly different than the other means, which could all be equal. For example, when comparing 5 groups, the null hypothesis will be rejected if \(\mu_1\) is significantly different than the other four means that are equal \(\mu_2=\mu_3=\mu_4=\mu_5\). To understand how the means differ between the 5 groups, a multiple comparison test should be performed.
Comparing Mean Wear of Five Dental Materials: A researcher is interested in comparing the amount of material loss to dental restorations among 5 different materials as a result of long-term wear. For each material, 8 specimens are subjected to 100,000 cycles of three-body wear with the use of a machine, and the amount of wear, as material loss, for the 40 specimens is measured at the completion of the cycles. A classic ANOVA F-test is performed to test the null hypothesis that the population means for the 5 groups are equal H0: \(\mu_1=\mu_2=\cdots=\mu_5\). The p-value from the F-test was 0.0023, which is less than \(\alpha=.05\), so the researcher rejects the null hypothesis H0 and concludes that at least one of the 5 materials experienced a mean wear that is significantly different than the others. To determine which pairs of means are significantly different and estimate the magnitude of their difference, the researcher will need to perform a multiple comparison test such as Tukey’s pairwise comparison test or produce simultaneous confidence intervals.
If the null hypothesis of the ANOVA overall F-test is rejected, the researcher can investigate how the group means differ through either a single or a multiple comparison method. A single comparison is used when there is one specific comparison of interest to the researcher. For example, the researcher may be interested in how a single new dental material compares to the average of two other materials already in use. In this case, a single comparison method will be used to compare \(\mu_1\) to \(\frac{\mu_2+\mu_3}{2}\).
On the other hand, if the researcher is interested in making more than one comparison, such as comparing all 3 pairs of the 3 means as in two-sample t-tests, then a multiple comparison method should be used that properly controls the full experiment-wise error rate associated with testing the three null hypotheses: \(\mu_1=\mu_2, \mu_1 = \mu_3\), and \(\mu_2 = \mu_3\). Such a method will also provide simultaneous confidence intervals for estimating the magnitudes of the differences \(\mu_1-\mu_2, \mu_1 - \mu_3\), and \(\mu_2 - \mu_3\).
Assumptions:
1) For each group, the response variable is normally distributed.
2) The observations are independent.
3) The variance is constant.
a) Single Contrast
A contrast is a special type of expression involving the means \(\mu_1, \mu_2, \cdots, \mu_k\) that will be described in more detail below. To provide the reader with intuition about the concept of a contrast, we first introduce two illustrative examples in which contrasts are used to make desired comparisons.
Comparing Mean Wear with a Contrast 1: Suppose a researcher would like to compare the mean material loss \(\mu_1\) that a new dental material experiences when subjected to a three-body wear procedure to the average of the material losses incurred by three (materials 2, 3, and 5) of four other materials in common use. Material 4 is not included in this comparison. Then, the contrast \(\mu_1 - \frac{\mu_2+\mu_3+\mu_5}{3}\) is of interest. A 2-sided hypothesis test of H0: \(\mu_1 - \frac{\mu_2+\mu_3+\mu_5}{3}=0\) vs Ha: \(\mu_1 - \frac{\mu_2+\mu_3+\mu_5}{3}\neq 0\) can be performed, or a confidence interval for this contrast can be produced to estimate its magnitude.
Comparing Mean Wear with a Contrast 2: Suppose a researcher is interested in comparing 2 dental materials in terms of how much material loss occurs due to three-body wear. An ACTA machine is used to simulate three-body wear at three levels: 100, 150, and 200 thousand cycles. Each treatment is applied to 8 specimens of each of the two materials, and there are six mean material losses to estimate: \(\mu_1, \mu_2, \mu_3\) from material one, and \(\mu_4, \mu_5, \mu_6\) from material two. One way to compare these materials is to compare the average material loss of the two materials over these three treatments. This leads to the contrast \(\frac{(\mu_1+\mu_2+\mu_3)}{3}-\frac{\mu_4+\mu_5+\mu_6}{3}\), which we can subject to a hypothesis test or estimate its value with a confidence interval.
In the two preceding examples, the contrast was a combination \(L=c_1\mu_1+c_2\mu_2+\cdots+c_k\mu_k\) of the means \(\mu_1, \mu_2, \cdots, \mu_k\) with coefficients \(c_1, c_2, \cdots, c_k\) that sum to zero. In example 1, the experiment had 5 total treatment groups, and the contrast \(L=\mu_1 – \frac{(\mu_2+\mu_3+\mu_5)}{3}\) had coefficients \(c_1=1, c_2=-\frac{1}{3}, c_3=-\frac{1}{3}, c_4=0\) and \(c_5=-\frac{1}{3}\), which sum to 0. In example 2 with 6 treatment groups, the contrast \(L=\frac{(\mu_1+\mu_2+\mu_3)}{3}-\frac{(\mu_4+\mu_5+\mu_6)}{3}\) had coefficients \(c_1=\frac{1}{3}, c_2=\frac{1}{3}, c_3=\frac{1}{3}, c_4=-\frac{1}{3}, c_5=-\frac{1}{3},\) and \(c_6=-\frac{1}{3}\), which sum to 0. Contrasts allow a very general uniform procedure for testing a wide variety of hypotheses about means of different groups.
The formula for the contrast L is stated in its full generality, and by choosing different coefficients \(c_1, c_2, \cdots, c_k\) the researcher can compare the means of the groups in different ways. For example, given 3 groups, to compare the population means \(\mu_1\) with \(\mu_2\), the researcher will choose \(c_1=1, c_2=-1\), and \(c_3=0\), and so the contrast is \(L=\mu_1-\mu_2\). On the other hand, to compare the population mean \(\mu_1\) with the average of \(\mu_2\) and \(\mu_3\), the researcher will choose \(c_1=1, c_2=-0.5\), and \(c_3=-0.5\), and so the contrast is \(L=\mu_1-\frac{\mu_2+\mu_3}{2}\).
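As an illustration (using simulated data and hypothetical group means, not values from any example above), a contrast estimate can be computed in R by applying the chosen coefficients to the group sample means:

```r
# Sketch: estimating the contrast L = mu1 - (mu2 + mu3)/2 from simulated data
set.seed(1)
y <- c(rnorm(7, 10, 2), rnorm(7, 12, 2), rnorm(7, 13, 2))  # 3 groups of 7
g <- rep(c("1", "2", "3"), each = 7)
cvec <- c(1, -0.5, -0.5)        # contrast coefficients; they sum to zero
means <- tapply(y, g, mean)     # sample mean of each group
Lhat <- sum(cvec * means)       # point estimate of the contrast L
Lhat
```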
b) Hypothesis Test
Process:
c) Confidence Interval
The \((1-\alpha)100\%\) confidence interval of the contrast, \(L\), is \(\hat{L} \pm t_{\alpha/2, N-k} s\{\hat{L}\}\), where \(t_{\alpha/2, N-k}\) is the upper \(\frac{\alpha}{2}\)th percentile of the t-distribution with \(N-k\) degrees of freedom.
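A minimal sketch of this interval in base R, assuming equal group sizes and simulated data (all numbers here are illustrative):

```r
# Sketch: t-based (1 - alpha)100% confidence interval for a single contrast
set.seed(42)
k <- 3; n <- 7; N <- k * n
y <- c(rnorm(n, 10, 2), rnorm(n, 12, 2), rnorm(n, 13, 2))  # simulated responses
g <- factor(rep(1:k, each = n))
cvec <- c(1, -0.5, -0.5)               # contrast L = mu1 - (mu2 + mu3)/2
fit <- aov(y ~ g)
MSE <- summary(fit)[[1]][2, 3]         # mean squared error (residual row)
means <- tapply(y, g, mean)
Lhat <- sum(cvec * means)              # estimate of L
sL <- sqrt(MSE * sum(cvec^2) / n)      # s{L-hat} when every group has size n
alpha <- 0.05
tcrit <- qt(1 - alpha/2, N - k)        # upper alpha/2 percentile of t
ci <- c(lower = Lhat - tcrit * sL, upper = Lhat + tcrit * sL)
ci
```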
Comparing Translucency of 3 Dental Materials: A researcher sets out to compare the optical properties of a new dental material (monolithic zirconia pre-colored in A2) with those of two dental materials in common use (porcelain-veneered ceramics colored in A2 of type I and type II). For each of the 3 groups, 7 specimens are produced in the lab and the translucency of the 21 specimens is measured using a spectrophotometer. The researcher is interested in comparing the translucency parameter of the new material with the older porcelain-veneered ceramics, so a comparison between population mean \(\mu_1\) of group 1 with the average of the means \(\frac{\mu_2+\mu_3}{2}\) for groups 2 and 3 is set up. The researcher therefore sets \(c_1=1, c_2=-0.5\), and \(c_3=-0.5\), to produce the contrast \(L=\mu_1-\frac{\mu_2+\mu_3}{2}\). If the hypothesis test has a p-value less than or equal to the significance level \(\alpha=0.05\), the researcher rejects the null hypothesis and concludes that the population mean \(\mu_1\) is significantly different than the average of the population means \(\mu_2\) and \(\mu_3\) for groups 2 and 3. However, if the p-value exceeds \(\alpha\), then the researcher does not have sufficient evidence to conclude \(\mu_1\) is different than the average of the population means \(\mu_2\) and \(\mu_3\) at this level.
The single contrast method should be used when the researcher has one pre-planned contrast to test. For testing more than one contrast, the researcher should use the Bonferroni method for a small number of contrasts, or the Scheffé method when the number of contrasts is large.
Assumptions:
1) For each group, the response variable is normally distributed.
2) The observations are independent.
3) The variance is constant.
The Tukey method is designed to test all pairwise comparisons \(D_{ij}=\mu_i-\mu_j\) among the k groups, where \(i, j=1, \cdots, k\), and \(i \neq j\).
a) Hypothesis Test
Process
b) Confidence Interval
The \((1-\alpha)100\%\) confidence interval of the pairwise comparisons is \(\hat{D}_{ij}\pm \frac{1}{\sqrt{2}} q_{\alpha, k, N-k} \cdot s\{\hat{D}_{ij} \}\), where \(q_{\alpha, k, N-k}\) is the upper \(\alpha\)th percentile of the Studentized range distribution.
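This interval can be computed by hand with the Studentized range quantile function qtukey; a sketch with simulated data (group means loosely modeled on the wear example, but all numbers are illustrative):

```r
# Sketch: Tukey simultaneous confidence interval for one pairwise difference
set.seed(7)
k <- 5; n <- 8; N <- k * n
y <- c(rnorm(n, 65, 10), rnorm(n, 255, 10), rnorm(n, 257, 10),
       rnorm(n, 236, 10), rnorm(n, 50, 10))
g <- factor(rep(1:k, each = n))
fit <- aov(y ~ g)
MSE <- summary(fit)[[1]][2, 3]           # mean squared error
means <- tapply(y, g, mean)
Dhat <- means[["2"]] - means[["1"]]      # estimate of D_21 = mu2 - mu1
sD <- sqrt(MSE * (1/n + 1/n))            # s{D-hat} for equal group sizes
qcrit <- qtukey(1 - 0.05, k, N - k)      # upper 5th percentile of q(k, N-k)
half <- qcrit / sqrt(2) * sD             # half-width of the interval
c(lower = Dhat - half, upper = Dhat + half)
```

The interval should agree with the corresponding row of `TukeyHSD(fit)`.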
Comparing Mean Wear of Five Dental Materials: A researcher is interested in comparing the amount of material loss to dental restorations among 5 different materials as a result of long-term wear. For each material, 8 specimens are subjected to 100,000 cycles of three-body wear with the use of a machine, and the amount of wear, as material loss, for the 40 specimens is measured at the completion of the cycles. A classic ANOVA F-test is performed to test the null hypothesis that the population means for the 5 groups are equal H0: \(\mu_1 = \mu_2 = \cdots = \mu_5\). The total sample size is N=40. The p-value for the hypothesis test is 2.00E-16, which is less than the significance level \(\alpha=0.05\), so the researcher rejects the null hypothesis H0 and concludes that at least one of the 5 population means for wear is significantly different than the other means. To determine which pairs of means are significantly different, the researcher performs Tukey’s post-hoc test. Tukey’s method will simultaneously perform all 10 pairwise comparisons between the 5 groups (G2-G1, G3-G1, G4-G1, G5-G1, G3-G2, G4-G2, G5-G2, G4-G3, G5-G3, and G5-G4) and generate simultaneous 95% confidence intervals.
The software output above shows that the 6 pairwise comparisons of groups labeled with a star * are significant, as their p-values are less than 0.05. The remaining four pairwise comparisons of groups (G5-G1, G3-G2, G4-G2, and G4-G3) are not significant as the individual p-values exceed the significance level .05. The pairwise comparison between groups 1 and 2 shows that the mean difference \(D_{12}=\mu_2-\mu_1\) is estimated to be \(184.18 \pm 34.18\), and so since 0 is not in this interval, the mean \(\mu_2\) is significantly larger than \(\mu_1\).
(1) In practice, some research papers report the results of experiments using the mean and standard deviation for each group and summarize the observed values as \(\hat{\mu}\pm SD\). For example, a researcher may report the wear for dental materials for 5 groups as \(65 \pm 4 \mu m\) for group 1, \(255 \pm 13 \mu m\) for group 2, \(257 \pm 24 \mu m\) for group 3, \(236 \pm 31 \mu m\) for group 4, and \(50 \pm 15 \mu m\) for group 5. If the populations are normally distributed, each reported interval (one standard deviation about the mean) carries roughly 68% confidence for its individual group, so the simultaneous confidence in the list of intervals as a whole is substantially lower. The Tukey confidence intervals, on the other hand, are simultaneous 95% confidence intervals, so the researcher can, with at least 95% confidence, produce interval estimates of all pairwise differences \(\mu_i-\mu_j\) at once.
(2) Tukey’s method is more powerful for performing hypothesis tests on all pairwise comparisons than other available methods, such as the Bonferroni and Scheffé methods.
The Bonferroni procedure is used to control the type I error rate when simultaneously testing multiple hypotheses about \(g\) pre-planned contrasts. To simultaneously test \(g\) null hypotheses about the value of these contrasts \((L_1=0, L_2=0, \cdots, L_g=0)\), the Bonferroni method adjusts the individual significance level of each hypothesis by dividing \(\alpha\) by the number \(g\) of hypotheses. The Bonferroni method can be used for all research designs and is most useful when the number of simultaneous contrasts, \(g\), is not too large.
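The adjustment can be viewed two equivalent ways: compare each p-value to \(\alpha/g\), or inflate each p-value by the factor \(g\) and compare to \(\alpha\). A small sketch with made-up p-values:

```r
# Sketch: Bonferroni adjustment for g = 3 pre-planned contrasts
# (the p-values below are illustrative, not from any example in this chapter)
p <- c(0.004, 0.020, 0.300)
g <- length(p)
alpha <- 0.05
reject <- p < alpha / g                       # compare each p to alpha/g = 0.0167
padj <- p.adjust(p, method = "bonferroni")    # equivalently pmin(g * p, 1)
reject
padj
```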
a) Hypothesis Test
Process:
b) Confidence Interval
The \((1-\alpha)100\%\) confidence interval for each contrast \(L_i\) is \(\hat{L}_i \pm t_{\frac{\alpha}{2g}, N-k} \cdot s\{\hat{L}_i \}\), where \(t_{\frac{\alpha}{2g}, N-k}\) is the upper \(\frac{\alpha}{2g}\)th percentile of the t-distribution.
Effect of Sintering Temperature on Translucency: To assess the effect of sintering temperatures of dental restorations on the translucency of a given dental material, the researcher prepares 90 specimens of this dental material and randomly subdivides these into 3 groups. The 30 specimens in each group are sintered at the following temperatures: group 1 – 1350\(^\circ\)C, group 2 – 1450\(^\circ\)C, and group 3 – 1600\(^\circ\)C. A spectrophotometer is used to measure the translucency of each specimen. The researcher performs the ANOVA overall F-test (\(p \leq 0.05)\) and rejects the null hypothesis that the three population means are equal. The researcher is interested in assessing how each mean compares to the overall average of the 3 means \(\frac{\mu_1+\mu_2+\mu_3}{3}\). The three contrasts are therefore \(L_1 = \mu_1 – \frac{\mu_1+\mu_2+\mu_3}{3}=\frac{2}{3}\mu_1 – \frac{1}{3}\mu_2 – \frac{1}{3}\mu_3, L_2=\mu_2-\frac{\mu_1+\mu_2+\mu_3}{3}=-\frac{1}{3}\mu_1 + \frac{2}{3}\mu_2-\frac{1}{3}\mu_3\) and \(L_3=\mu_3-\frac{\mu_1+\mu_2+\mu_3}{3}=-\frac{1}{3}\mu_1-\frac{1}{3}\mu_2+\frac{2}{3}\mu_3\). The Bonferroni method is used to simultaneously test the three null hypotheses \(L_1 =0, L_2=0, L_3=0\).
For the contrast \(L_1\), the p-value is less than \(\frac{\alpha}{g}=\frac{0.05}{3}=0.01667\), and so the researcher concludes that \(\mu_1\) is significantly different than average of the 3 means. The same holds for the other two contrasts and means. Moreover, since the confidence intervals for the 3 contrasts do not overlap and the estimates for \(L_1, L_2\), and \(L_3\) are increasing, the researcher concludes that increased sintering temperatures lead to a higher translucency of the material.
For contrasts that are not pairwise comparisons, the Bonferroni method gives narrower confidence intervals than other methods provided the number of contrasts g is small. Some statisticians suggest that if the number of contrasts exceeds the number of groups, then Scheffé’s method should be used in lieu of Bonferroni. Moreover, if the researcher identifies contrasts of interest while analyzing the data (hence not pre-planned), they should test those contrasts using Scheffé’s method rather than Bonferroni, since the latter method would give a false bound on the type I error rate.
a) Hypothesis Test
Process:
b) Confidence Interval
The \((1-\alpha)100\%\) confidence interval for each contrast \(L_i\) is \(\hat{L}_i \pm \sqrt{(k-1)F(\alpha; k-1, N-k)} \ s\{\hat{L}_i\}\), where \(F(\alpha; k-1, N-k)\) is the upper \(\alpha\)th percentile of the F-distribution with the numerator degrees of freedom \(k-1\) and denominator degrees of freedom \(N-k\).
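The Scheffé multiplier depends only on \(k\), \(N\), and \(\alpha\); a sketch for a design with \(k = 4\) groups and \(N = 80\) specimens at \(\alpha = 0.05\) (these sizes match the toughness example, but any design could be substituted):

```r
# Sketch: Scheffe critical multiplier sqrt((k-1) * F(alpha; k-1, N-k))
k <- 4; N <- 80; alpha <- 0.05
S <- sqrt((k - 1) * qf(1 - alpha, k - 1, N - k))  # qf(1 - alpha, ...) is the upper alpha percentile
S   # every simultaneous interval is Lhat +/- S * s{Lhat}
```

Because the same multiplier covers every possible contrast, it is larger than the single-contrast t critical value, which is the price paid for simultaneous coverage.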
Comparing Fracture Toughness Across Surface Treatments: To investigate if the strength of the adherence between a resin composite block and the resin composite luting agent (RCLA) depends on the surface roughness of the block, a researcher prepares 80 specimens, subjecting each group of 20 to no sanding (RCLA), or to sanding using 600, 320, and 60 grit SiC papers, respectively. The interfacial fracture toughness is measured for each of the 80 specimens. The researcher performs the ANOVA overall F-test \((p \leq 0.05)\) and rejects the null hypothesis that the four population means are equal. The researcher then visually inspects the boxplot and decides to do a post-hoc analysis to assess if the mean \(\mu_4\) for group 4 (RCLA) is significantly different from the other 3 means.
Since this is a post-hoc analysis, the researcher should use Scheffé’s method. The following five contrasts are set up for a simultaneous hypothesis test at level \(\alpha=0.05\):
The software output below indicates that all five contrasts are significantly different from 0, since the individual p-values are less than .05. With 95% confidence, the researcher concludes that \(\mu_4\) is larger than the other 3 means as well as their average.
Scheffé’s method can be used post-hoc; that is, patterns that a researcher finds while performing data analysis can be tested with this method. Scheffé’s method is conservative, but it should replace the Bonferroni method when the number of contrasts exceeds the number of groups.
Multiple comparison methods can be used for a variety of research goals: multiple comparisons of specified groups to the best of the other groups, multiple comparisons of specified groups to a control group, all pairwise comparisons, etc. The Tukey, Bonferroni, and Scheffé methods are 3 commonly used approaches to address these research questions; however, they are conservative methods. If the patterns witnessed are of sufficient interest to warrant further investigation, a follow-up study with pre-planned comparisons can be performed, which produces narrower confidence intervals and has higher power. For example, Hsu’s method applies to multiple comparisons with the best treatment group, and Dunnett’s method to comparisons with the control group.
R Code and Examples
Tukey HSD Method: R script file
###-----------------------
### Tukey HSD Method
###-----------------------
# Generate data
set.seed(168585)
Group1 <- rnorm(8, 65, 4)
Group2 <- rnorm(8, 255, 13)
Group3 <- rnorm(8, 257, 24)
Group4 <- rnorm(8, 236, 31)
Group5 <- rnorm(8, 50, 15)
wear <- c(Group1, Group2, Group3, Group4, Group5)
group <- c(rep("1", 8), rep("2", 8), rep("3", 8), rep("4", 8), rep("5", 8))
Material <- data.frame(wear = wear, group = group)
# One Way ANOVA Model & Overall F test
mod <- aov(wear~group, data = Material)
summary(mod)
# TukeyHSD test
TukeyHSD(mod)
Bonferroni Method: R script file
###-----------------------
### Bonferroni Method
###-----------------------
# Generate data
set.seed(168585)
Group1 <- rnorm(30, 15.28, .43)
Group2 <- rnorm(30, 17.14, .71)
Group3 <- rnorm(30, 18.26, .36)
VYZacolored <- c(Group1, Group2, Group3)
temp <- c(rep("1350", 30), rep("1450", 30), rep("1600", 30))
Translucency <- data.frame(VYZacolored = VYZacolored, temp = temp)
# One Way ANOVA model & Overall F test
temp.aov <- aov(VYZacolored~temp, data = Translucency)
summary(temp.aov)
# Bonferroni method
# overall mean
overall.mean <- mean(Translucency$VYZacolored)
# treatment effects
effect<-aggregate(Translucency$VYZacolored, list(Translucency$temp), mean)
MSE<-summary(temp.aov)[[1]][2,3]
DFE<-summary(temp.aov)[[1]][2,1]
alpha<-0.05
# Contrast
contrast <- c("mu1-(mu1+mu2+mu3)/3", "mu2-(mu1+mu2+mu3)/3", "mu3-(mu1+mu2+mu3)/3")
cont.matrix<-matrix(c(2/3,-1/3,-1/3,-1/3,2/3,-1/3,-1/3,-1/3,2/3),byrow=T,ncol =3)
Li<-effect$x%*%t(cont.matrix)
Sci<-sqrt(MSE*apply(cont.matrix^2,1,sum)/30)
lower<-Li - qt(1-alpha/(2*3),DFE)*Sci
upper<-Li + qt(1-alpha/(2*3),DFE)*Sci
Tstat <- Li/Sci
# Unadjusted two-sided p-values; for the Bonferroni test, compare each to alpha/g = 0.05/3
Pvalue <- (1-pt(abs(Tstat), DFE))*2
results <- print(t(rbind(Li, lower, upper, Pvalue)))
Scheffé method: R script file
###-----------------------
### Scheffé method
###-----------------------
# Toughness example
set.seed(168585)
Group1 <- rnorm(20, .85, .31)
Group2 <- rnorm(20, .89, .17)
Group3 <- rnorm(20, .96, .23)
Group4 <- rnorm(20, 1.62, .22)
toughness <- c(Group1, Group2, Group3, Group4)
storage <- c(rep("600", 20), rep("320", 20), rep("60", 20), rep("VELC", 20))
tough <- data.frame(toughness=toughness, storage=storage)
# Boxplot
tough$grouporder <- factor(tough$storage, levels=c("600", "320", "60", "VELC"))
boxplot(toughness~grouporder,data=tough,col=c(2:5),xlab="Storage",ylab="Toughness")
# One Way ANOVA model & Overall F test
tough.aov <- aov(toughness~storage, data = tough)
summary(tough.aov)
# Scheffé method
# overall mean
overall.mean <- mean(tough$toughness)
# treatment effects
# use the ordered factor so the means align with mu1 (600), mu2 (320), mu3 (60), mu4 (VELC)
effect<-aggregate(tough$toughness, list(tough$grouporder), mean)
MSE<-summary(tough.aov)[[1]][2,3]
DFE<-summary(tough.aov)[[1]][2,1]
alpha<-0.05
# Contrast
contrast <- c("mu4-mu1","mu4-mu2","mu4-mu3","mu4-(mu1+mu2+mu3)/3", "mu4-(mu1+mu2+mu3+mu4)/4")
cont.matrix<-matrix(c(-1, 0, 0, 1, 0, -1, 0, 1, 0, 0, -1, 1, -1/3, -1/3, -1/3, 1, -1/4, -1/4, -1/4, 3/4),byrow=T,ncol =4)
Li<-effect$x%*%t(cont.matrix)
Sci<-sqrt(MSE*apply(cont.matrix^2,1,sum)/20)
# qf(1-alpha, ...) gives the upper alpha percentile of the F distribution
lower<-Li - sqrt(3*qf(1-alpha, 3, DFE))*Sci
upper<-Li + sqrt(3*qf(1-alpha, 3, DFE))*Sci
Sstat <- Li/Sci
# Scheffe test statistic F* = (Lhat/s)^2/(k-1) with k-1 = 3; p-value from F(3, DFE)
Fstat <- Sstat^2/3
PvalueAdj <- 1 - pf(Fstat, 3, DFE)
results <- print(t(rbind(Li, lower, upper, PvalueAdj)))