In this post, I want to discuss a concept that is often mistaken by some of my collegues. With all crops, we are used to repeating experiments across years to obtain multi-year data; the structure of the resulting dataset is always the same and it is exemplified in the box below, that refers to a multi-year genotype experiment with winter wheat.
rm(list = ls())
library(tidyverse)
library(nlme)
library(emmeans)
filePath <- "https://www.casaonofri.it/_datasets/WinterWheat.csv"
dataset <- read.csv(filePath)
dataset <- dataset %>%
mutate(across(c(1:3, 5), .fns = factor))
head(dataset)
## Plot Block Genotype Yield Year
## 1 2 1 COLOSSEO 6.73 1996
## 2 110 2 COLOSSEO 6.96 1996
## 3 181 3 COLOSSEO 5.35 1996
## 4 2 1 COLOSSEO 6.26 1997
## 5 110 2 COLOSSEO 7.01 1997
## 6 181 3 COLOSSEO 6.11 1997
We can see that we have a column for the blocks, a column for the experimental factor (the genotype, in this instance), a column for the year and a column for the response variable (the yield, in this instance).
If we intend to take the genotype, year and block effects as fixed, we can fit the correct model with the lm() function and the resulting ANOVA table looks like the following one.
mod1 <- lm(Yield ~ Year/Block + Genotype*Year, data = dataset)
anova(mod1)
## Analysis of Variance Table
##
## Response: Yield
## Df Sum Sq Mean Sq F value Pr(>F)
## Year 6 159.279 26.5466 178.3996 < 2.2e-16 ***
## Genotype 7 11.544 1.6491 11.0824 2.978e-10 ***
## Year:Block 14 3.922 0.2801 1.8826 0.03738 *
## Year:Genotype 42 27.713 0.6598 4.4342 6.779e-10 ***
## Residuals 98 14.583 0.1488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we have made different experiments in different years (and with a different randomisation in each year, hopefully), the block effect can only be considered as nested within each year (‘Year/Block’ and not ’Year*Block’). As for the rest, the ANOVA table is very close to that for any types of two-factors factorial experiments.
The things change dramatically if we have a perennial crop, or a crop with a cycle lasting for more than one year, because yield measurements are taken repeatedly in the same plots across years. For this reason, even though the dataset looks very much like the previous one, it must be analysed in a totally different manner. Based on my experience as an editor of International Journals, I can say that several authors, still today, tend to forget this, resulting in several rejections or, at least, delays in publication.
Let’s consider the dataset below, that refers to the yield of lucerne genotypes in three consecutive years, taken from the same plots in a single experiment lasting for three years.
filePath <- "https://www.casaonofri.it/_datasets/alfalfa3years.csv"
dataset <- read.csv(filePath)
dataset <- dataset %>%
mutate(across(c(1:3), .fns = factor))
head(dataset)
## Plot Block Genotype Year Yield
## 1 1 1 4cascine 2006 6.631775
## 2 25 2 4cascine 2006 6.705397
## 3 60 3 4cascine 2006 6.499588
## 4 63 4 4cascine 2006 7.087686
## 5 7 1 casalina 2006 8.033558
## 6 33 2 casalina 2006 8.265165
If we analyse tha data as in the winter wheat example, we neglect that the observations are grouped within the same plots and, therefore, they are correlated. Consequently, the independence assumption is broken and it is no wonder that the manuscript must be rejected. What should we do, then?
First of all, we should build a new variable to uniquely identify each plot; it is easy, if we think that yield values taken for the same genotype in the same block must have been taken in the same plot.
# Refernce the plots
dataset$Plot <- dataset$Block:dataset$Genotype
Next, we can follow the usual rules to build the model (Piepho et al., 2004):
- Consider one single year and build the treatment model
- Consider one single year and build the block model
- Include the year factor into the model and combine the year with all the effects in the treatment model, by crossing or nesting as appropriate.
- Consider that the ‘plot’ factor in the block model references the randomisation units, i.e. those units which received the the genotypes by a randomisation process. Assign to this plot factor a random effect.
- Excluding the terms for randomisation units, nest the year in all the other terms in the block model.
- Combine random effects for randomisation units with the repeated factor, by using the colon operator, in order to derive the correct error terms to accommodate correlation structures.
The models at the different steps are as follows (with R notation):
- treatment model: YIELD ~ GENOTYPE
- block model: YIELD ~ BLOCK + BLOCK:PLOT
- treatment model: YIELD ~ GENOTYPE * YEAR
- block model: YIELD ~ BLOCK + (1|BLOCK:PLOT)
- block model: YIELD ~ BLOCK + BLOCK:YEAR + (1|BLOCK:PLOT)
- block model: YIELD ~ BLOCK + BLOCK:YEAR + (1|BLOCK:PLOT) + (1|BLOCK:PLOT:YEAR)
For analogy with the winter wheat experiment we take the block, year and genotype effects as fixed (but you can change this), so that the final model is:
YIELD ~ BLOCK + BLOCK:YEAR + GENOTYPE * YEAR + (1|BLOCK:PLOT) + (1|BLOCK:PLOT:YEAR)
where the last term does not need to be fitted in R, as it is the residual term, that is fitted by default. The resulting analysis (with ‘lme’) is:
mod2 <- lme(Yield ~ Block + Block:Year + Genotype*Year,
random = ~1|Plot,
data = dataset)
anova(mod2)
## numDF denDF F-value p-value
## (Intercept) 1 137 3449.090 <.0001
## Block 3 57 0.272 0.8450
## Genotype 19 57 0.586 0.9009
## Year 1 137 102.171 <.0001
## Block:Year 3 137 0.008 0.9990
## Year:Genotype 19 137 0.078 1.0000
emmeans(mod2, ~Genotype)
## Genotype emmean SE df lower.CL upper.CL
## 4cascine 11.59 0.883 57 9.82 13.4
## casalina 12.46 0.883 57 10.69 14.2
## classe 11.59 0.883 57 9.83 13.4
## costanza 9.89 0.883 57 8.13 11.7
## dimitra 11.75 0.883 57 9.99 13.5
## FDL0101 12.22 0.883 57 10.45 14.0
## garisenda 11.76 0.883 57 9.99 13.5
## LaBellaCampagnola 12.17 0.883 57 10.40 13.9
## LaTorre 11.36 0.883 57 9.59 13.1
## linfa 11.23 0.883 57 9.46 13.0
## marina 11.76 0.883 57 10.00 13.5
## Palladiana 10.89 0.883 57 9.12 12.7
## PicenaGR 12.12 0.883 57 10.35 13.9
## PR56S82 11.56 0.883 57 9.79 13.3
## PR57Q53 11.70 0.883 57 9.93 13.5
## prosementi 11.79 0.883 57 10.02 13.6
## RivieraVicentina 9.98 0.883 57 8.21 11.8
## robot 12.11 0.883 57 10.34 13.9
## Selene 12.11 0.883 57 10.34 13.9
## Zenith 11.94 0.883 57 10.17 13.7
##
## Results are averaged over the levels of: Block
## Degrees-of-freedom method: containment
## Confidence level used: 0.95
If we compare the above analyses with a ‘naive’ (wrong) analysis that neglects the repeated measures, we see big differences and, especially, we see that the SE for genotype means is much higher in the correct analysis.
mod3 <- lm(Yield ~ Year/Block + Genotype*Year,
data = dataset)
anova(mod3)
## Analysis of Variance Table
##
## Response: Yield
## Df Sum Sq Mean Sq F value Pr(>F)
## Year 1 956.50 956.50 103.7377 <2e-16 ***
## Genotype 19 104.27 5.49 0.5952 0.9072
## Year:Block 3 7.65 2.55 0.2766 0.8422
## Year:Genotype 19 13.95 0.73 0.0796 1.0000
## Residuals 197 1816.41 9.22
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(mod3, ~Genotype)
## Genotype emmean SE df lower.CL upper.CL
## 4cascine 11.59 0.877 197 9.86 13.3
## casalina 12.46 0.877 197 10.73 14.2
## classe 11.59 0.877 197 9.87 13.3
## costanza 9.89 0.877 197 8.17 11.6
## dimitra 11.75 0.877 197 10.03 13.5
## FDL0101 12.22 0.877 197 10.49 14.0
## garisenda 11.76 0.877 197 10.03 13.5
## LaBellaCampagnola 12.17 0.877 197 10.44 13.9
## LaTorre 11.36 0.877 197 9.63 13.1
## linfa 11.23 0.877 197 9.50 13.0
## marina 11.76 0.877 197 10.04 13.5
## Palladiana 10.89 0.877 197 9.16 12.6
## PicenaGR 12.12 0.877 197 10.39 13.8
## PR56S82 11.56 0.877 197 9.83 13.3
## PR57Q53 11.70 0.877 197 9.97 13.4
## prosementi 11.79 0.877 197 10.06 13.5
## RivieraVicentina 9.98 0.877 197 8.25 11.7
## robot 12.11 0.877 197 10.38 13.8
## Selene 12.11 0.877 197 10.38 13.8
## Zenith 11.94 0.877 197 10.21 13.7
##
## Results are averaged over the levels of: Block
## Confidence level used: 0.95
I just want to conclude by saying that the above mixed model analyses regards the design as a sort of ‘split-plot in time’ and it is not necessarily correct, as it assumes that the within-plot correlation is the same for all pairs of observations, regardless of their distance in time. Further analyses might be necessary to assess whether serial correlation structures are necessary. But we’ll consider this in a future post.
Thanks for reading and happy coding! And … don’t forget to check out my new book!
Prof. Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
Send comments to: andrea.onofri@unipg.it
Reference
- Piepho, H.-P., Büchse, A., Richter, C., 2004. A Mixed Modelling Approach for Randomized Experiments with Repeated Measures. Journal of Agronomy and Crop Science 190, 230–247.
