Subsampling in field experiments

Andrea Onofri ·  Added on March 29, 2023 ·  11 min read


Subsampling is very common in field experiments in agriculture. It happens when we collect several random samples from each plot and we submit them to some sort of measurement process. Some examples? Let’s imagine that we have randomised field experiments with three replicates and, either,:

  1. we collect the whole grain yield in each plot, select four subsamples and measure, in each subsample, the oil content or some other relevant chemical property, or
  2. we collect, from each plot, four plants and measure their heights, or
  3. we collect a representative soil sample from each plot and perform chemical analyses in triplicate.

For all the above examples, we end up with 3 by 4 equal 12 data for each treatment level. The question is: do we have 12 replicates? This is exactly the point: subsamples should never be mistaken for true-replicates, as the experimental treatments were not independently allocated to each one of them. In literature, subsamples are usually known as sub-replicates or pseudo-replicates: for the above examples, we have three true-replicates and four pseudo-replicates per true-replicate. Let’s see how to handle pseudo-replicates in data analysis. But, first of all, do not forget that: experiments with pseudo-replicates are valid only when we also have true-replicates! If we only have pseudo-replicates… well, there is nothing we can do in data analysis that transforms our experiment into a valid one…

A motivating example

Let’s consider a dataset from an experiment where we had 30 genotypes in three blocks and recorded the Weight of Thousand Kernels (TKW) in three subsamples per plot, which were labelled by using the ‘Sample’ variable. In the box below, we load the data and all the necessary packages.

rm(list=ls())
library(tidyverse)
library(nlme)
library(emmeans)

filePath <- "https://www.casaonofri.it/_datasets/TKW.csv"
TKW <- read.csv(filePath)
TKW <- TKW %>% 
  mutate(across(1:4, .fns = factor))
head(TKW)
##   Plot Block  Genotype Sample  TKW
## 1    1     1 Meridiano      1 28.6
## 2    2     1     Solex      1 33.3
## 3    3     1  Liberdur      1 22.3
## 4    4     1  Virgilio      1 28.1
## 5    5     1   PR22D40      1 26.7
## 6    6     1    Ciccio      1 34.2

The wrong analysis

A ‘naive’ analysis would be to perform an ANOVA on all data, without making any distinction between true-replicates and sub-replicates. Let’s do this by using the code shown in the box below.

# Naive analysis
mod <- lm(TKW ~ Block + Genotype, data=TKW)
anova(mod)
## Analysis of Variance Table
## 
## Response: TKW
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Block       2  110.3  55.169   7.510 0.0006875 ***
## Genotype   29 7224.7 249.129  33.913 < 2.2e-16 ***
## Residuals 238 1748.4   7.346                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
muG <- emmeans(mod, ~Genotype)
muG
##  Genotype   emmean    SE  df lower.CL upper.CL
##  Achille      34.7 0.903 238     32.9     36.4
##  Alemanno     41.6 0.903 238     39.8     43.4
##  AncoMarzio   32.7 0.903 238     31.0     34.5
##  Arnacoris    24.5 0.903 238     22.7     26.3
##  Casanova     38.6 0.903 238     36.9     40.4
##  Chiara       30.0 0.903 238     28.2     31.8
##  Ciccio       33.5 0.903 238     31.7     35.2
##  Ciclope      37.4 0.903 238     35.6     39.1
##  Claudio      40.7 0.903 238     38.9     42.5
##  Creso        37.8 0.903 238     36.0     39.6
##  Dario        30.9 0.903 238     29.1     32.7
##  Duilio       32.1 0.903 238     30.3     33.9
##  Dylan        36.5 0.903 238     34.8     38.3
##  Imhotep      34.8 0.903 238     33.1     36.6
##  Iride        30.8 0.903 238     29.0     32.6
##  Isildur      26.0 0.903 238     24.2     27.8
##  K26          33.5 0.903 238     31.7     35.2
##  Latinur      37.5 0.903 238     35.7     39.3
##  Liberdur     22.8 0.903 238     21.1     24.6
##  Meridiano    30.8 0.903 238     29.0     32.6
##  Neolatino    36.9 0.903 238     35.1     38.7
##  Normanno     36.8 0.903 238     35.0     38.6
##  PR22D40      26.2 0.903 238     24.4     28.0
##  PR22D89      44.0 0.903 238     42.3     45.8
##  Principe     37.2 0.903 238     35.4     39.0
##  Saragolla    35.9 0.903 238     34.1     37.6
##  Sfinge       40.5 0.903 238     38.8     42.3
##  Simeto       41.1 0.903 238     39.3     42.9
##  Solex        34.6 0.903 238     32.9     36.4
##  Virgilio     29.5 0.903 238     27.7     31.2
## 
## Results are averaged over the levels of: Block 
## Confidence level used: 0.95
pairwise <- as.data.frame(pairs(emmeans(mod, ~Genotype)))
sum(pairwise$p.value < 0.05)
## [1] 225

We see that the SE for genotype means is 0.903, the F test for the genotypes is highly significant and there are 225 significant pairwise comparisons among the 30 genotypes.

As we said, this is simple, but it is also terribly wrong. By putting true-replicates and pseudo-replicates on an equal footing, we have forgotten that the 270 observations are grouped by plot and that the observations in the same plot are more alike than the observations in different plots, because they share the same ‘origin’. We say that the observations in each plot are correlated and, therefore, the basic assumption of independence of residuals is broken. Our analysis is invalid and our manuscript can be very likely rejected.

But, why are the editors so critical when we mistake pseudo-replicates for true-replicates? We’ll see this in a few minutes.

The correct way to go

A fully correct model for our dataset is:

\[ y_{ijks} = \mu + \alpha_i + \beta_j + \gamma_{k} + \varepsilon_{s}\]

where \(y\) is the thousand kernel weight for the ith genotype, jth block, kth plot and sth subsample, \(\alpha\) is the effect of the ith genotype, \(\beta\) is the effect of the jth block, \(\gamma\) is the effect of the the kth plot and \(\varepsilon\) is the effect of the sth subsample. The presence of the \(\gamma\) element accounts for the effects of plots as a grouping factor and restores the independence of model residuals.

Obviously, the difference between plots (for a given genotype and block) must be regarded as a random effect, as well as the difference between subsamples, within each plot. Indeed. we have two random effects and, therefore, this is a mixed model. These two random effects are assumed to be normal, independent from each other, with mean equal to 0 and variances respectively equal to \(\sigma^2_p\) and \(\sigma^2_e\). (BTW: I am regarding the block effect as fixed! You may not agree, but this does not change what I am going to say later…).

We can fit this mixed model by using the lme() function in the nlme package, including the ‘Plot’ (i.e. the grouping factor) as a random effect. Obviously, we need to have a variable in the dataset that uniquely identifies each plot.

# Mixed model fit
mod.mix <- lme(TKW ~ Block + Genotype, 
               random = ~1|Plot, data=TKW)
anova(mod.mix)
##             numDF denDF   F-value p-value
## (Intercept)     1   180 11563.536  <.0001
## Block           2    58     2.005  0.1439
## Genotype       29    58     9.052  <.0001
emmeans(mod.mix, ~Genotype)
##  Genotype   emmean   SE df lower.CL upper.CL
##  Achille      34.7 1.75 58     31.2     38.2
##  Alemanno     41.6 1.75 58     38.1     45.1
##  AncoMarzio   32.7 1.75 58     29.2     36.2
##  Arnacoris    24.5 1.75 58     21.0     28.0
##  Casanova     38.6 1.75 58     35.1     42.1
##  Chiara       30.0 1.75 58     26.5     33.5
##  Ciccio       33.5 1.75 58     30.0     37.0
##  Ciclope      37.4 1.75 58     33.9     40.9
##  Claudio      40.7 1.75 58     37.2     44.2
##  Creso        37.8 1.75 58     34.3     41.3
##  Dario        30.9 1.75 58     27.4     34.4
##  Duilio       32.1 1.75 58     28.6     35.6
##  Dylan        36.5 1.75 58     33.0     40.0
##  Imhotep      34.8 1.75 58     31.3     38.3
##  Iride        30.8 1.75 58     27.3     34.3
##  Isildur      26.0 1.75 58     22.5     29.5
##  K26          33.5 1.75 58     30.0     37.0
##  Latinur      37.5 1.75 58     34.0     41.0
##  Liberdur     22.8 1.75 58     19.3     26.3
##  Meridiano    30.8 1.75 58     27.3     34.3
##  Neolatino    36.9 1.75 58     33.4     40.4
##  Normanno     36.8 1.75 58     33.3     40.3
##  PR22D40      26.2 1.75 58     22.7     29.7
##  PR22D89      44.0 1.75 58     40.5     47.5
##  Principe     37.2 1.75 58     33.7     40.7
##  Saragolla    35.9 1.75 58     32.4     39.4
##  Sfinge       40.5 1.75 58     37.0     44.0
##  Simeto       41.1 1.75 58     37.6     44.6
##  Solex        34.6 1.75 58     31.1     38.1
##  Virgilio     29.5 1.75 58     26.0     33.0
## 
## Results are averaged over the levels of: Block 
## Degrees-of-freedom method: containment 
## Confidence level used: 0.95
pairwise <- as.data.frame(pairs(emmeans(mod.mix, ~Genotype)))
sum(pairwise$p.value < 0.05)
## [1] 91

We see several differences with respect to the previous fit:

  1. in the ‘naive’ model, the error term for the genotype effect (MSe) is 7.346 with 238 DF and it represents the whole subsample-to-subsample variability. In the mixed model, the error term for the genotype effect is not shown, but it has only 58 degrees of freedom.
  2. The SE for genotype means is 1.75 and it is much higher than that from the ‘naive’ fit.
  3. The number of significant pairwise comparisons between genotypes has dropped to 91.

Let’s concentrate on the correct error term for the genotype effect; we know that the error term must be obtained by a comparison of plots treated alike (see Fisher, 1937. The design of experiments). From this, it is immediately clear that the MSe from te ‘naive’ analysis compares the subsamples treated alike (and not the plots). Now, if we take the ‘naive’ analysis and include the ‘plot’ among the experimental factors, we get:

mod2 <- lm(TKW ~ Block + Genotype + Plot, data=TKW)
anova(mod2)
## Analysis of Variance Table
## 
## Response: TKW
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Block       2  110.3  55.169  65.269 < 2.2e-16 ***
## Genotype   29 7224.7 249.129 294.736 < 2.2e-16 ***
## Plot       58 1596.2  27.521  32.559 < 2.2e-16 ***
## Residuals 180  152.1   0.845                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the ‘plot’ effect, once the genotype and block effects have been removed, takes 58 DFs and leave us with 180 DFs in the residuals. Indeed, we have decomposed the MSe from the ‘naive’ analysis in two parts, one measuring the plot-to-plot variability and the other one measuring the subsample-to-subsample variability, within each plot. The MS for this latter element is equal to \(\sigma_e\) from mixed model analysis, while the MS for the former element is the correct error term to test for the genotype effect, because it compares the plots treated alike. It is equal to \(3 \times \sigma^2_p + \sigma^2_e\) and it should be used to calculate the SEs for genotype means, as \(\sqrt{27.521/9} = 1.748689\) and not as \(\sqrt{7.346/9} = 0.90345\).

We can now understand why the editors reject our manuscripts if we do not analyse the data properly: we may strongly overestimate the precision of our experiment and, consequently, commit a lot of false positive errors!

A simpler alternative

In presence of subsampling, we strongly recommend the previous method of data analysis. But, a simpler alternative exists, which is feasible whenever the number of subsamples is the same for all plots: we can proceed in two-steps. In the first step, we calculate the means of subsamples for each plot and, in the second step, we submit the plot means to ANOVA, by considering the genotypes and the blocks as fixed factors.

# First step
TKWm <- aggregate(TKW ~ Block + Genotype, data = TKW, mean)

#Second step
mod2step <- lm(TKW ~ Genotype + Block, data = TKWm)
anova(mod2step)
## Analysis of Variance Table
## 
## Response: TKW
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Genotype  29 2408.24  83.043  9.0522 9.943e-13 ***
## Block      2   36.78  18.390  2.0046    0.1439    
## Residuals 58  532.08   9.174                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(mod2step, ~Genotype)
##  Genotype   emmean   SE df lower.CL upper.CL
##  Achille      34.7 1.75 58     31.2     38.2
##  Alemanno     41.6 1.75 58     38.1     45.1
##  AncoMarzio   32.7 1.75 58     29.2     36.2
##  Arnacoris    24.5 1.75 58     21.0     28.0
##  Casanova     38.6 1.75 58     35.1     42.1
##  Chiara       30.0 1.75 58     26.5     33.5
##  Ciccio       33.5 1.75 58     30.0     37.0
##  Ciclope      37.4 1.75 58     33.9     40.9
##  Claudio      40.7 1.75 58     37.2     44.2
##  Creso        37.8 1.75 58     34.3     41.3
##  Dario        30.9 1.75 58     27.4     34.4
##  Duilio       32.1 1.75 58     28.6     35.6
##  Dylan        36.5 1.75 58     33.0     40.0
##  Imhotep      34.8 1.75 58     31.3     38.3
##  Iride        30.8 1.75 58     27.3     34.3
##  Isildur      26.0 1.75 58     22.5     29.5
##  K26          33.5 1.75 58     30.0     37.0
##  Latinur      37.5 1.75 58     34.0     41.0
##  Liberdur     22.8 1.75 58     19.3     26.3
##  Meridiano    30.8 1.75 58     27.3     34.3
##  Neolatino    36.9 1.75 58     33.4     40.4
##  Normanno     36.8 1.75 58     33.3     40.3
##  PR22D40      26.2 1.75 58     22.7     29.7
##  PR22D89      44.0 1.75 58     40.5     47.5
##  Principe     37.2 1.75 58     33.7     40.7
##  Saragolla    35.9 1.75 58     32.4     39.4
##  Sfinge       40.5 1.75 58     37.0     44.0
##  Simeto       41.1 1.75 58     37.6     44.6
##  Solex        34.6 1.75 58     31.1     38.1
##  Virgilio     29.5 1.75 58     26.0     33.0
## 
## Results are averaged over the levels of: Block 
## Confidence level used: 0.95
pairwise <- as.data.frame(pairs(emmeans(mod2step, ~Genotype)))
sum(pairwise$p.value < 0.05)
## [1] 91

We see that the results are totally the same as with a mixed model fit, although all Mean Squares in ANOVA are fractions of those obtained by mixed model analysis.

Please, remember that this simple solution is only feasible when we have the same number of subsamples per each plot.

Thanks for reading and happy coding!

Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
andrea.onofri@unipg.it


References

  1. Fisher, R.A., 1937. The design of experiments, 2nd ed. Oliver and Boyd, Edinburgh, UK.
  2. Hurlbert, S.H., 1984. Pseudoreplication and the design of ecological experiments. Ecological Monographs 54, 187–211.