Fixing the bridge between biologists and statisticians

Models are wrong... but, some are useful (G. Box)!

GGE analyses for multi-environment studies

Published at May 31, 2023 ·  12 min read

In a recent post, we saw that Principal Component Analysis (PCA) can be used to elucidate the ‘genotype by environment’ relationship (see this post). Whenever the starting point for PCA is the doubly-centered (centered by rows and columns) matrix of yields across environments, we talk about AMMI analysis, which is often used to get insight into the stability of genotype yields across environments.

By changing the starting matrix, we can obtain a different perspective and focus on the definition of macroenvironments and on the selection of winning genotypes. In particular, if the two-way matrix of yields across environments is only column-centered before PCA, we talk about GGE analysis (Yan et al., 2000). In spite of some academic debate (see Gauch, 2006; Yan et al., 2007; Gauch et al., 2008), AMMI and GGE analyses are both useful and can be used as complementary tools for the analysis of multi-environment genotype data.
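
As a rough illustration of how the two starting matrices differ, here is a minimal sketch in base R. The yield table is made up for the purpose (it is not taken from the post), and the object names are arbitrary; PCA is done with a plain singular value decomposition, which is only one of several ways to carry it out.

```r
# Hypothetical genotype-by-environment table of mean yields (made-up numbers)
yield <- matrix(c(6.1, 5.2, 7.0,
                  5.8, 6.0, 6.4,
                  4.9, 5.5, 7.3,
                  6.5, 4.8, 6.9),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("G", 1:4), paste0("E", 1:3)))

# GGE starting matrix: remove the environment (column) means only,
# so that the genotype main effect (G) stays in, together with GE
gge_mat <- sweep(yield, 2, colMeans(yield))

# AMMI starting matrix: remove the row means as well (double centering),
# so that only the GE interaction is left
ammi_mat <- gge_mat - rowMeans(gge_mat)

# PCA via singular value decomposition of each centered matrix
gge_svd  <- svd(gge_mat)
ammi_svd <- svd(ammi_mat)

# Share of variation captured by each principal component
round(gge_svd$d^2 / sum(gge_svd$d^2), 3)
round(ammi_svd$d^2 / sum(ammi_svd$d^2), 3)
```

The only difference between the two analyses in this sketch is the centering step: everything downstream (decomposition, biplots) works on whichever matrix is passed on.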


AMMI analyses for multi-environment studies

Published at May 26, 2023 ·  19 min read

Let’s return to a subject that is rather important for most agronomists, i.e. the selection of crop varieties. All farmers are perfectly aware that crop performance is affected both by the genotype and by the environment. These two effects are not purely additive and they often show a significant interaction. By this word, we mean that a genotype can perform particularly well or badly in some specific environmental situations, which we may not expect considering its average behaviour in other environmental conditions. The Genotype by Environment (GE) interaction may change the ranking of genotypes depending on the environment, and it may play a key role in varietal recommendation for a given mega-environment.


Repeated measures with perennial crops

Published at March 30, 2023 ·  8 min read

In this post, I want to discuss a concept that is often misunderstood by some of my colleagues. With all crops, we are used to repeating experiments across years to obtain multi-year data; the structure of the resulting dataset is always the same and it is exemplified in the box below, which refers to a multi-year genotype experiment with winter wheat.

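rm(list = ls())
library(dplyr)  # provides %>%, mutate() and across()
filePath <- ""
dataset <- read.csv(filePath)
# Convert Plot, Block, Genotype (columns 1-3) and Year (column 5) to factors
dataset <- dataset %>%
  mutate(across(c(1:3, 5), .fns = factor))
head(dataset)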
##   Plot Block Genotype Yield Year
## 1    2     1 COLOSSEO  6.73 1996
## 2  110     2 COLOSSEO  6.96 1996
## 3  181     3 COLOSSEO  5.35 1996
## 4    2     1 COLOSSEO  6.26 1997
## 5  110     2 COLOSSEO  7.01 1997
## 6  181     3 COLOSSEO  6.11 1997

We can see that we have a column for the blocks, a column for the experimental factor (the genotype, in this instance), a column for the year and a column for the response variable (the yield, in this instance).


Subsampling in field experiments

Published at March 29, 2023 ·  11 min read

Subsampling is very common in field experiments in agriculture. It happens when we collect several random samples from each plot and we submit them to some sort of measurement process. Some examples? Let’s imagine that we have a randomised field experiment with three replicates and that, alternatively:

  1. we collect the whole grain yield in each plot, select four subsamples and measure, in each subsample, the oil content or some other relevant chemical property, or
  2. we collect, from each plot, four plants and measure their heights, or
  3. we collect a representative soil sample from each plot and perform chemical analyses in triplicate.

For all the above examples, we end up with 3 × 4 = 12 observations for each treatment level. The question is: do we have 12 replicates? This is exactly the point: subsamples should never be mistaken for true-replicates, as the experimental treatments were not independently allocated to each one of them. In the literature, subsamples are usually known as sub-replicates or pseudo-replicates: for the above examples, we have three true-replicates and four pseudo-replicates per true-replicate. Let’s see how to handle pseudo-replicates in data analysis. But, first of all, do not forget that experiments with pseudo-replicates are valid only when we also have true-replicates! If we only have pseudo-replicates… well, there is nothing we can do in data analysis that transforms our experiment into a valid one…
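
To make the difference between true-replicates and pseudo-replicates concrete, here is a minimal sketch in base R with simulated data. The layout (three treatments, three plots per treatment, four subsamples per plot), the variable names and the variance components are all assumptions made for illustration. Averaging the subsamples within each plot is only one simple way to proceed and it is not necessarily the approach taken in the full post; a mixed model with a random plot effect is a common alternative.

```r
set.seed(1234)

# Hypothetical layout: 3 treatments x 3 plots (true-replicates) x 4 subsamples
dat <- expand.grid(Sub = 1:4, Plot = 1:3, Treat = c("A", "B", "C"))
dat$PlotID <- factor(paste(dat$Treat, dat$Plot, sep = "_"))

# Simulated response: plot-to-plot variation plus subsample-level noise
plot_eff <- rnorm(nlevels(dat$PlotID), sd = 1)
dat$Y <- 10 + plot_eff[as.integer(dat$PlotID)] + rnorm(nrow(dat), sd = 0.5)

# Naive analysis: all 12 values per treatment are taken as replicates,
# which inflates the residual degrees of freedom
naive <- aov(Y ~ Treat, data = dat)
df.residual(naive)

# Plot-level analysis: average the subsamples, then analyse the 9 plot means
means <- aggregate(Y ~ Treat + PlotID, data = dat, FUN = mean)
plot_level <- aov(Y ~ Treat, data = means)
df.residual(plot_level)
```

The naive model works with 36 supposedly independent values, while the plot-level model works with the 9 plot means, so the error term reflects the variation among true-replicates.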


Fitting threshold models to seed germination data

Published at March 13, 2023 ·  19 min read

In previous posts we have shown that we can use time-to-event curves to describe the germination pattern of a seed population (see here). We have also shown that these curves can be modified to include the effects of external/internal factors/covariates, such as the genotype, the species, the humidity content and temperature in the substrate (see here and here). These modified time-to-event curves can be fitted in ‘one-step’, i.e., we start from the germination data with the appropriate shape (see here), fit the model and retrieve the estimates of model parameters (go here for an example).


Fitting thermal-time-models to seed germination data

Published at February 10, 2023 ·  7 min read

This is a follow-up post. If you are interested in other posts of this series, please go to: All these posts expand on a paper that we have recently published in the journal ‘Weed Science’; please follow this link to the paper.

A motivating example

Recently, we wanted to model the effect of temperature on seed germination for Hordeum vulgare and we made an assay with three replicated Petri dishes (50 seeds each) at 10 constant temperature levels (1, 3, 7, 10, 15, 20, 25, 30, 35, 40 °C). Germinated seeds were counted and removed daily for 10 days. This unpublished dataset is available as barley in the drcSeedGerm package, which needs to be installed from GitHub (see below), together with the drcte package for time-to-event model fitting. The following code loads the necessary packages, loads the dataset and shows the first six lines.


Fitting hydro-thermal-time-models to seed germination data

Published at February 10, 2023 ·  15 min read

This is a follow-up post. If you are interested in other posts of this series, please go to: All these posts expand on a paper that we have recently published in the journal ‘Weed Science’; please follow this link to the paper.

Germination assay

This dataset was obtained from previously published work (Mesgaran et al., 2017) with Hordeum spontaneum [C. Koch] Thell. The germination assay was conducted using four replicates of 20 seeds tested at six different water potential levels (0, −0.3, −0.6, −0.9, −1.2 and −1.5 MPa). Osmotic potentials were produced using variable amounts of polyethylene glycol (PEG, molecular weight 8000), adjusted for the temperature level. Petri dishes were incubated at six constant temperature levels (8, 12, 16, 20, 24 and 28 °C), under a photoperiod of 12 h. Germinated seeds (radicle protrusion > 3 mm) were counted and removed daily for 20 days.


The coefficient of determination: is it the R-squared or r-squared?

Published at November 26, 2022 ·  9 min read

We often use the coefficient of determination as a swift ‘measure’ of goodness of fit for our regression models. Unfortunately, there is no unique symbol for such a coefficient and both \(R^2\) and \(r^2\) are used in the literature, almost interchangeably. Such interchangeability is also endorsed by Wikipedia, where both symbols are reported as abbreviations for this statistical index.

As an editor of several international journals, I cannot agree with such an approach; indeed, the two symbols \(R^2\) and \(r^2\) mean two different things, and they are not necessarily interchangeable, because, depending on the setting, either of the two may be wrong or ambiguous. Let’s pay a little attention to this issue.
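
One conventional way to read the two symbols, which may not be exactly the argument developed in the full post, is that \(r\) is the Pearson correlation between two variables, while \(R\) is the multiple correlation coefficient, i.e. the correlation between the observed and the fitted values. The following base R sketch, with simulated data and arbitrary variable names, shows where the two indices coincide and where only \(R^2\) is defined.

```r
set.seed(42)

# Simulated data (assumed for illustration): one response, two predictors
x1 <- runif(30, 0, 10)
x2 <- runif(30, 0, 10)
y  <- 2 + 0.8 * x1 - 0.5 * x2 + rnorm(30)

# Simple linear regression with intercept: R^2 coincides with the
# squared Pearson correlation between x1 and y
fit1 <- lm(y ~ x1)
R2_simple <- summary(fit1)$r.squared
r2_simple <- cor(x1, y)^2
all.equal(R2_simple, r2_simple)

# Multiple regression: there is no single pairwise r to square;
# R^2 is the squared correlation between observed and fitted values
fit2 <- lm(y ~ x1 + x2)
R2_multi <- summary(fit2)$r.squared
all.equal(R2_multi, cor(y, fitted(fit2))^2)
```

With a single predictor the two numbers are the same, which is probably why the symbols get mixed up; as soon as there are more predictors, a squared pairwise correlation no longer describes the fit of the model.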


Multi-environment split-plot experiments

Published at September 13, 2022 ·  7 min read

Have you made a split-plot field experiment? Have you repeated such an experiment in two (or more) years/locations? Have you run into trouble, because the reviewer told you that your ANOVA model was invalid? If so, please, stop for a while and read: this post might help you understand what was wrong with your analyses.

Motivating example

Let’s think of a field experiment, where 6 genotypes of faba bean were compared under two different sowing times (autumn and spring). For ease of organisation, such an experiment was laid out as a split-plot, in a randomised complete block design; sowing times were randomly allocated to main-plots, while genotypes were randomly allocated to sub-plots. Let’s stop for a moment… does this sound strange to you? Do you need further information about split-plot designs? You can get some general information at this link and hints on how to analyse the results at this other link.
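
For readers who want to see what such a layout looks like as a dataset, here is a minimal base R sketch for one single experiment (not the multi-environment case). The number of blocks is not given above, so four blocks are assumed; the yields are simulated and the names are arbitrary. The classical `aov()` call with an `Error()` term is used here as one possible way to respect the two error strata; a mixed model would be another.

```r
set.seed(2022)

# Assumed: 4 blocks (not stated in the text), 2 sowing times, 6 genotypes
design <- expand.grid(Genotype = paste0("G", 1:6),
                      Sowing = c("autumn", "spring"),
                      Block = factor(1:4))

# A main-plot is one sowing time within one block (8 main-plots in all)
design$MainPlot <- interaction(design$Block, design$Sowing)

# Simulated yields: main-plot variation plus sub-plot noise
main_eff <- rnorm(nlevels(design$MainPlot), sd = 0.4)
design$Yield <- 3 + main_eff[as.integer(design$MainPlot)] +
  rnorm(nrow(design), sd = 0.2)

# Split-plot ANOVA: sowing time is tested against the main-plot error,
# genotype and the interaction against the sub-plot error
mod <- aov(Yield ~ Sowing * Genotype + Error(Block/Sowing), data = design)
summary(mod)
```

The key point of the sketch is the `Error(Block/Sowing)` term, which tells R that sowing times were randomised to larger units than genotypes.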


Meta-analysis for a single study. Is it possible?

Published at July 21, 2022 ·  12 min read

We all know that the word meta-analysis encompasses a body of statistical techniques to combine quantitative evidence from several independent studies. However, I have recently discovered that meta-analytic methods can also be used to analyse the results of a single research project. That happened a few months ago, when I was reading a paper by Damesa et al. (2017), where the authors describe some interesting methods of data analysis for multi-environment genotype experiments. These authors gave a few nice examples with related SAS code, which is rooted in mixed models. As an R enthusiast, I wanted to reproduce their analyses with R, but I could not succeed, until I realised that I could make use of the package ‘metafor’ and its bunch of meta-analytic methods.
