Introduction

Seed germination data are characterised by a peculiar source of uncertainty, that is know as ‘censoring’. Due to the monitoring schedule, we are not able to record the exact germination time of each seed in the lot; we are only able to record that germination took place within a certain interval of time. If you would like to know more about censoring, you can find further information in this page here. In that page, we have also shown that neglecting censoring during data analyses may lead to unreliable results. The question is: “how do we account for censoring?”.

There is indeed a class of models that are specifically devised to account for all types of censoring. These models have been largely used in medicine to describe the time to death or other events of interest. This body of methods usually goes under the name of survival analyses, but we’d preferably talk about time-to-events methods. We have extensively talked about these models and how they account for censoring in two of our recent papers and related appendices (Onofri et al., 2018, 2019). Therefore, we will not go into detail, now. We think that the readers want to be mainly convinced that fitting these models is as easy as fitting traditional models, with a clear gain in terms of reliability and efficiency.

A Monte Carlo simulation

In our previous page about censoring we ideally took a seed population, with log-normal distribution of germination time, location parameter \(e = 4.5\) days, scale parameter \(b = 0.6\) and a maximum germinated proportion \(d = 0.85\). For this population, the time to 50% germination (T50) was 5.14 days and the Mean Germination Time (MGT) was 5.39 days. We took a sample of 100 seeds from that population, by using Monte Carlo simulation. For the ease of reading, we replicate the simulation here.

#Real parameter values
e <- 4.5; b <- 0.6; d = 0.85

#Monte Carlo simulation - Step 1
#Simulating the number of germinated seeds
#out of the 100 tesetd
set.seed(1234)
nGerm <- rbinom(1, 100, d)
nGerm
## [1] 89
#Monte Carlo simulation - Step 2
#Simulating the germination time for each seed in the sample
Gtimes <- rlnorm(nGerm, log(e), b)
Gtimes <- c(Gtimes, rep(Inf, 100 - nGerm))
Gtimes
##   [1]  5.424863  5.434125  5.582579  2.903142  4.597512  4.815609 10.603760
##   [8]  8.118291  3.097525  2.901291  3.300505  1.574027  7.630397 10.237768
##  [15]  1.635063  3.088284  4.549728  6.870408  3.052210  7.576004  5.637602
##  [22]  5.420753  4.513539  4.399537  6.948064  3.340211  4.530872  4.526701
##  [29]  6.760117  8.346274  1.594181  1.198977  6.233769  4.558227  4.961047
##  [36]  9.479485  7.197573  4.631820  1.856768  5.844697  4.313697  4.810473
##  [43]  9.661802  3.346942  9.584485  9.722689  2.299752  4.579095  1.412326
##  [50]  6.645991  4.964877  2.740231  3.431359  2.289046  1.559058  2.091263
##  [57]  2.235598 10.176567  3.167262  7.411185 10.289998 11.576995  3.191594
##  [64]  4.522085  3.390061  5.527390  4.005285  4.286991  5.220449  8.824484
##  [71] 16.204791  4.375652  5.469491  4.971624  5.054734  2.880597  8.386469
##  [78] 17.489334 27.570654  4.890293  5.047498  2.859137  5.555705  1.878668
##  [85]  3.429052  5.112954  4.029451  6.863120  4.933304       Inf       Inf
##  [92]       Inf       Inf       Inf       Inf       Inf       Inf       Inf
##  [99]       Inf       Inf
#Monte carlo simulation - Step 3
#Simulating the observed counts
obsT <- seq(1, 15, by=1) #Observation schedule
table( cut(Gtimes, breaks = c(0, obsT)) )
## 
##   (0,1]   (1,2]   (2,3]   (3,4]   (4,5]   (5,6]   (6,7]   (7,8]   (8,9] 
##       0       8       9      11      22      13       6       4       4 
##  (9,10] (10,11] (11,12] (12,13] (13,14] (14,15] 
##       4       4       1       0       0       0
counts <- as.numeric( table( cut(Gtimes, breaks = c(0, obsT)) ) )

Time-to-event methods directly consider the observed counts as the response variable. As the independent variable, they consider both the extremes of each time interval (timeBef and timeAf; see below). The dataset is organised as follows:

timeBef <-  c(0, obsT)
timeAf <- c(obsT, Inf)
Counts <- c(counts, 100 - sum(counts))
df <- data.frame(timeBef, timeAf, Counts)
df
##    timeBef timeAf Counts
## 1        0      1      0
## 2        1      2      8
## 3        2      3      9
## 4        3      4     11
## 5        4      5     22
## 6        5      6     13
## 7        6      7      6
## 8        7      8      4
## 9        8      9      4
## 10       9     10      4
## 11      10     11      4
## 12      11     12      1
## 13      12     13      0
## 14      13     14      0
## 15      14     15      0
## 16      15    Inf     14

The organisation is clear: we see that we had, e.g., 8 seeds germinated between timeBef = 1 and timeAf = 2 and 9 seeds germinated between timeBef = 2 and timeAf = 3. Please, also note that there are 14 seeds with a germination time higher than 15 days (right censoring); they include both the dormant seeds (that were non-germinable) and the germinable seeds with a germination time higher than 15 days. It should be clear that the dataset above includes all forms of uncertainty due to censoring.

If we compare the model definition, nonlinear regression analysis would use the following statement:

CumulativeProportion ~ timeAf

On the other hand, a time-to-event model is defined as:

Count ~ timeBef + timeAf

In order to fit a time-to-event model with log-normal distribution of germination time, we use the fallowing call (please note the argument ‘type’, which is set to ‘event’):

library(drc)
#Time-to-event model
modTE <- drm(Counts ~ timeBef + timeAf, data = df, 
           fct = LN.3(), type = "event")
summary(modTE)
## 
## Model fitted: Log-normal with lower limit at 0 (3 parms)
## 
## Parameter estimates:
## 
##               Estimate Std. Error t-value   p-value    
## b:(Intercept) 1.883589   0.176355  10.681 < 2.2e-16 ***
## d:(Intercept) 0.871272   0.035926  24.252 < 2.2e-16 ***
## e:(Intercept) 4.596088   0.280083  16.410 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Nonlinear regression model
propCum <- I(cumsum(df$Counts)/100)
modNL <- drm(propCum ~ timeAf, data = df, 
           fct = LN.3(), subset=is.finite(df$timeAf)==T)
summary(modNL)
## 
## Model fitted: Log-normal with lower limit at 0 (3 parms)
## 
## Parameter estimates:
## 
##               Estimate Std. Error t-value   p-value    
## b:(Intercept) 1.956496   0.132081  14.813 4.486e-09 ***
## d:(Intercept) 0.877667   0.015404  56.977 5.039e-16 ***
## e:(Intercept) 4.703950   0.103217  45.573 8.060e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error:
## 
##  0.02302153 (12 degrees of freedom)

Please consider that the function ‘LN.3()’ is parameterised by using a scale parameter \(b\), that is the inverse of the value which we used for the simulatyion. Indeed, model fit returns an estimate of 1.9, which is approximately equal to \(1/0.6\), at the net of sampling error.

With respect to a nonlinear regression fit, the estimate of ‘e’ is very close, but the standard error is higher. Let’s compare the two methods with 1000 samples (be patient… it may take a while)

GermSampling <- function(nSeeds, timeLast, stepOss, mu, shape, pMax){
    
    #Draw a sample as above
    germ <- rbinom(1, nSeeds, d)
    Gtimes <- rlnorm(nGerm, log(e), b)
    Gtimes <- c(Gtimes, rep(Inf, 100 - nGerm))
    
    #Generate the observed data
    obsT <- seq(1, timeLast, by=stepOss) #Observation schedule (by 5 o 2)
    counts <- as.numeric( table( cut(Gtimes, breaks = c(0, obsT)) ) )
    propCum <- cumsum(counts)/nSeeds
    timeBef <- c(0, obsT)
    timeAf <- c(obsT, Inf)
    counts <- c(counts, nSeeds - sum(counts))
    
    #Calculate the T50 with two methods
    mod <- drm(propCum ~ obsT, fct = LN.3() )
    modTE <- drm(counts ~ timeBef + timeAf, 
           fct = LN.3(), type = "event")
    c(T501 = summary(mod)[[3]][3,1],
         ES1 = summary(mod)[[3]][3,2],
         T502 = summary(modTE)[[3]][3,1],
         ES2 = summary(modTE)[[3]][3,2] )
}

set.seed(1234)
result <- data.frame()
for (i in 1:1000) {
  res <- GermSampling(100, 15, 1, 4.5, 0.6, 0.85)
  result <- rbind(result, res)
} 
names(result) <- c("T501", "ES1", "T502", "ES2")
apply(result, 2, mean)
##       T501        ES1       T502        ES2 
## 4.52201859 0.09017022 4.53786873 0.32857552
apply(result, 2, sd)
##       T501        ES1       T502        ES2 
## 0.34444186 0.03964712 0.34650017 0.06879141

We see that both methods lead to unbiased estimates. However, the sampling distribution of T50 shows a standard deviation of approximately 0.34 days. This is estimated correctely by the time-to-event method, which, on average, leads to an estimate of 0.33 days. On the contrary, nonlinear regression leads to an average estimate of 0.09, that is almost one fourth of the real value.

#Take-home message

Censoring is peculiar to germination assays and other time-to-event studies. It may have a strong impact on the reliability of our estimates and standard errors and, therefore, it should never be neglected. The best way not to neglect it is to use time-to-event methods to estimate the germination curve, while avoiding traditional non-linear regression.

References

Onofri, A., Benincasa, P., Mesgaran, M.B., Ritz, C., 2018. Hydrothermal-time-to-event models for seed germination. European Journal of Agronomy 101, 129–139.

Onofri, A., Piepho, H.P., Kozak, M., 2019. Analysing censored data in agricultural research: A review with examples and software tips. Annals of Applied Biology.