Archive 2019

How do we combine errors? The linear case

Published on November 22, 2019 ·  7 min read

In our research work, we usually fit models to experimental data. Our aim is to estimate some biologically relevant parameters, together with their standard errors. Very often, these parameters are interesting in themselves, as they represent means, differences, rates or other important descriptors. In other cases, we use those estimates to derive further indices, by way of some appropriate calculations. For example, imagine that we have two parameter estimates, say Q and W, with standard errors equal to \(\sigma_Q\) and \(\sigma_W\), respectively: it might be relevant to calculate the quantity:

...
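
As a minimal sketch of the kind of calculation involved (all numbers below are purely hypothetical, and the two estimates are assumed to be independent), such a combination and its standard error can be computed by hand in R:

# Hypothetical estimates and standard errors
Q <- 10.2; sigmaQ <- 0.5
W <- 3.1;  sigmaW <- 0.2

# A linear combination Z = A*Q + B*W + C and its standard error,
# assuming zero covariance between Q and W
A <- 2; B <- -1; C <- 5
Z <- A * Q + B * W + C
sigmaZ <- sqrt(A^2 * sigmaQ^2 + B^2 * sigmaW^2)
Z; sigmaZ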


Fitting 'complex' mixed models with 'nlme': Example #2

Published on September 13, 2019 ·  9 min read

A repeated split-plot experiment with heteroscedastic errors

Let’s imagine a field experiment where different genotypes of khorasan wheat are to be compared under different nitrogen (N) fertilisation systems. Genotypes require bigger plots than fertilisation treatments and, therefore, the most convenient choice would be to lay out the experiment as a split-plot, in a randomised complete block design. Genotypes would be randomly allocated to main plots, while fertilisation systems would be randomly allocated to sub-plots. As usual in agricultural research, the experiment should be repeated in different years, in order to explore the environmental variability of the results.
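
As a hedged sketch of how such a model might be coded with the ‘nlme’ package (the data frame ‘dat’ and all variable names below are hypothetical), the split-plot randomisation units enter as nested random effects, while a ‘varIdent’ structure allows for year-specific residual variances:

library(nlme)
# Hypothetical data: Yield, Genotype, N, Year, Block, MainPlot
mod <- lme(Yield ~ Genotype * N,
           random = ~ 1 | Year/Block/MainPlot,    # randomisation units
           weights = varIdent(form = ~ 1 | Year), # heteroscedastic errors across years
           data = dat)
summary(mod)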

...


Testing for interactions in nonlinear regression

Published on September 13, 2019 ·  11 min read

Fitting ‘complex’ mixed models with ‘nlme’: Example #4

Factorial experiments are very common in agriculture and they are usually laid down to test for the significance of interactions between experimental factors. For example, genotype assessments may be performed at two different nitrogen fertilisation levels (e.g. high and low) to understand whether the ranking of genotypes depends on nutrient availability. For those of you who are not very much into agriculture, I will only say that such an assessment is relevant, because we need to know whether we can recommend the same genotypes, e.g., both in conventional agriculture (high nitrogen availability) and in organic agriculture (relatively lower nitrogen availability).

...


Fitting 'complex' mixed models with 'nlme': Example #1

Published on August 20, 2019 ·  9 min read

The environmental variance model

Fitting mixed models has become very common in biology and recent developments involve the manipulation of the variance-covariance matrix for random effects and residuals. To the best of my knowledge, within the frame of frequentist methods, the only free solution in R is based on the ‘nlme’ package, as the ‘lme4’ package does not easily permit such manipulations. The ‘nlme’ package is fully described in Pinheiro and Bates (2000). Of course, the ‘asreml’ package can also be used, but, unfortunately, it is not free.
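
Just to give the flavour of such manipulations (a bare-bones sketch with hypothetical names, not the exact model fitted in the post), ‘nlme’ lets us specify, for example, environment-specific residual variances:

library(nlme)
# Hypothetical data: Yield, Genotype, Environment
mod <- gls(Yield ~ Genotype * Environment,
           weights = varIdent(form = ~ 1 | Environment), # one variance per environment
           data = dat)
summary(mod)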

...


Germination data and time-to-event methods: comparing germination curves

Published on July 20, 2019 ·  11 min read

Very often, seed scientists need to compare the germination behaviour of different seed populations, e.g., different plant species, or one single plant species subjected to different temperatures, light conditions, priming treatments and so on. How should such a comparison be performed?

Let’s take a practical approach and start from an appropriate example: a few years ago, some colleagues studied the germination behaviour of seeds of a plant species (Verbascum arcturus, BTW…), in different conditions. In detail, they considered the factorial combination of two storage periods (LONG and SHORT storage) and two temperature regimes (FIX: constant daily temperature of 20°C; ALT: alternating daily temperature regime, with 25°C during daytime and 15°C during night time, with a 12:12h photoperiod). If you are a seed scientist and are interested in this experiment, you’ll find the details in Catara et al. (2016).
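
As a hedged sketch (variable names are hypothetical, and this is not necessarily the exact code behind the paper), such germination curves can be fitted as time-to-event models with the ‘drc’ package, where each row gives the number of seeds germinated within a scoring interval:

library(drc)
# Hypothetical data: 'count' = seeds germinated between 'timeBef' and 'timeAf',
# 'Treat' = combination of storage period and temperature regime
mod <- drm(count ~ timeBef + timeAf, curveid = Treat,
           data = germ, fct = LL.3(), type = "event")
summary(mod)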

...


Survival analysis and germination data: an overlooked connection

Published on July 2, 2019 ·  16 min read

The background

Seed germination data describe the time until an event of interest occurs. In this sense, they are very similar to survival data, apart from the fact that we deal with a different (and less sad) event: germination instead of death. But seed germination data are also similar to failure-time data, phenological data, time-to-remission data… The first point is: germination data are time-to-event data.

You may wonder: what’s the matter with time-to-event data? Do they have anything special? With few exceptions, all time-to-event data are affected by a certain form of uncertainty, which takes the name of ‘censoring’. It relates to the fact that the exact time of the event may not be precisely known. I think it is good to give an example.
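
For instance (an invented toy case): with daily monitoring, a seed scored as germinated at day 3 actually germinated somewhere between day 2 and day 3. Such interval-censored observations can be encoded, e.g., with the ‘survival’ package:

library(survival)
# Toy data: each seed germinated between 'low' and 'up' (days);
# up = NA marks seeds that never germinated during the assay (right-censored)
low <- c(2, 2, 3, 5, 14)
up  <- c(3, 3, 4, 6, NA)
Surv(time = low, time2 = up, type = "interval2")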

...


Stabilising transformations: how do I present my results?

Published on June 15, 2019 ·  5 min read

ANOVA is routinely used in applied biology for data analyses, although, in some instances, the basic assumptions of normality and homoscedasticity of residuals do not hold. In those instances, most biologists would be inclined to adopt some sort of stabilising transformation (logarithm, square root, arcsin square root…) prior to ANOVA. Yes, there might be more advanced and elegant solutions, but stabilising transformations are suggested in most traditional biometry books; they are very straightforward to apply and they do not require any specific statistical software. I do not think that this traditional technique should be underrated.
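
As a quick sketch of one common workflow (data frame and variable names below are invented), we can fit the ANOVA on the transformed scale and then back-transform the means for presentation, e.g., with the ‘emmeans’ package:

library(emmeans)
# Hypothetical data: response 'y', treatment factor 'Treat'
mod <- lm(log(y) ~ Treat, data = dat)
anova(mod)                                # ANOVA on the transformed scale
emmeans(mod, ~ Treat, type = "response")  # back-transformed means for reporting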

...


Genotype experiments: fitting a stability variance model with R

Published on June 6, 2019 ·  8 min read

Yield stability is a fundamental aspect in the selection of crop genotypes. The definition of stability is rather complex (see, for example, Annicchiarico, 2002); in simple terms, the yield is stable when it does not change much from one environment to the other. It is an important trait that helps farmers to maintain a good income in most years.

Agronomists and plant breeders are continuously concerned with the assessment of genotype stability; this is accomplished by planning genotype experiments, where a number of genotypes are compared in randomised complete block designs, with three to five replicates. These experiments are repeated in several years and/or several locations, in order to measure how the environment influences yield level and the ranking of genotypes.
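
One possible sketch in R (hypothetical names; a simplification, not necessarily the exact parameterisation adopted in the post) lets each genotype carry its own variance across environments through a ‘varIdent’ structure in ‘nlme’:

library(nlme)
# Hypothetical data: Yield, Genotype, Environment, Block
mod <- lme(Yield ~ Genotype,
           random = ~ 1 | Environment/Block,
           weights = varIdent(form = ~ 1 | Genotype), # genotype-specific variances
           data = dat)
summary(mod)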

...


How do we combine errors, in biology? The delta method

Published on May 25, 2019 ·  7 min read

In a recent post I have shown that we can build linear combinations of model parameters (see here). For example, if we have two parameter estimates, say Q and W, with standard errors equal to \(\sigma_Q\) and \(\sigma_W\), respectively, we can build a linear combination as follows:

\[ Z = AQ + BW + C\]

where A, B and C are three coefficients. The standard error for this combination can be obtained as:
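
\[ \sigma_Z = \sqrt{A^2 \sigma_Q^2 + B^2 \sigma_W^2 + 2 A B \, \sigma_{QW}} \]

where \(\sigma_{QW}\) is the covariance between Q and W (the term vanishes when the two estimates are independent). This is the standard result for linear combinations of random variables.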

...


Dealing with correlation in designed field experiments: part II

Published on May 10, 2019 ·  12 min read

With field experiments, studying the correlation between the observed traits may not be an easy task. Indeed, in these experiments, subjects are not independent, but they are grouped by treatment factors (e.g., genotypes or weed control methods) or by blocking factors (e.g., blocks, plots, main-plots). I have dealt with this problem in a previous post and I gave a solution based on traditional methods of data analysis.

In a recent paper, Piepho (2018) proposed a more advanced solution based on mixed models. He presented four example datasets and gave SAS code to analyse them, based on PROC MIXED. I was very interested in those examples, but I do not use SAS. Therefore, I tried to ‘transport’ the models into R, which turned out to be a difficult task. After struggling for a while with several mixed model packages, I came to an acceptable solution, which I would like to share.

...


Dealing with correlation in designed field experiments: part I

Published on April 30, 2019 ·  7 min read

Observations are grouped

When we have recorded two traits in different subjects, we can be interested in describing their joint variability, by using Pearson’s correlation coefficient. That’s ok, although we have to respect some basic assumptions (e.g. linearity) that have been detailed elsewhere (see here). Problems may arise when we need to test the hypothesis that the correlation coefficient is equal to 0. In this case, we need to make sure that all pairs of observations are taken on independent subjects.
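
For reference, the standard test (toy data below) is readily available in R, but it does assume independent subjects:

# Toy data: two traits measured on ten independent subjects
set.seed(42)
x <- rnorm(10)
y <- 0.5 * x + rnorm(10)
cor.test(x, y)  # tests H0: rho = 0; valid only with independent observations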

...


Some everyday data tasks: a few hints with R

Published on March 27, 2019 ·  9 min read

We all work with data frames and it is important to know how to reshape them to meet our needs. I think that there are at least four routine tasks that we need to be able to accomplish:

  1. subsetting
  2. sorting
  3. casting
  4. melting

Obviously, there is a wide array of possibilities; I’ll just mention a few, which I regularly use.

Subsetting the data

Subsetting means selecting the records (rows) or the variables (columns) which satisfy certain criteria. Let’s take the ‘students.csv’ dataset, which is available on one of my repositories. It is a database of students’ marks in a series of exams for different subjects.
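
For instance (assuming a local copy of the file; the column names used below are hypothetical), rows and columns can be selected with subset():

students <- read.csv("students.csv")
# Rows only: records with a mark of 28 or higher (hypothetical 'Mark' column)
subset(students, Mark >= 28)
# Rows and columns: maths exams only, keeping selected variables
subset(students, Subject == "MATHS", select = c(Id, Mark))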

...


Drowning in a glass of water: variance-covariance and correlation matrices

Published on February 19, 2019 ·  3 min read

One of the easiest tasks in R is to get the correlations between each pair of variables in a dataset. As an example, let’s take the first four columns in the ‘mtcars’ dataset, which is available within R. Getting the variances, covariances and correlations is straightforward.

data(mtcars)
matr <- mtcars[,1:4]

#Covariances
cov(matr)
##              mpg        cyl       disp        hp
## mpg    36.324103  -9.172379  -633.0972 -320.7321
## cyl    -9.172379   3.189516   199.6603  101.9315
## disp -633.097208 199.660282 15360.7998 6721.1587
## hp   -320.732056 101.931452  6721.1587 4700.8669
#Correlations
cor(matr)
##             mpg        cyl       disp         hp
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475
## disp -0.8475514  0.9020329  1.0000000  0.7909486
## hp   -0.7761684  0.8324475  0.7909486  1.0000000

It’s really a piece of cake! Unfortunately, a few days ago I had a covariance matrix without the original dataset, and I wanted the corresponding correlation matrix. Although this is an easy task as well, at first I was stuck, because I could not find an immediate solution… So I started wondering how I could manage it.
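
For the record, base R does offer a one-line solution: cov2cor() converts a covariance matrix into the corresponding correlation matrix, with no need for the original data.

V <- cov(matr)
cov2cor(V)   # same result as cor(matr), obtained from the covariances alone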

...


Going back to the basics: the correlation coefficient

Published on February 7, 2019 ·  7 min read

A measure of joint variability

In statistics, dependence or association is any statistical relationship, whether causal or not, between two random variables or bivariate data. It is often measured by the Pearson correlation coefficient:

\[\rho_{X,Y} = \textrm{corr}(X,Y) = \frac{\textrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i = 1}^n \left[(X_i - \mu_X)(Y_i - \mu_Y)\right]}{n \, \sigma_X \sigma_Y}\]

Other measures of correlation can be thought of, such as the Spearman \(\rho\) rank correlation coefficient or the Kendall \(\tau\) rank correlation coefficient.
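
All three measures are directly available in R through the ‘method’ argument of cor() (toy vectors below):

x <- c(1, 3, 2, 5, 4); y <- c(2, 4, 3, 6, 7)  # toy data
cor(x, y, method = "pearson")
cor(x, y, method = "spearman")
cor(x, y, method = "kendall")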

...


Some useful equations for nonlinear regression in R

Published on January 8, 2019 ·  22 min read

Introduction

Very rarely do biological processes follow linear trends. Just think about how a crop grows, or responds to increasing doses of fertilisers/xenobiotics. Or think about how an herbicide degrades in the soil, or about the germination pattern of a seed population. It is very easy to realise that curvilinear trends are far more common than linear trends. Furthermore, asymptotes and/or inflection points are very common in nature. We can be sure: linear equations in biology are just a way to approximate a response over a very narrow range of the independent variable.
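
As a small illustration of one such process (simulated data; a first-order exponential decay, as often used for herbicide degradation), nonlinear equations can be fitted in R with nls():

# Simulated degradation data: C(t) = C0 * exp(-k * t) plus noise
set.seed(1234)
time <- seq(0, 60, by = 5)
conc <- 100 * exp(-0.05 * time) + rnorm(length(time), sd = 2)
mod <- nls(conc ~ C0 * exp(-k * time), start = list(C0 = 90, k = 0.1))
summary(mod)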

...