Fixing the bridge between biologists and statisticians

Models are wrong... but, some are useful (G. Box)!


Dealing with correlation in designed field experiments: part II

Published at May 10, 2019 ·  12 min read

With field experiments, studying the correlation between the observed traits may not be an easy task. Indeed, in these experiments, subjects are not independent, but they are grouped by treatment factors (e.g., genotypes or weed control methods) or by blocking factors (e.g., blocks, plots, main-plots). I have dealt with this problem in a previous post and I gave a solution based on traditional methods of data analyses.

In a recent paper, Piepho (2018) proposed a more advanced solution based on mixed models. He presented four examplary datasets and gave SAS code to analyse them, based on PROC MIXED. I was very interested in those examples, but I do not use SAS. Therefore, I tried to ‘transport’ the models in R, which turned out to be a difficult task. After struggling for awhile with several mixed model packages, I came to an acceptable solution, which I would like to share.

...

Dealing with correlation in designed field experiments: part I

Published at April 30, 2019 ·  7 min read

Observations are grouped

When we have recorded two traits in different subjects, we can be interested in describing their joint variability, by using the Pearson’s correlation coefficient. That’s ok, altough we have to respect some basic assumptions (e.g. linearity) that have been detailed elsewhere (see here). Problems may arise when we need to test the hypothesis that the correlation coefficient is equal to 0. In this case, we need to make sure that all the couples of observations are taken on independent subjects.

...

How do we combine errors? The linear case

Published at April 15, 2019 ·  7 min read

In our research work, we usually fit models to experimental data. Our aim is to estimate some biologically relevant parameters, together with their standard errors. Very often, these parameters are interesting in themselves, as they represent means, differences, rates or other important descriptors. In other cases, we use those estimates to derive further indices, by way of some appropriate calculations. For example, think that we have two parameter estimates, say Q and W, with standard errors respectively equal to \(\sigma_Q\) and \(\sigma_W\): it might be relevant to calculate the amount:

...

Some everyday data tasks: a few hints with R

Published at March 27, 2019 ·  9 min read

We all work with data frames and it is important that we know how we can reshape them, as necessary to meet our needs. I think that there are, at least, four routine tasks that we need to be able to accomplish:

  1. subsetting
  2. sorting
  3. casting
  4. melting

Obviously, there is a wide array of possibilities; I’ll just mention a few, which I regularly use.

Subsetting the data

Subsetting means selecting the records (rows) or the variables (columns) which satisfy certain criteria. Let’s take the ‘students.csv’ dataset, which is available on one of my repositories. It is a database of student’s marks in a series of exams for different subjects.

...

Drowning in a glass of water: variance-covariance and correlation matrices

Published at February 19, 2019 ·  3 min read

One of the easiest tasks in R is to get correlations between each pair of variables in a dataset. As an example, let’s take the first four columns in the ‘mtcars’ dataset, that is available within R. Getting the variances-covariances and the correlations is straightforward.

data(mtcars)
matr <- mtcars[,1:4]

#Covariances
cov(matr)
##              mpg        cyl       disp        hp
## mpg    36.324103  -9.172379  -633.0972 -320.7321
## cyl    -9.172379   3.189516   199.6603  101.9315
## disp -633.097208 199.660282 15360.7998 6721.1587
## hp   -320.732056 101.931452  6721.1587 4700.8669
#Correlations
cor(matr)
##             mpg        cyl       disp         hp
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475
## disp -0.8475514  0.9020329  1.0000000  0.7909486
## hp   -0.7761684  0.8324475  0.7909486  1.0000000

It’s really a piece of cake! Unfortunately, a few days ago I had a covariance matrix without the original dataset and I wanted the corresponding correlation matrix. Although this is an easy task as well, at first I was stuck, because I could not find an immediate solution… So I started wondering how I could make it.

...

Going back to the basics: the correlation coefficient

Published at February 7, 2019 ·  7 min read

A measure of joint variability

In statistics, dependence or association is any statistical relationship, whether causal or not, between two random variables or bivariate data. It is often measured by the Pearson correlation coefficient:

\[\rho _{X,Y} =\textrm{corr} (X,Y) = \frac {\textrm{cov}(X,Y) }{ \sigma_X \sigma_Y } = \frac{ \sum_{1 = 1}^n [(X - \mu_X)(Y - \mu_Y)] }{ \sigma_X \sigma_Y }\]

Other measures of correlation can be thought of, such as the Spearman \(\rho\) rank correlation coefficient or Kendall \(\tau\) rank correlation coefficient.

...

My first experience with blogdown

Published at November 15, 2018 ·  1 min read

This is my first day at work with blogdown. I must admit it is pretty overwhelming at the beginning …

I thought that it might be useful to write down a few notes, to summarise my steps ahead, during the learning process. I do not work with blogdown everyday and I tend to forget things quite easily. Therefore, these notes may help me recap how far I have come. And they might also help other beginners, to speed up their initial steps with such a powerful blogging platform.

...

Sample variance and population variance: which of the two?

Published at November 9, 2018 ·  7 min read

Teaching experimental methodology in agriculture related master courses poses some peculiar problems. One of these is to explain the difference between sample variance and population variance. For the students it is usually easy to grasp the idea that, being the mean the ‘center’ of the dataset, it is relevant to measure the average distance to the mean for all individuals in the dataset. Of course, we need to take the sum of squared distances, otherwise negative and positive residuals cancel each other out.

...

Is R dangerous? Side effects of free software for biologists

Published at June 8, 2014 ·  3 min read

When I started my career in the biological field (it’s already 25 years ago), only the luckiest of us had access to very advanced statistical software. Licenses were very expensive and it was not easy to convince the boss that they were really necessary: “Why do you need to spend so much money to perform an ANOVA?”. Indeed, simple one-way or two-ways ANOVAs were quite easy to perform and one of the people in my group had already built the appropriate routines for several designs, by using the GW-BASIC language. But I wanted more!

...