Is R dangerous? Side effects of free software for biologists

Andrea Onofri ·  Added on June 8, 2014 ·  3 min read

When I started my career in the biological field (it’s already 25 years ago), only the luckiest of us had access to very advanced statistical software. Licenses were very expensive and it was not easy to convince the boss that they were really necessary: “Why do you need to spend so much money to perform an ANOVA?”. Indeed, simple one-way or two-ways ANOVAs were quite easy to perform and one of the people in my group had already built the appropriate routines for several designs, by using the GW-BASIC language. But I wanted more!

My agriculture experiments often showed complex designs and split-plot, strip-plot, subsampling and repeated measures/experiments were much more than an exception. I decided to start writing several Quick-BASIC routines to implement those types of ANOVAs on my PC. At the beginning of the ’90s, nonlinear response models became in fashion and I had to programme my first Gauss-Newton optimiser, also with Quick-BASIC. GLMs were not yet widespread among biologists and I mainly relied on stabilising transformations for those cases where the basic assumptions for linear/nonlinear models were not met.

This ‘poor and humble’ statistical life gave me an undeniable advantage: it forced me into thoroughly studying and understanding the principles of each new technique and algorithm, before I could be able to programme it. I had to become acquainted with all those strange mathematical objects, such as matrices, eigenvalues and determinants, which are not usually a part of the mathematical background of biologists (at least in Italy). And my BASIC routines, once completed, could only do what they were programmed to do; only one specific solution to one specific task, no further options and no error management. In one sentence: no general solutions.

Nowadays I have R: it is free and everything is possible and smooth. A few lines of code and I can fit whatever model comes to my mind in a few minutes. I can try several options: which is the best one? Which is the one that makes my data tell the story I would like to tell? Biometry books have changed as well; they have taken a more ‘algorithmic’ approach and math is confined within boxes that may be easily skipped. I have to admit that I frequently skip them: code snippets are more than enough to do the trick and I can also find thousands of them in the Internet. In other words, why should I bother studying such an abstract thing called statistics when I have R?

Obviously this is just an exaggeration. However, I have the feeling that there might be some drawbacks relating to the availability of such a powerful free software. Biologists (especially students) may mistake studying R for studying stats. I am very much surprised to see how many complex models are fit on these days, with hardly any biological and statistical justifications and with very little care about the basic assumptions that these models make. A few days ago a PhD student at my Department showed me the results of fitting a reduced rank regression to a biological dataset. He was very proud of how he mastered the R coding process: by using the correct option (found after a thorough search over the Internet). He had even managed to avoid a ‘pretty strange’ warning message. Unfortunately, that warning message had been misinterpreted and therefore the analysis was wrong from the very beginning.

A warning message for all biologists (including myself): R is really wonderful, but it will not necessarily bring to sound data analyses. Let’s use R, but let’s study stats first!