In our research work, we usually fit models to experimental data. Our aim is to estimate some biologically relevant parameters, together with their standard errors. Very often, these parameters are interesting in themselves, as they represent means, differences, rates or other important descriptors. In other cases, we use those estimates to derive further indices, by way of some appropriate calculations. For example, think that we have two parameter estimates, say Q and W, with standard errors respectively equal to \(\sigma_Q\) and \(\sigma_W\): it might be relevant to calculate the amount:
\[Z = aQ + bW + c\]
where \(a\), \(b\) and \(c\) are three coefficients. The above operation is named ‘linear combination’; it is a sort of a weighted sum of model parameters. The question is: what is the standard error of Z? I would like to show this by way of a simple biological example.
Example
We have measured the germination rate for seeds of Brassica rapa at six levels of water potential in the substrate (-1, -0.9, -0.8, -0.7, -0.6 and -0.5 MPa). We would like to predict the germination rate for a water potential level of -0.75 MPa.
Literature references suggest that the relationship between germination rate and water potential in the substrate is linear. Therefore, as the first step, we fit a linear regression model to our observed data. If we are into R, the code to accomplish this task is shown below.
GR <- c(0.11, 0.20, 0.34, 0.46, 0.56, 0.68)
Psi <- c(-1, -0.9, -0.8, -0.7, -0.6, -0.5)
lMod <- lm(GR ~ Psi)
summary(lMod)
##
## Call:
## lm(formula = GR ~ Psi)
##
## Residuals:
## 1 2 3 4 5 6
## 0.0076190 -0.0180952 0.0061905 0.0104762 -0.0052381 -0.0009524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.25952 0.02179 57.79 5.37e-07 ***
## Psi 1.15714 0.02833 40.84 2.15e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01185 on 4 degrees of freedom
## Multiple R-squared: 0.9976, Adjusted R-squared: 0.997
## F-statistic: 1668 on 1 and 4 DF, p-value: 2.148e-06
It is clear that we can use the fitted model to calculate the GR-value at -0.75 MPa, as \(GR = 1.26 - 1.16 \times 0.75 = 0.39\). This is indeed a linear combination of model parameters, as we have shown above. Q and W are the estimated model parameters, while \(a = 1\), \(b = -0.75\) and \(c = 0\).
Of course, the derived response is also an estimate and we need to give a measure of uncertainty. For both model parameters we have standard errors; the question is: how does the uncertainty in parameter estimates propagates to their linear combination? In simpler words: it is easy to build a weighted sum of model parameters, but, how do we make a weighted sum of their standard errors?
Sokal and Rohlf (1981) at pag. 818 of their book, show that, in general, it is:
\[ \textrm{var}(a \, Q + b \, W) = a^2 \sigma^2_Q + b^2 \sigma^2_W + 2ab \sigma_{QW} \]
where \(\sigma_{QW}\) is the covariance of Q and W. I attach the proof below; it is pretty simple to understand and it is worth the effort. However, several students in biology are rather reluctant when they have to delve into maths. Therefore, I would also like to give an empirical ‘proof’, by using some simple R code.
Let’s take two samples (Q and W) and combine them by using two coefficients \(a\) and \(b\). Let’s calculate the variance for the combination:
Q <- c(12, 14, 11, 9)
W <- c(2, 4, 7, 8)
a <- 2
b <- 3
c <- 4
var(a * Q + b * W + c)
## [1] 35.58333
r
a^2 * var(Q) + b^2 * var(W) + 2 * a * b * cov(Q, W)
## [1] 35.58333
We see that the results match exactly! In our example, the variance and covariance of estimates are retrieved by using the ‘vcov()’ function:
vcov(lMod)
## (Intercept) Psi
## (Intercept) 0.0004749433 0.0006020408
## Psi 0.0006020408 0.0008027211
r
sigma2Q <- vcov(lMod)[1,1]
sigma2W <- vcov(lMod)[2,2]
sigmaQW <- vcov(lMod)[1,2]
The standard error for the prediction is simply obtained as:
sqrt( sigma2Q + 0.75^2 * sigma2W - 2 * 0.75 * sigmaQW )
## [1] 0.004838667
The functions ‘predict()’ and ‘glht()’
Now that we have understood the concept, we can use R to make the calculations. For this example, the ‘predict()’ method represents the most logical approach. We only need to pass the model object and the X value which we have to make a prediction for. This latter value needs to be organised as a data frame, with column name(s) matching the name(s) of the X-variable(s) in the original dataset:
predict(lMod, newdata = data.frame(Psi = -0.75),
se.fit = T)
## $fit
## 1
## 0.3916667
##
## $se.fit
## [1] 0.004838667
##
## $df
## [1] 4
##
## $residual.scale
## [1] 0.01185227
Apart from the predict method, there is another function of more general interest, which can be used to build linear combinations of model parameters. It is the ‘glht()’ function in the ‘multcomp’ package. To use this function, we need a model object and we need to organise the coefficients for the transformation in a matrix, with as many rows as there are combinations to calculate. When writing the coefficients, we disregard C: we have seen that every constant value does not affect the variance of the transformation.
For example, just imagine that we want to predict the GR for two levels of water potential, i.e. -0.75 (as above) and -0.55 MPa. The coefficients (A, B) for the combinations are:
Z1 <- c(1, -0.75)
Z2 <- c(1, -0.55)
We pile up the two vectors in one matrix with two rows:
M <- matrix(c(Z1, Z2), 2, 2, byrow = T)
And now we pass that matrix to the ‘glht()’ function as an argument:
library(multcomp)
lcombs <- glht(lMod, linfct = M)
summary(lcombs)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = GR ~ Psi)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 0.391667 0.004839 80.94 2.30e-07 ***
## 2 == 0 0.623095 0.007451 83.62 2.02e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
The above method can be easily extended to other types of linear models and linear combinations of model parameters. The ‘adjust’ argument is useful whenever we want to obtain familywise confidence intervals, which can account for the multiplicity problem. But this is another story…
Happy work with these functions!
Prof. Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
Send comments to: andrea.onofri@unipg.it
Appendix
We know that the variance for a random variable is defined as:
\[{\begin{array}{l} var(Z) = \frac{1}{n-1}\sum \left(Z - \hat{Z}\right)^2 = \\ = \frac{1}{n-1}\sum \left(Z - \frac{1}{n} \sum{Z}\right)^2 \end{array}}\]
If \(Z = aQ + bW + c\), where \(a\), \(b\) and \(c\) are the coefficients and Q and W are two random variables, we have:
\[ {\begin{array}{l} var(Z) = \frac{1}{n-1}\sum \left[aQ + bW + c - \frac{1}{n} \sum{ \left(aQ + bW + c\right)}\right]^2 = \\ = \frac{1}{n-1}\sum \left[aQ + bW + c - \frac{1}{n} \sum{ aQ} - \frac{1}{n} \sum{ bW} - \frac{1}{n} \sum{ c}\right]^2 = \end{array}} \]
\[{\begin{array}{l} = \frac{1}{n-1}\sum \left[aQ + bW + c - a \hat{Q} - b \hat{W} - c \right]^2 =\\ = \frac{1}{n-1}\sum \left[\left( aQ - a \hat{Q}\right) + \left( aW - a\hat{W}\right) \right]^2 = \end{array}}\]
\[{\begin{array}{l} =\frac{1}{n-1}\sum \left[a \left( Q - \hat{Q}\right) + b \left( W - \hat{W}\right) \right]^2 =\\ = \frac{1}{n-1}\sum \left[a^2 \left( Q - \hat{Q}\right)^2 + b^2 \left( W - \hat{W}\right)^2 + 2ab \left( Q - \hat{Q}\right) \left( W - \hat{W}\right)\right] = \end{array}} \]
\[ = a^2 \frac{1}{n-1} \sum{\left( Q - \hat{Q}\right)^2} + b^2 \frac{1}{n-1}\sum \left( W - \hat{W}\right)^2 + 2ab \frac{1}{n-1}\sum{\left[\left( Q - \hat{Q}\right) \left( W - \hat{W}\right)\right]}\]
Therefore:
\[ \textrm{var}(Z) = a^2 \sigma^2_Q + b^2 \sigma^2_W + 2ab \sigma_{Q,W}\]