We often use the coefficient of determination as a swift ‘measure’ of goodness of fit for our regression models. Unfortunately, there is no unique symbol for such a coefficient and both \(R^2\) and \(r^2\) are used in literature, almost interchangeably. Such an interchangeability is also endorsed by the Wikipedia (see at: https://en.wikipedia.org/wiki/Coefficient_of_determination ), where both symbols are reported as the abbreviations for this statistical index.
As an editor of several International Journals, I should not agree with such an approach; indeed, the two symbols \(R^2\) and \(r^2\) mean two different things, and they are not necessarily interchangeable, because, depending on the setting, either of the two may be wrong or ambiguous. Let’s pay a little attention to such an issue.
What are the ‘r’ and ‘R’ indices?
When studying the relationship between quantitative variables, we have two main statistical indices:
- the Pearson’s (simple linear) correlation coefficient, that is almost always indicated as the \(r\) coefficient. Correlation is different from regression, as it does not assume any sort of dependency between two quantitative variables and it is only meant to express their joint variability;
- the coefficient of multiple correlation, that is usually indicated with \(R\) and represents (definition from Wikipedia) a measure of how well a given variable can be predicted using a linear function of a set of other variables. Although \(R\) is based on correlation (it is the correlation between the observed values for the dependent variable and the predictions made by the model), it is used in the context of multiple regression, where we are studying a dependency relationship.
And, what about the coefficient of determination? It is yet another concept and another index, measuring (again from Wikipedia) the proportion of the variation in the dependent variable that is predictable from the independent variable(s). As you see, we are still in the context of regression and our aim is to describe the goodness of fit.
To start with, let’s abbreviate the coefficient of determination as \(CD\), in order to avoid any confusion with \(r\) and \(R\); this index can be be obtained as:
\[ CD_1 = \frac{SS_{reg}}{SS_{tot}}\]
or as:
\[ CD_2 = 1 - \frac{SS_{res}}{SS_{tot}}\]
where: \(SS_{reg}\) is the regression sum of squares, \(SS_{tot}\) is the total sum of squares and \(SS_{res}\) is the residual sum of squares, after a linear regression fit. The second formula is preferable: sum of squares are always positive and, thus, we clearly see that \(CD_2\) may not be higher than 1 (this is less obvious, for \(CD_1\).
So far, so good; we have three different indices and three different symbols: \(r\), \(R\) and \(CD\). But, in practice, things did not go that smoothly! The early statisticians, instead of proposing a brand new symbol for the coefficient of determination, made the choice of highlighting the connections with \(r\) and \(R\). For example, Sokal and Rohlf, in their very famous biometry book, at page 570 (2nd. Edition) demonstrated that the coefficient of determination could be obtained as the square of the coefficient of correlation and, thus, they used the symbol \(r^2\). Later, in the same book (pag. 660), these same authors demonstrated that the coefficient of multiple correlation \(R\) was equal to the positive square root of the coefficient of multiple determination, for which they used the symbol \(R^2\).
Obviously, Sokal and Rohlf (and other authors of other textbooks) meant to say that the coefficient of determination, depending on the context, can be obtained either as the square of the correlation coefficient, or as the square of the multiple correlation coefficient. An uncareful interpretation led to the idea that the coefficient of determination can be indicated either as the \(R^2\) or as the \(r^2\) and that the two symbols are always interchangeable. But, is this really true? No, it depends on the context.
Simple linear regression
Let’s have a look at the following example: we fit a simple linear regression model to a dataset and retrieve the coefficient of determination by using the summary()
method.
X <- 1:20
Y <- c(17.79, 18.81, 19.02, 14.14, 16.72, 16.05, 13.99, 13.26,
12.48, 11.33, 11.07, 9.68, 9.19, 9.44, 9.75, 7.71, 6.47,
5.22, 4.55, 7.7)
mod <- lm(Y ~ X)
summary(mod)$r.squared # Coeff. determination with R
## [1] 0.9270622
It is very easy to see that \(R = |r|\) and it is also easy to demonstrate that \(r^2 = CD_1\) (look, e.g., at Sokal and Rohlf for a mathematical proof). Furthermore, due to the equality \(SS_{tot} = SS_{reg} + SS_{res}\), it is also easy to see that \(CD_1 = CD_2\). We are ready to draw our first conclusion.
SSreg <- sum((predict(mod) - mean(Y))^2)
SStot <- sum((Y - mean(Y))^2)
SSres <- sum(residuals(mod)^2)
SSreg/SStot
## [1] 0.9270622
r
1 - SSres/SStot
## [1] 0.9270622
r
r.coef <- cor(X, Y)
R.coef <- cor(Y, fitted(mod))
r.coef^2
## [1] 0.9270622
r
R.coef^2
## [1] 0.9270622
CONCLUSION 1. For simple linear regression, the coefficient of determination is always equal to \(R^2 = r^2\) and both symbols are acceptable (and correct).
Polynomial regression and multiple regression
Apart from simple linear regression, for all other types of linear models, provided that an intercept is fitted, it is not, in general, true that \(R = |r|\), while it is, in general, true that that the coefficient of determination is equal to the squared coefficient of multiple correlation \(R^2\). I’ll show a swift example with a polynomial regression in the box below.
mod2 <- lm(Y ~ X + I(X^2))
cor.coef <- cor(X, Y)
R.coef <- cor(Y, fitted(mod2))
# R and r are not equal
cor.coef; R.coef
## [1] -0.9628407
## [1] 0.9652451
r
# The coefficient of determination is R2
R.coef^2; summary(mod2)$r.squared
## [1] 0.931698
## [1] 0.931698
Furthermore, when we have several predictors (e.g., multiple regression), the correlation coefficient is not unique and we could calculate as many \(r\) values as there are predictors in the model.
In the box below I show another example where I analysed the ‘mtcars’ dataset by using multiple regression; we see that \(R^2 = CD_1 = CD_2\).
data(mtcars)
mod <- lm(mpg ~ wt+disp+hp - 1, data = mtcars)
summary(mod)$r.squared # Coeff. determination with R
## [1] 0.8328665
r
SSreg <- sum((predict(mod) - mean(mtcars$mpg))^2)
SStot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
SSres <- sum(residuals(mod)^2)
SSreg/SStot
## [1] 0.9852479
r
1 - SSres/SStot
## [1] -1.084229
r
R.coef <- cor(mtcars$mpg, fitted(mod))
R.coef^2
## [1] 0.002746157
We are now ready to draw our second conclusion.
CONCLUSION 2: with all linear models, apart from simple linear regression, the \(r^2\) symbol should not be used for the coefficient of determination, because this latter index IS NOT, in general, equal to the square of the coefficient of correlation. The \(R^2\) symbol is a much better choice.
Linear models with no intercept
The situation becomes much more complex for linear models with no intercept. For these models, the squared multiple correlation coefficient IS NOT ALWAYS equal to the proportion of variance accounted for. Let’s look at the following example:
mod2 <- lm(Y ~ - 1 + X)
summary(mod2)$r.squared # Proportion of variance accounted for
## [1] 0.4390065
r
R.coef <- cor(Y, fitted(mod2))
R.coef^2
## [1] 0.9270622
In other words, the coefficient of determination IS NOT ALWAYS the \(R^2\); however, the coefficient of determination can be calculated by using \(CD_1 = CD_2\), provided that \(SS_{tot}\), \(SS_{reg}\) and \(SS_{res}\) are obtained in a way that accounts for the missing intercept. Schabenberger and Pierce recommend the following equations and the symbols they use clearly reflect that those equations do not return the squared multiple correlation coefficient:
\[R^2_{noint} = \frac{\sum_{i = 1}^{n}{\hat{y_i}^2}}{\sum_{i = 1}^{n}{y_i^2}} \quad \textrm{or} \quad R^{2*}_{noint} =1 - \frac{SS_{res}}{\sum_{i = 1}^{n}{y_i^2}}\]
SSreg <- sum(fitted(mod2)^2)
SStot <- sum(Y^2)
SSres <- sum(residuals(mod2)^2)
SSreg/SStot # R^2 ok
## [1] 0.4390065
r
1 - SSres/SStot # R^2 ok
## [1] 0.4390065
We are ready for our third conclusion.
CONCLUSION 3: in the case of models with no intercept, neither the \(r^2\) nor the \(R^2\) symbols should be used for the coefficient of determination. The proportion of variability accounted for by the model can be calculated by using a modified formula and should be reported by using a different symbol (e.g. \(R^2_{noint}\) or \(R^2_0\) or similar).
Nonlinear regression
With this class of models, we have two main problems:
- they do not have an intercept term, at least, not in the usual sense. Consequently, the square of the multiple coefficient of correlation does not represent the proportion of variance accounted for by the model;
- the equality \(SS_{tot} = SS_{reg} + SS_{res}\) may not hold and thus the equations for \(CD_1\) and \(CD_2\) may produce different results.
In contrast to linear models with no intercept, for nonlinear models we do not have any general modified formula that consistently returns the proportion of variance accounted for by the model (i.e., the coefficient of determination). However, Schabenberger and Pierce (2002) suggested that we can still use \(CD_2\) as a swift measure of goodness of fit, but they also proposed that we use the term ‘Pseudo-\(R^2\)’ instead oft \(R^2\). Why ‘Pseudo’?. For two good reasons:
- the ’Pseudo-\(R^2\)’cannot exceed 1, but it may lower than 0;
- the ‘Pseudo-\(R^2\)’ cannot be interpreted as the proportion of variance explained by the model.
In R, the ‘Pseudo-R2’ can be calculated by using the R2.nls()
function in the ‘aomisc’ package, for nonlinear models fitted with both the nls()
and drm()
functions (this latter function is in the ‘drc’ package).
library(aomisc)
X <- c(0.1, 5, 7, 22, 28, 39, 46, 200)
Y <- c(1, 13.66, 14.11, 14.43, 14.78, 14.86, 14.78, 14.91)
# nls fit
library(aomisc)
model <- nls(Y ~ SSmicmen(X, Vm, K))
R2nls(model)$PseudoR2
## [1] 0.9930399
r
# It is not the R2, in strict sense
R.coef <- cor(Y, fitted(model))
R.coef^2
## [1] 0.9957255
r
# It cannot be calculated by the usual expression!
SSreg <- sum(fitted(model) - mean(Y))
SStot <- sum( (Y - mean(Y))^2 )
SSreg/SStot
## [1] 0.003622005
r
# It can be calculated by using the alternative form
# that is no longer equivalent
SSres <- sum(residuals(model)^2)
1 - SSres/SStot
## [1] 0.9930399
We may now come to our final conclusion.
CONCLUSION 4. With nonlinear models, we should never use either \(r^2\) or \(R^2\) because they are both wrong. If we need a swift measure of goodness of fit, we can use the \(CD_2\) index above, but we should not name it as the \(R^2\), because, in general, it does not correspond to the coefficient of determination. We should better use the term Pseudo-R2.
Hope this was useful; for those who are interested in the use of the Pseud-\(R^2\) in nonlinear regression, I hav already published one post at this link: https://www.statforbiology.com/2021/stat_nls_r2/ .
Thanks for reading and happy coding!
Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
andrea.onofri@unipg.it
References
- Schabenberger, O., Pierce, F.J., 2002. Contemporary statistical models for the plant and soil sciences. Taylor & Francis, CRC Press, Books.
- Sokal, R.R., Rohlf F.J., 1981. Biometry. Second Edition, W.H. Freeman and Company, USA.