In a previous post we have presented our new ‘lmDiallel’ package (see this link here and see also the original paper in Theoretical and Applied Genetics). This package provides an extensions to fit a class of linear models of interest for plant breeders or geneticists, the so-called diallel models. In this post and other future posts we would like to present some examples of how to use this package: please, sit back and relax and, if you have comments, let us know, using the email link at the bottom of this post.

# But… what is a ‘diallel’?

If you are not a plant breeder or a geneticist in general, you may be asking this question. From the ancient Greek language, the ‘diallel’ word means ‘reciprocating’ and a diallel cross is a set of several possible crosses and selfs between some parental lines. For example, if we take the male lines A, B and C together with the same female lines A, B and C and we imagine to cross those lines with one another, we obtain:

- the selfs A\(\times\)A, B\(\times\)B and C\(\times\)C,
- the crosses A\(\times\)B, A\(\times\)C and B\(\times\)C,
- and, in some instances, the reciprocals B\(\times\)A, C\(\times\)A and C\(\times\)B (where the father and mother are swapped).

The performances of crosses and/or selfs and/or reciprocals can be compared by planning field experiments, usually known as **diallel experiments** and designed as randomised complete blocks with 3-4 replicates.

# The example

Depending on how the experiment is planned, we can have four experimental methods:

- Crosses + reciprocals + selfs (complete diallel)
- Crosses and reciprocals (no selfs)
- Crosses and selfs (no reciprocals)
- Only crosses (no selfs, no reciprocals)

In this post we will concentrate on the first design (complete diallel) and we will use a simple example with three parental lines (A, B and C). The csv file (‘diallel1.csv’) is available in an external repository; in the box below we load the data and we use the `group_by()`

function in the ‘dplyr’ package to obtain the means for all crosses and selfs.

```
library(tidyverse)
rm(list = ls())
df <- read_csv("https://www.casaonofri.it/_datasets/diallel1.csv")
df$Block <- factor(df$Block)
dfM <- df %>%
group_by(Par1, Par2) %>%
summarise(YieldM = mean(Yield), SEs = sd(Yield/sqrt(4)))
dfM
## # A tibble: 9 x 4
## # Groups: Par1 [3]
## Par1 Par2 YieldM SEs
## <chr> <chr> <dbl> <dbl>
## 1 A A 12 0.740
## 2 A B 13 0.600
## 3 A C 14 0.498
## 4 B A 11 1.00
## 5 B B 15 0.332
## 6 B C 21 0.273
## 7 C A 17 0.295
## 8 C B 16 0.166
## 9 C C 19 1.90
```

# What model do we use?

In order to describe the above dataset, we might think of a two-way ANOVA model, where the ‘father’ and ‘mother’ lines (the ‘Par1’ and ‘Par2’ variables, respectively) are used as the explanatory factors.

This is a very tempting solution, but we should resist: a two way ANOVA model regards the ‘father’ and ‘mother’ effects as two completely different series of treatments, neglecting the fact that they are, indeed, the same genotypes in different combinations. That is exactly why we need specific **diallel models** to describe the results of diallel experiments!

# The Hayman’s model type 1

The first diallel model was proposed by Hayman (1954) and it was devised for complete diallel experiments, where reciprocals are available. Neglecting the design effects (blocks and/or environments), the Hayman’s model is defined as:

\[y _{ijk} = \mu + \textrm{g}_i + \textrm{g}_j + \textrm{ts}_{ij} + \textrm{rg}^a_{i} + \textrm{rg}^b_{j} + rs_{ij} + \varepsilon_{ijk} \quad\quad\quad (Eq. 1)\]

where \(\mu\) is expected value (the overall mean, in the balanced case) and \(\varepsilon_{ijk}\) is the residual random error terms for the observation in the \(k^{th}\) block and with the parentals \(i\) and \(j\). All the other terms correspond to genetic effects, namely:

- the \(\textrm{g}_i\) and \(\textrm{g}_j\) terms are the
**general combining abilities**(GCAs) of the \(i^{th}\) and \(j^{th}\) parents. Each term relates to the average performances of a parental line in all its hybrid combination, under the sum-to-zero constraint (i.e. the sum of \(g\) values for all parentals must be zero). For example, with our balanced experiment, the overall mean is \(\mu = 15.33\), while the mean for the A parent when used as the ‘father’ is \(\mu_{A.} = 13\) and the mean for the same parent A when used as the ‘mother’ is \(\mu_{.A} = 13.33\). Consequently: \[g_A = \left(13 + 13.33 \right)/2 - 15.33 = -2.167\] Analogously, it is \(g_B = -0.167\). - The \(rg^a_i\) and \(rg^b_j\) terms are the
**reciprocal general combining abilities**(RGCAs) for the \(i^{th}\) and \(j^{th}\) parents. Each term relates to the discrepancy between the effect of a parent when it is used as father/mother and its average effect in all its combinations. For example, considering the parent A, the term \(rg^a_A\) is: \[rg^a_A = \mu_{A.} - \frac{\mu_{A.} + \mu_{.A}}{2} = 13 - 13.167 = -0.167\] Obviously, it must be \(rg^a_A = - rg^b_B\) and it must also be that the sum of all \(rg^a\) terms is zero (sum-to-zero constraint). - The \(\textrm{ts}_{ij}\) term is the total
**specific combining ability**(tSCA) for the combination between the \(i^{th}\) and \(j^{th}\) parents. It relates to the discrepancy from additivity for a specific combination of two parentals. For example, considering the ‘A \(\times\) B’ cross, the expected yield under additivity would be: \[\mu_{A:B} = \mu + \textrm{g}_A + \textrm{g}_B +\textrm{rg}^a_{A} + \textrm{rg}^b_{B} =\] \[ = 15.33 - 2.167 - 0.167 - 0.167 - 0.5 = 12.333\] while the observed yield is 13, with a with a difference of \(-0.667\). On the other hand, considering the ‘B \(\times\) A’ reciprocal cross, the expected yield under additivity would be: \[\mu_{A:B} = \mu + \textrm{g}_A + \textrm{g}_B +\textrm{rg}^a_{B} + \textrm{rg}^b_{A} =\] \[= 15.33 - 2.167 - 0.167 + 0.167 + 0.5 = 13.667\] while the observed yield is 11, with a difference of \(2.667\). The tSCA for the cross between A and B (regardless of the reciprocal) is the average difference, that is \(\textrm{ts}_{AB} = (-0.667 + 2.667)/2 = 1\). - The \(rs_{ij}\) term is the
**reciprocal specific combining ability**(RSCA) for a specific \(ij\) combination, that is the discrepancy between the performances of the two reciprocals (e.g, A \(\times\) B vs. B \(\times\) A). For example, the \(\textrm{rs}_{AB}\) term is equal to \(-0.667 - 1 = -1.667\), that is the opposite of \(\textrm{rs}_{BA}\).

# Model fitting with R

Hands-calculations based on means may be useful to understand the meaning of genetical effects, although they are biased with unbalanced designs and, above all, they are totally uninteresting from a practical point of view: we’d rather fit the model by using a statistical software.

Let’s assume that all effects are fixed, apart from the residual standard error. This is a reasonable assumption, as we have a very low number of parentals, which would make the estimation of variance components totally unreliable. We clearly see that the Hayman’s model above is a specific parameterisation of a general linear model and we should be able to fit it by the usual `lm()`

function and related methods. We can, indeed, do so by using our ‘lmDiallel’ extension package, that provides the facilities to generate the correct design matrices for the Hayman’s model (and for other diallel models, as we will show in future posts).

At the beginning, we have to install (if necessary) and load the ‘lmDiallel’ package (see box below). Model fitting can be performed by using the `GCA()`

, `tSCA()`

, `RGCA()`

and `RSCA()`

functions as shown in the box below: the resulting `lm`

object can be explored by the usual R methods, such as `summary()`

and `anova()`

.

```
# library(devtools) # Install if necessary
# install_github("OnofriAndreaPG/lmDiallel")
library(lmDiallel)
dMod <- lm(Yield ~ Block + GCA(Par1, Par2) + tSCA(Par1, Par2) +
RGCA(Par1, Par2) + RSCA(Par1, Par2), data = df)
summary(dMod)
##
## Call:
## lm(formula = Yield ~ Block + GCA(Par1, Par2) + tSCA(Par1, Par2) +
## RGCA(Par1, Par2) + RSCA(Par1, Par2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3500 -0.5644 0.0606 0.4722 2.7911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.558e+01 5.780e-01 26.962 < 2e-16 ***
## Block2 -3.772e-01 8.174e-01 -0.461 0.648613
## Block3 -3.011e-01 8.174e-01 -0.368 0.715830
## Block4 -3.261e-01 8.174e-01 -0.399 0.693458
## GCA(Par1, Par2)g_A -2.167e+00 2.890e-01 -7.497 9.77e-08 ***
## GCA(Par1, Par2)g_B -1.667e-01 2.890e-01 -0.577 0.569516
## tSCA(Par1, Par2)ts_A:A 1.000e+00 5.780e-01 1.730 0.096457 .
## tSCA(Par1, Par2)ts_A:B -1.000e+00 4.570e-01 -2.188 0.038609 *
## tSCA(Par1, Par2)ts_B:B 1.634e-16 5.780e-01 0.000 1.000000
## RGCA(Par1, Par2)rg_A -1.667e-01 2.890e-01 -0.577 0.569516
## RGCA(Par1, Par2)rg_B 1.333e+00 3.389e-01 3.934 0.000622 ***
## RSCA(Par1, Par2) 2.500e+00 5.309e-01 4.709 8.71e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.734 on 24 degrees of freedom
## Multiple R-squared: 0.8269, Adjusted R-squared: 0.7476
## F-statistic: 10.42 on 11 and 24 DF, p-value: 1.129e-06
anova(dMod)
## Analysis of Variance Table
##
## Response: Yield
## Df Sum Sq Mean Sq F value Pr(>F)
## Block 3 0.784 0.261 0.0869 0.9665
## GCA(Par1, Par2) 2 244.000 122.000 40.5743 1.999e-08 ***
## tSCA(Par1, Par2) 3 24.000 8.000 2.6606 0.0710 .
## RGCA(Par1, Par2) 2 9.333 4.667 1.5520 0.2323
## RSCA(Par1, Par2) 1 66.667 66.667 22.1717 8.710e-05 ***
## Residuals 24 72.164 3.007
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

For the sake of simplicity, we also built a wrapper function named `lm.diallel()`

, which can be used in the very same fashion as `lm()`

. The syntax is:

`lm.diallel(formula, Block, Env, data, fct)`

where ‘formula’ specifies the response variable and the two variables for parentals (e.g., Yield ~ Par1 + Par2) and the two arguments ‘Block’ and ‘Env’ are used to specify optional variables, coding for blocks and environments, respectively. The argument ‘data’ is a ‘dataframe’ where to look for the explanatory variables and, finally, ‘fct’ is a string variable coding for the selected model (“HAYMAN1”, for this example; see below).

```
dMod2 <- lm.diallel(Yield ~ Par1 + Par2, Block = Block,
data = df, fct = "HAYMAN1")
summary(dMod2)
##
## Call:
## lm.diallel(formula = Yield ~ Par1 + Par2, Block = Block, fct = "HAYMAN1",
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3500 -0.5644 0.0606 0.4722 2.7911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Intercept 1.533e+01 2.890e-01 53.056 < 2e-16 ***
## Block1 2.511e-01 5.006e-01 0.502 0.620484
## Block2 -1.261e-01 5.006e-01 -0.252 0.803236
## Block3 -5.000e-02 5.006e-01 -0.100 0.921264
## g_A -2.167e+00 2.890e-01 -7.497 9.77e-08 ***
## g_B -1.667e-01 2.890e-01 -0.577 0.569516
## ts_A:A 1.000e+00 5.780e-01 1.730 0.096457 .
## ts_A:B -1.000e+00 4.570e-01 -2.188 0.038609 *
## ts_B:B 7.155e-16 5.780e-01 0.000 1.000000
## rg_A -1.667e-01 2.890e-01 -0.577 0.569516
## rg_B 1.333e+00 3.389e-01 3.934 0.000622 ***
## rs_A:B 2.500e+00 5.309e-01 4.709 8.71e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.734 on 24 degrees of freedom
## Multiple R-squared: 0.8269, Adjusted R-squared: 0.7476
## F-statistic: 10.42 on 11 and 24 DF, p-value: 1.129e-06
anova(dMod2)
## Analysis of Variance Table
##
## Response: Yield
## Df Sum Sq Mean Sq F value Pr(>F)
## Block 3 0.784 0.261 0.0869 0.9665
## GCA 2 244.000 122.000 40.5743 1.999e-08 ***
## tSCA 3 24.000 8.000 2.6606 0.0710 .
## RGCA 2 9.333 4.667 1.5520 0.2323
## RSCA 1 66.667 66.667 22.1717 8.710e-05 ***
## Residuals 24 72.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The above function works very much like the `lm()`

function and makes use of the general purpose linear model solver `lm.fit()`

. Apart from simplicity, another advantage is that the call to `lm.diallel()`

returns an object of both ‘lm’ and ‘diallel’ classes. For this latter class, we built several specific S3 methods, such as the usual `anova()`

, `summary()`

and `model.matrix()`

methods, partly shown in the box above.

Considering that diallel models are usually fitted to determine genetical parameters, we also built the `glht.diallelMod()`

method and the `diallel.eff()`

function, which can be used with the ‘multcomp’ package, to retrieve the complete list of genetical parameters, as shown in the box below.

```
library(multcomp)
gh <- glht(linfct = diallel.eff(dMod2))
summary(gh, test = adjusted(type = "none"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Intercept == 0 1.533e+01 2.890e-01 53.056 < 2e-16 ***
## g_A == 0 -2.167e+00 2.890e-01 -7.497 5.85e-08 ***
## g_B == 0 -1.667e-01 2.890e-01 -0.577 0.569106
## g_C == 0 2.333e+00 2.890e-01 8.074 1.49e-08 ***
## ts_A:A == 0 1.000e+00 5.780e-01 1.730 0.095471 .
## ts_A:B == 0 -1.000e+00 4.570e-01 -2.188 0.037819 *
## ts_A:C == 0 9.992e-16 4.570e-01 0.000 1.000000
## ts_B:A == 0 -1.000e+00 4.570e-01 -2.188 0.037819 *
## ts_B:B == 0 7.155e-16 5.780e-01 0.000 1.000000
## ts_B:C == 0 1.000e+00 4.570e-01 2.188 0.037819 *
## ts_C:A == 0 9.992e-16 4.570e-01 0.000 1.000000
## ts_C:B == 0 1.000e+00 4.570e-01 2.188 0.037819 *
## ts_C:C == 0 -1.000e+00 5.780e-01 -1.730 0.095471 .
## rg_A == 0 -1.667e-01 2.890e-01 -0.577 0.569106
## rg_B == 0 1.333e+00 3.389e-01 3.934 0.000555 ***
## rg_C == 0 -1.167e+00 3.389e-01 -3.443 0.001962 **
## rs_A:B == 0 2.500e+00 5.309e-01 4.709 7.25e-05 ***
## rs_A:C == 0 -2.500e+00 5.309e-01 -4.709 7.25e-05 ***
## rs_B:A == 0 -2.500e+00 5.309e-01 -4.709 7.25e-05 ***
## rs_C:A == 0 2.500e+00 5.309e-01 4.709 7.25e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)
```

# Model fitting in two steps

In some cases, the analysis is performed in two steps and a diallel model is fitted to the means of selfs and crosses, which are calculated in the first step. Under the assumption of variance homogeneity and equal number of replicates, we can fit the Hayman’s model by using the `lm.diallel()`

function without the ‘Block’ argument.

```
dMod3 <- lm.diallel(YieldM ~ Par1 + Par2,
data = dfM, fct = "HAYMAN1")
```

In this case, we have no reliable estimate of residual error, but the `summary()`

and `anova()`

methods have been enhanced to give us the possibility of passing some information from the first step, i.e. an appropriate estimate of the residual mean square and degrees of freedom; the residual mean square from the first step needs to be appropriately weighted for the number of replicates (i.e., for this example, MSE = 3.007/4 with 24 degrees of freedom).

```
anova(dMod3, MSE = 3.007/4, dfr = 24)
## Analysis of Variance Table
##
## Response: YieldM
## Df Sum Sq Mean Sq F value Pr(>F)
## GCA 2 61.000 30.5000 40.5720 2.000e-08 ***
## tSCA 3 6.000 2.0000 2.6605 0.07101 .
## RGCA 2 2.333 1.1667 1.5519 0.23236
## RSCA 1 16.667 16.6667 22.1705 8.713e-05 ***
## Residuals 24 0.7518
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(dMod3, MSE = 3.007/4, dfr = 24)
## Estimate SE t value Pr(>|t|)
## Intercept 1.533333e+01 0.2890117 5.305436e+01 2.157713e-26
## g_A -2.166667e+00 0.2890117 -7.496812e+00 9.771159e-08
## g_B -1.666667e-01 0.2890117 -5.766779e-01 5.695269e-01
## ts_A:A 1.000000e+00 0.5780235 1.730034e+00 9.646589e-02
## ts_A:B -1.000000e+00 0.4569677 -2.188339e+00 3.861373e-02
## ts_B:B 2.294461e-15 0.5780235 3.969495e-15 1.000000e+00
## rg_A -1.666667e-01 0.2890117 -5.766779e-01 5.695269e-01
## rg_B 1.333333e+00 0.3388963 3.934340e+00 6.219023e-04
## rs_A:B 2.500000e+00 0.5309484 4.708555e+00 8.712864e-05
```

The genetical parameters can be obtained by using the `glht()`

function and passing the information from the first step within the call to the `diallel.eff()`

function.

```
gh2 <- glht(linfct = diallel.eff(dMod3, MSE = 3.007/4, dfr = 24))
summary(gh2, test = adjusted(type = "none"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Intercept == 0 1.533e+01 2.890e-01 53.054 < 2e-16 ***
## g_A == 0 -2.167e+00 2.890e-01 -7.497 5.85e-08 ***
## g_B == 0 -1.667e-01 2.890e-01 -0.577 0.569117
## g_C == 0 2.333e+00 2.890e-01 8.073 1.49e-08 ***
## ts_A:A == 0 1.000e+00 5.780e-01 1.730 0.095480 .
## ts_A:B == 0 -1.000e+00 4.570e-01 -2.188 0.037824 *
## ts_A:C == 0 -1.110e-15 4.570e-01 0.000 1.000000
## ts_B:A == 0 -1.000e+00 4.570e-01 -2.188 0.037824 *
## ts_B:B == 0 2.294e-15 5.780e-01 0.000 1.000000
## ts_B:C == 0 1.000e+00 4.570e-01 2.188 0.037824 *
## ts_C:A == 0 -1.110e-15 4.570e-01 0.000 1.000000
## ts_C:B == 0 1.000e+00 4.570e-01 2.188 0.037824 *
## ts_C:C == 0 -1.000e+00 5.780e-01 -1.730 0.095480 .
## rg_A == 0 -1.667e-01 2.890e-01 -0.577 0.569117
## rg_B == 0 1.333e+00 3.389e-01 3.934 0.000555 ***
## rg_C == 0 -1.167e+00 3.389e-01 -3.443 0.001962 **
## rs_A:B == 0 2.500e+00 5.309e-01 4.709 7.25e-05 ***
## rs_A:C == 0 -2.500e+00 5.309e-01 -4.709 7.25e-05 ***
## rs_B:A == 0 -2.500e+00 5.309e-01 -4.709 7.25e-05 ***
## rs_C:A == 0 2.500e+00 5.309e-01 4.709 7.25e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)
```

# Estimation of variance components (random genetic effects)

In some cases, genetic effects are regarded as random and the aim is to estimate variance components. For this, we can use the `mmer()`

function in the ‘sommer’ package (Covarrubias-Pazaran, 2016), although we need to code a few dummy variables, which may make the task difficult for practitioners. Therefore, we coded a wrapper for the `mmer()`

function (`mmer.diallel()`

)that uses the same syntax as `lm.diallel()`

.

It would make no sense to estimate the variance components for genetic effects with a diallel experiment based on three parentals and, therefore, we give an example based on the ‘hayman54’ dataset, as available in the ‘lmDiallel’ package and relating to a complete diallel experiment with eight parentals (Hayman, 1954).

```
rm(list=ls())
data(hayman54)
mod.ran <- mmer.diallel(Ftime ~ Par1 + Par2, Block = Block,
data = hayman54, fct = "HAYMAN1")
mod.ran$varcomp
## VarComp VarCompSE
## Block 0.00000 9.321698
## GCA 1276.73142 750.174164
## RGCA 17.97647 19.909911
## tSCA 1110.99398 330.172943
## RSCA 30.53937 46.467163
## Residuals 418.47875 74.563526
```

We do hope that you enjoyed this post; if you are interested in diallel models, please, stay tuned: we have other examples on the way.

Thanks for reading

Prof. Andrea Onofri

Prof. Luigi Russi

Dr. Niccolò Terzaroli

Department of Agricultural, Food and Environmental Sciences

University of Perugia (Italy)

Send comments to: andrea.onofri@unipg.it

# References

- Covarrubias-Pazaran, G., 2016. Genome-Assisted Prediction of Quantitative Traits Using the R Package sommer. PLOS ONE 11, e0156744. https://doi.org/10.1371/journal.pone.0156744
- Hayman, B.I., 1954. The Analysis of Variance of Diallel Tables. Biometrics 10, 235. https://doi.org/10.2307/3001877
- Möhring, J., Melchinger, A.E., Piepho, H.P., 2011b. REML-Based Diallel Analysis. Crop Science 51, 470–478. https://doi.org/10.2135/cropsci2010.05.0272
- Onofri, A., Terzaroli, N., Russi, L., 2020. Linear models for diallel crosses: a review with R functions. Theoretical Applied Genetics, https://doi.org/10.1007/s00122-020-03716-8