**2. Statistical methods used in genomic selection**

Finding a causal association between a genetic element and the trait of interest is the selection framework common to pedigree-based phenotypic selection, classical marker-assisted selection (MAS), and genomic selection. Conventional QTL mapping and MAS either overestimate marker effects or ignore them. The desire to use high-density genotyping technology for the prediction of complex traits led to the development of genomic selection. According to Meuwissen et al. [1], avoiding marker selection minimizes bias in effect estimation and in the computation of genetic values. Using all markers, however, means that the number of predictor effects (p) exceeds the number of observations (n). With too few degrees of freedom, the ordinary least squares estimator cannot estimate all of the predictor effects and the model is over-fitted, so its prediction performance suffers. To address this issue, various genomic selection models have been developed, including shrinkage methods, variable selection models, kernel approaches, and dimension reduction methods.

#### **2.1 The basic genetic model and variance decomposition**

The basic genetic model expresses the phenotype (P) of an individual as the sum of its genetic value (G) and a residual environmental effect (E), assuming that only the effects of genetic factors are inherited by the next generation. The genetic value includes additive, dominance and epistatic effects. The model is written as:

$$P = G + E \tag{1}$$

The phenotypic variance V(P) can be expressed as in Eq. (2) [2]. In the absence of covariance between G and E, the last term is zero, which gives Eq. (3):

$$V(P) = V(G) + V(E) + 2\,\mathrm{Cov}(G, E) \tag{2}$$

$$V(P) = V(G) + V(E) + 0 = V(G) + V(E) \tag{3}$$

The GEBV is, in general, an estimate of G.
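The decomposition in Eqs. (1) to (3) can be checked numerically. The following is a minimal sketch under stated assumptions: the genetic and environmental values are simulated independently (so their covariance should be close to zero), and the function names are hypothetical, chosen only for illustration.

```python
import random

def variance(values):
    """Population variance (divides by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def covariance(a, b):
    """Population covariance between two equally long lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def decompose_phenotypic_variance(g, e):
    """Return V(P), V(G), V(E) and Cov(G, E) for P = G + E (Eq. 1)."""
    p = [gi + ei for gi, ei in zip(g, e)]
    return {
        "V_P": variance(p),
        "V_G": variance(g),
        "V_E": variance(e),
        "Cov_GE": covariance(g, e),
    }

# Simulated values (assumption): G and E are drawn independently.
rng = random.Random(1)
g = [rng.gauss(0.0, 2.0) for _ in range(5000)]
e = [rng.gauss(0.0, 1.0) for _ in range(5000)]
parts = decompose_phenotypic_variance(g, e)

# Eq. (2) holds exactly; Eq. (3) holds approximately because Cov(G, E) is near zero.
print(parts)
```

In this sketch Eq. (2) is an algebraic identity, whereas Eq. (3) is only as good as the assumption that G and E are uncorrelated.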

#### **2.2 Heritability**

Heritability is the fraction of phenotypic variation, V(P), that is due to variation in genetic value, V(G). It measures how well a population's phenotypic characteristics are passed on to the following generation. Heritability is defined in two ways: in the broad sense and in the narrow sense. Broad-sense heritability (H<sup>2</sup>) considers all genetic influences, including additive, dominance, and epistatic effects, and can be written as

$$H^2 = \frac{V(G)}{V(P)}\tag{4}$$

Narrow-sense heritability (h<sup>2</sup>), in contrast, captures only the proportion of phenotypic variation that is due to the additive genetic effect, V(A); the remaining variation is collected in a residual variance denoted V(ɛ). It is represented by $h^2 = \frac{V(A)}{V(P)}$. Therefore, for h<sup>2</sup>, the genetic model can be rewritten as:

$$V(P) = V(A) + V(\varepsilon) \tag{5}$$

where V(ɛ) represents the residual effects that are not included in the additive genetic effect (A), such as dominance and epistatic effects.

Narrow-sense heritability is the most important in plant selection because the additive variance accounts for nearly all (close to 100%) of the genetic variance that affects the response to selection. Meuwissen et al. [1] suggested that V(A) can be broken down into the effects of individual DNA markers, V(A1), V(A2), V(A3), and so on. This makes it easier to calculate the breeding value of a plant using markers that cover the complete genome.
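As an illustration of Eqs. (4) and (5), the sketch below computes both heritabilities from variance components. The split of the non-additive genetic variance into dominance (`v_d`) and epistatic (`v_i`) parts, the function name, and the numerical values are assumptions made for the example; covariances between components are taken to be zero.

```python
def heritabilities(v_a, v_d, v_i, v_e):
    """Return (H2, h2) from additive, dominance, epistatic and environmental variances."""
    v_g = v_a + v_d + v_i   # total genetic variance V(G)
    v_p = v_g + v_e         # phenotypic variance V(P), as in Eq. (3)
    broad = v_g / v_p       # Eq. (4)
    narrow = v_a / v_p      # additive share of V(P)
    return broad, narrow

# Hypothetical variance components, for illustration only.
H2, h2 = heritabilities(v_a=4.0, v_d=1.0, v_i=1.0, v_e=4.0)
print(f"H2 = {H2:.2f}, h2 = {h2:.2f}")
```

With these illustrative values H<sup>2</sup> is 0.6 and h<sup>2</sup> is 0.4; h<sup>2</sup> can never exceed H<sup>2</sup>, because V(A) is one part of V(G).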

#### **2.3 Breeding value**

In animal breeding, the term "breeding value" refers to how many beneficial genes an animal passes on to its progeny. The genotypic value and the breeding value can be equal, but because of dominance or epistasis this is not always the case. Alleles at the loci that affect the phenotype are heritable, so knowing the effect of an allele in a population helps to predict the phenotype of the progeny. For a given trait, this effect depends on the allele frequencies in the population and on the effects of the genotypes that carry the allele; it is referred to as the allele's average effect.

An individual's breeding value is the sum of the average effects of all the alleles the individual carries [3]. For example, if an A allele is worth +5 and a B allele is worth −2, an AB heterozygote has a breeding value of +3. Breeding value (BV) can alternatively be described, using the concept of narrow-sense heritability (h<sup>2</sup>), in terms of the deviation of an individual's phenotypic value from the population mean. This can be expressed as:

$$BV = m_0 + h^2 \left( y_i - m_0 \right) = m_0 + \left( y_i - m_0 \right) \frac{V(A)}{V(P)} \tag{6}$$

where y<sub>i</sub> is the phenotypic value of individual i (i = 1, 2, ..., n) and m<sub>0</sub> denotes the population's mean phenotypic value. A breeding value that is estimated based on heredity is called an estimated breeding value (EBV). Genomic selection, on the other hand, employs genome-wide markers to evaluate genotype effects and breeding values, resulting in genomic estimated breeding values (GEBVs) [2].
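The two descriptions of breeding value given above can be sketched in a few lines. The function names and the phenotypic values are hypothetical; the allele effects are those of the AB example in the text.

```python
def allelic_breeding_value(alleles, average_effects):
    """Sum the average effects of the alleles an individual carries."""
    return sum(average_effects[a] for a in alleles)

def breeding_value_from_phenotype(y_i, m_0, h2):
    """Eq. (6): the deviation from the population mean, scaled by h2, added to the mean."""
    return m_0 + h2 * (y_i - m_0)

# AB heterozygote with A worth +5 and B worth -2.
bv_ab = allelic_breeding_value(("A", "B"), {"A": 5, "B": -2})

# Hypothetical individual: phenotype 120, population mean 100, h2 = 0.4.
bv_pheno = breeding_value_from_phenotype(y_i=120.0, m_0=100.0, h2=0.4)
print(bv_ab, bv_pheno)
```

In the second case only 40% of the 20-unit phenotypic superiority is credited to the breeding value, so the result is about 108 rather than 120.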

### **3. Models**

#### **3.1 The linear model**

A linear model or one of its extensions can be used to describe the causal link between phenotype and genotype. Consider a training population of N individuals genotyped at M biallelic markers, and the pairs of observed phenotype and genotype at marker 1, (y<sub>i</sub>, x<sub>1i</sub>), that is, (y<sub>1</sub>, x<sub>11</sub>), (y<sub>2</sub>, x<sub>12</sub>), ..., (y<sub>N</sub>, x<sub>1N</sub>). The phenotypes of the N individuals are assumed to be normally distributed, with each individual's phenotype shifted by an amount β<sub>1</sub> according to its marker genotype. The phenotype y<sub>i</sub> can then be modeled through the genetic value g<sub>i</sub> = x<sub>1i</sub>β<sub>1</sub> as a parametric regression on the marker covariate x<sub>1i</sub>: y<sub>i</sub> = β<sub>0</sub> + x<sub>1i</sub>β<sub>1</sub> + ε<sub>i</sub>, where β<sub>0</sub> is the intercept (overall mean), β<sub>1</sub> is the marker effect (regression coefficient), and x<sub>1i</sub> is the genotype value of marker 1 for individual i. β<sub>0</sub> and β<sub>1</sub> are the parameters to be determined, and ε<sub>i</sub> is an error term that is usually assumed to be normally distributed with a mean of zero. The unknown parameters are determined by least-squares estimation: the line is fitted to the phenotypes by minimizing the sum of the squared errors ε<sub>i</sub><sup>2</sup>, that is, the error function $E = \sum_{i} \left( y_i - \beta_0 - x_{1i}\beta_1 \right)^2$. However, fitting the model to M markers on only N individuals results in overfitting. To avoid overfitting, a penalty term is introduced into the error function, i.e.,

$$E = \sum_{i=1}^{N} \left( y_i - \sum_{j=0}^{M} x_{ji}\beta_j \right)^2 + \lambda \sum_{j=0}^{M} |\beta_j|^q \tag{7}$$

where λ controls the effect of the penalty term and the exponent q sets its form (q = 2 corresponds to ridge regression and q = 1 to LASSO).

To incorporate genome-wide markers, the single-marker model above can be extended into a multiple linear regression model, which gives the following formula [2]:

$$
y_i = \beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \cdots + x_{Mi}\beta_M + \varepsilon_i \tag{8}
$$

*Case Studies of Breeding Strategies in Major Plant Species*

$$y_i = \beta_0 + \sum_{j=1}^{M} x_{ji}\beta_j + \varepsilon_i \tag{9}$$

where y<sub>i</sub> is the phenotypic value of individual i and x<sub>ji</sub> is the genotype value of the jth marker in the ith individual. The coefficient β<sub>j</sub> is the effect of marker j on the phenotype, that is, the regression of y<sub>i</sub> on the jth marker covariate x<sub>ji</sub>, and ε<sub>i</sub> is the random error [2]. The intercept can be absorbed into the sum by introducing a dummy variable x<sub>0i</sub> = 1. As before, the coefficients are determined by minimizing the error function,

$$E = \sum_{i=1}^{N} \left( y_i - \sum_{j=0}^{M} x_{ji}\beta_j \right)^2 \tag{10}$$

In genomic selection, the focus is on calculating the genomic estimated breeding value rather than on the exact location of the QTL. Therefore, using the link function of the linear model, which relates the linear predictor to the mean of the distribution function, together with the error variance of the regression, the model can be rewritten as [2]:

$$y_i = \sum_{j=0}^{M} x_{ji}\beta_j + \varepsilon_i \tag{11}$$

A number of models have been developed from these fundamental concepts, including random regression best linear unbiased prediction (RR-BLUP), the least absolute shrinkage and selection operator (LASSO), reproducing kernel Hilbert spaces (RKHS) regression, support vector machine regression, Bayesian methods, and collaborative filtering recommender systems [5]. The majority of GS models aim to minimize a cost function [6].
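The penalized error function of Eq. (7) can be illustrated with a small ridge-type example (q = 2) in which markers outnumber individuals. This is a sketch under stated assumptions, not any of the cited models: the data are simulated, genotypes are coded −1/0/1, phenotypes are centered so that the intercept is dropped (the sum therefore runs over markers only), and the coefficients are found by simple coordinate descent. All function names are hypothetical.

```python
import random

def penalized_error(X, y, beta, lam):
    """Eq. (7) with q = 2: squared error plus lam times the sum of squared effects."""
    sse = sum((yi - sum(x * b for x, b in zip(row, beta))) ** 2
              for row, yi in zip(X, y))
    return sse + lam * sum(b * b for b in beta)

def ridge_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize penalized_error one marker effect at a time (requires lam > 0)."""
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    residual = list(y)  # residuals when all effects are zero
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(m)]
    for _ in range(n_sweeps):
        for j in range(m):
            # Cross-product of marker j with the residual that excludes marker j's own fit.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n))
            new_beta = rho / (col_ss[j] + lam)  # exact minimizer for coordinate j
            delta = new_beta - beta[j]
            for i in range(n):
                residual[i] -= X[i][j] * delta
            beta[j] = new_beta
    return beta

# Simulated training population (assumption): N = 20 individuals, M = 50 markers.
rng = random.Random(0)
N, M = 20, 50
X = [[rng.choice((-1, 0, 1)) for _ in range(M)] for _ in range(N)]
true_beta = [1.0 if j < 5 else 0.0 for j in range(M)]
y = [sum(x * b for x, b in zip(row, true_beta)) + rng.gauss(0.0, 0.5) for row in X]
mean_y = sum(y) / N
y = [yi - mean_y for yi in y]  # center phenotypes so that no intercept is needed

beta_weak = ridge_coordinate_descent(X, y, lam=1.0)      # light shrinkage
beta_strong = ridge_coordinate_descent(X, y, lam=100.0)  # heavy shrinkage
print(sum(b * b for b in beta_weak), sum(b * b for b in beta_strong))
```

A larger λ pulls the marker effects more strongly toward zero, which is how the penalty keeps a model with more markers than individuals estimable.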

#### **3.2 Evaluating genomic prediction accuracy**

Candidates for selection have no phenotypic information. As a result, their GEBV predictive performance may be evaluated using either a group of validation individuals with highly accurate EBVs and many progenies or cross validation. Both methods necessitate a reference population that contains both marker genotypes and phenotypic information.

#### *3.2.1 Correlation studies between GEBV and observed EBV value*

The correlation r(GEBV:EBV) between the GEBVs (predicted) and empirically estimated breeding values (observed) is used to assess the prediction accuracy of the GEBVs. The EBV can be obtained in a number of ways, the most basic of which is as a phenotypic mean. This correlation provides a direct link between GEBV prediction accuracy and selection response, as well as a rough estimate of selection accuracy. Other statistics, such as the mean squared error (MSE), are occasionally used. Genomic selection accuracy itself is quantified by the correlation between GEBV and true breeding value (TBV), r(GEBV:TBV). Because only r(GEBV:EBV) can be measured, this value must be converted into an estimate of r(GEBV:TBV). To do so, the following relationship is assumed:

*Genomic Selection: A Faster Strategy for Plant Breeding DOI: http://dx.doi.org/10.5772/intechopen.105398*

$$r(GEBV:EBV) = r(GEBV:TBV) \times r(EBV:TBV) \tag{12}$$

This assumption is accurate if the TBV is the only component that the GEBV and the EBV have in common. In other words,

$$\text{GEBV} = \text{TBV} + \text{e1} \tag{13}$$

and

$$EBV = TBV + e2\tag{14}$$

where e1 and e2 are uncorrelated residual errors. If the training and validation data were obtained in the same environment, the assumption may be violated. In that case, genotype-by-environment (G×E) interaction would generate a common error component in both GEBV and EBV, biasing their correlation upward. To obtain accurate estimates of GEBV prediction accuracy, training and validation data should therefore be collected in different environments. The correction by r(EBV:TBV) accounts for the fact that the EBV in the validation set is not error-free. When the EBVs are phenotypes, r(EBV:TBV) within the validation set equals the square root of the heritability (h) [7], so that r(GEBV:TBV) can be estimated as r(GEBV:EBV)/h.
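The correction implied by Eq. (12) amounts to a single division when the EBVs are phenotypes. The sketch below assumes that case; the function name and the numerical values are hypothetical.

```python
import math

def accuracy_vs_true_bv(r_gebv_ebv, h2):
    """Estimate r(GEBV:TBV) from r(GEBV:EBV), assuming r(EBV:TBV) = sqrt(h2)."""
    if not 0.0 < h2 <= 1.0:
        raise ValueError("h2 must lie in the interval (0, 1]")
    return r_gebv_ebv / math.sqrt(h2)

# Hypothetical values: observed correlation 0.35, heritability 0.49 (h = 0.7).
r_true = accuracy_vs_true_bv(r_gebv_ebv=0.35, h2=0.49)
print(round(r_true, 3))
```

With these values the estimated accuracy against the true breeding value is about 0.5; the lower the heritability, the more the observed correlation understates it.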

#### *3.2.2 Evaluating GEBV accuracy through cross validation (CV)*

Cross validation (CV) is used in GS research to evaluate GEBV accuracy on empirical data. In cross validation, the reference population is divided into subsets, such as a training set and a validation (testing) set. The validation individuals and the selection candidates need to have similar genetic backgrounds and relationships to the reference population, so that the accuracies achieved for selection candidates resemble those estimated within the reference population. The size of the subsets affects the accuracy estimate; larger subsets usually result in lower sampling variance of the correlation between predicted and observed values [8].

The number of observations in each set varies, but fivefold CV is frequently employed: the data set is divided at random into five subsets, four of which are combined to form the training set, while the remaining one serves as the validation set. Each subset is used as the validation set once. The accuracy of the model should be evaluated in this way before it is applied to the breeding population. The majority of the training population is used to build a prediction model, which is then used to estimate the GEBVs of the remaining individuals in the training population based solely on their genotypic data. This allows researchers to test and refine the prediction model to ensure that its prediction accuracy is high enough for future predictions to be trusted. Once validated, the model is used to calculate GEBVs of lines for which genotypic but not phenotypic information is available [8, 9].
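The fivefold procedure described above can be sketched as follows. This is an illustrative sketch only: the data are simulated, the prediction model is a simple ridge regression fitted by coordinate descent (a stand-in, not a specific model from the cited studies), and all function names are hypothetical.

```python
import math
import random

def fit_ridge(X, y, lam=1.0, n_sweeps=50):
    """Ridge regression by coordinate descent; returns (phenotype mean, marker effects)."""
    n, m = len(X), len(X[0])
    mean_y = sum(y) / n
    residual = [yi - mean_y for yi in y]
    beta = [0.0] * m
    col_ss = [sum(row[j] ** 2 for row in X) for j in range(m)]
    for _ in range(n_sweeps):
        for j in range(m):
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n))
            new_beta = rho / (col_ss[j] + lam)
            delta = new_beta - beta[j]
            for i in range(n):
                residual[i] -= X[i][j] * delta
            beta[j] = new_beta
    return mean_y, beta

def predict(model, X):
    """Predicted values (GEBV plus mean) from genotypes alone."""
    mean_y, beta = model
    return [mean_y + sum(x * b for x, b in zip(row, beta)) for row in X]

def pearson(a, b):
    """Pearson correlation between two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def k_fold_indices(n, k=5, seed=0):
    """Randomly partition the indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def cross_validate(X, y, k=5, seed=0):
    """Return the predicted-versus-observed correlation for each validation fold."""
    folds = k_fold_indices(len(y), k, seed)
    accuracies = []
    for f, valid in enumerate(folds):
        # The other k - 1 folds together form the training set.
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        model = fit_ridge([X[i] for i in train], [y[i] for i in train])
        predicted = predict(model, [X[i] for i in valid])
        accuracies.append(pearson(predicted, [y[i] for i in valid]))
    return accuracies

# Simulated reference population (assumption): 60 individuals, 30 markers.
rng = random.Random(42)
N, M = 60, 30
X = [[rng.choice((-1, 0, 1)) for _ in range(M)] for _ in range(N)]
effects = [1.0 if j < 5 else 0.0 for j in range(M)]
y = [sum(x * b for x, b in zip(row, effects)) + rng.gauss(0.0, 0.5) for row in X]

fold_accuracies = cross_validate(X, y)
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(fold_accuracies, mean_accuracy)
```

Every individual is predicted exactly once from a model that never saw its phenotype, and the mean of the five fold correlations serves as the estimate of prediction accuracy.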

#### **3.3 Factors affecting genomic selection accuracy**

The response to genomic selection is the result of numerous factors that contribute to the accuracy of GEBV estimation, and these factors are closely interlinked. They include the extent and distribution of linkage disequilibrium in the population, model performance, sample size and relatedness, marker density, gene effects, heritability, and genetic design.

a. Marker density

The marker density and training population (TP) size required for satisfactory accuracy are heavily influenced by factors such as effective population size and the number of QTL. The minimum number of markers needed to cover the complete genome is determined from LD decay, so that at least one marker is in LD with each gene region. Prediction is better when LD is extensive and markers are dense [10]. Within families, however, marker density has minimal effect on prediction accuracy unless it is extremely low. Furthermore, some GS models, such as Bayes B, do not require particularly dense markers for good breeding value prediction. The required marker density also depends on the type of marker. For example, bi-allelic markers such as SNPs require two to three times the density of multi-allelic markers such as SSRs [3, 4, 11].

b. Size and composition of training population

GS accuracy is affected by the size of the training population. VanRaden et al. [12] found that the relationship between accuracy and training population (TP) size was nearly linear up to the largest size examined; in other words, the highest GS accuracies were achieved with the largest training populations. Population structure, the age of the training population, and the use of multiple generations for training also affect accuracy. Training on close ancestors or parents, older lines, related lines, and multiple generations gives good accuracy [4, 10]. Additionally, using a pooled training set of heterotic groups could improve accuracy [13]. As a result, the parents or recent ancestors of the population under selection can be used as the training population, with training repeated over generations, to achieve high accuracy.

To maintain accuracy when using landraces or exotic germplasm in GS, very high marker density and a large training population are required [1]. In addition, when the training population consists of unrelated individuals or single crosses, marker effects become inconsistent. Differences in the alleles present, allele frequencies, genetic background, or epistatic interactions may then lead to erroneous estimates of marker effects and GEBVs [14].

c. Number of QTL

The number of QTL and the heritability of the trait determine the appropriate marker density and training population size. Even traits with low heritability can be accurately predicted with a large training population [15]. A model such as BLUP can be employed for this prediction, since it captures many small-effect QTL that may not be in LD with a marker.

d. Heritability

Low heritability of a trait is associated with lower GEBV accuracies. For low-heritability traits (particularly for h<sup>2</sup> < 0.4), high accuracy can only be maintained by using a large training population with many phenotypic records [10, 15]. Consider a population with an effective size of 1000 individuals and a target accuracy of 0.70. If the heritability, h<sup>2</sup>, is 0.2, the training population (TP) is expected to need 9000 individuals, whereas if h<sup>2</sup> is 0.50, a TP of fewer than 3000 is sufficient. Across varied population sizes, QTL numbers, and heritabilities, responses to genomic selection were 18–43 percent higher than those to marker-assisted recurrent selection (MARS) [16].

e. Linkage disequilibrium (LD)

LD refers to the nonrandom association of alleles at different loci. The required marker density and the expected GS accuracy can be estimated from the rate of LD decay across the genome. It has been found that an average LD between adjacent markers (r<sup>2</sup>) of 0.15 is sufficient for high-heritability traits, whereas increasing r<sup>2</sup> to 0.2 improves GEBV prediction accuracy for low-heritability traits.
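For two biallelic loci, the r<sup>2</sup> measure of LD is commonly computed from haplotype and allele frequencies as r<sup>2</sup> = D<sup>2</sup> / (p<sub>A</sub>(1 − p<sub>A</sub>) p<sub>B</sub>(1 − p<sub>B</sub>)), with D = p<sub>AB</sub> − p<sub>A</sub>p<sub>B</sub>. The sketch below assumes phased haplotypes coded 0/1 at each locus; the function name and the toy data are hypothetical.

```python
def ld_r2(locus_a, locus_b):
    """r^2 between two biallelic loci from phased haplotypes coded 0/1."""
    n = len(locus_a)
    p_a = sum(locus_a) / n                                   # frequency of allele 1 at locus A
    p_b = sum(locus_b) / n                                   # frequency of allele 1 at locus B
    p_ab = sum(a * b for a, b in zip(locus_a, locus_b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                     # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom

# Toy haplotypes: complete LD, no LD, and an intermediate case.
complete = ld_r2([1, 1, 0, 0], [1, 1, 0, 0])
none = ld_r2([1, 1, 0, 0], [1, 0, 1, 0])
partial = ld_r2([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
print(complete, none, partial)
```

Averaging such pairwise values over adjacent markers gives the kind of genome-wide r<sup>2</sup> figure that the thresholds quoted above refer to.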
