**2. Strategies to infer adaptability of wild relatives and landraces to their natural habitats**

Understanding the genomic signatures associated with environmental variation provides insights into how species adapt to their environment [13–15]. Recent genomic studies in wild populations have demonstrated that genome-environment associations, which are associations between SNP alleles and accessions' environment of origin, can indeed be used to identify adaptive loci and predict phenotypic variation. For instance, Turner *et al*. [16] predicted genetic adaptive variation to serpentine soils in *Arabidopsis lyrata;* Hancock *et al*. [17] identified climate-adaptive genetic loci among a set of geographically diverse *Arabidopsis thaliana;* Fischer *et al*. [18] predicted adaptive variation to topoclimatic factors in *Arabidopsis halleri;* Pluess *et al*. [19] predicted genetic local adaptation to climate at a regional scale in *Fagus sylvatica;* and Yeaman *et al*. [20] detected convergent local adaptation in two distantly related species of conifers.

This genome-environment association approach has also been explored in some crop accessions as a prospection strategy of germplasm, alternative to traditional phenotyping. For example, Yoder *et al*. [21] was able to capture adaptive variation to thermal tolerance, drought tolerance, and resistance to pathogens in *Medicago truncatula;* Lasky *et al*. [22] predicted genotype-by-environment interactions to drought stress and aluminum toxicity in *Sorghum bicolor;* and Berthouly-Salazar *et al*. [23] uncovered genomic regions involved in adaption to abiotic and biotic stress on two climate gradients in *Cenchrus americanus*.

**Figure 1.**

Modified from Cortés *et al*. [12].

Geographic distribution of wild relatives (light gray) and landraces (dark gray) of common bean and its diversity in terms of seed size, colors, and patterns.

116 Rediscovery of Landraces as a Resource for the Future

Nonetheless, since genomic signatures associated with habitat heterogeneity can result from causes other than adaptation and selection [24, 25], for example, random genetic drift (**Figure 2**), and are also influenced by differences in ancestral variation and recombination in the genome [27–29], some further approaches need to be undertaken to clarify the truthful nature of the divergent regions. For instance, the origin of habitat-associated variants from novel or standing genetic variation leads to distinctively different patterns of genomic divergence [30–32]. One approach that can help to distinguish these underlying causes of divergence is comparing summary statistics (i.e., Tajima's D) from different genomic sections because demographic processes usually leave genome-wide signatures while selection tends to imprint more localized regions [33]. Specifically, habitat-mediated purifying selection is associated with localized low values of nucleotide diversity (π) [34] and Tajima's D [35] and high scores of the Watterson's theta (θ) estimator [36] because only low-frequency polymorphisms can avoid being eliminated by widespread directional selection. Although recent population bottlenecks tend to achieve the same reduction in nucleotide variation, this pattern is expected at a more genome-wide level. Similarly, local adaptation tends to homogenize haplotypes within the same niche, fix polymorphisms in different populations, and eliminate lowfrequency polymorphism. Consequently, few haplotypes with high frequency are retained, corresponding to high values of nucleotide diversity (π) and Tajima's D and low scores of the Watterson's theta (θ) estimator [33]. While independent domestication events, extensive population structure, and population expansions after bottlenecks can produce the same patterns, these demographic processes also imprint genomes at a more genome-wide level.

Lessons from Common Bean on How Wild Relatives and Landraces Can Make Tropical Crops...

http://dx.doi.org/10.5772/intechopen.71669

119

In the following two sub-sections, we explain how to implement genome-environment associations in order to infer the adaptability of wild relatives and landraces to their natural habitats (Section 2.1) and discuss ways to account for causes, other than adaptation and selection, that may be shaping the genomic landscape of signatures associated with habitat

First of all, in order to account for possible demographic effects, subpopulation structure must be determined in geo-referenced landraces and wild accessions using principal coordinates analysis (PCoA) implemented in the software Trait Analysis by aSSociation, Evolution and Linkage, TASSEL v.5 [37]. The same dataset and software can be used to perform association

As a rule of thumb, a total of ten generalized (GLM) and mixed (MLM) linear models should be compared [40]. Within each model family, five models are usually built as follows: (1) models with the gene pool identity and the first two PCoA axes scores as covariates; (2) models with the within-gene pool subpopulation identity (e.g. [41]) and the first two PCoA axes scores as covariates; (3) models with the first two PCoA axes scores as covariates; (4) models with the within-gene pool subpopulation identity (e.g. [41]) as covariates; and (5) models with the gene pool identity as covariates. All five MLMs usually use a centered IBS kinship matrix as a random effect to control for genomic background implementing the EMMA and P3D algorithms to reduce computing time [42]. QQ-plots of the P-values should be inspected to assess whether excessive numbers of false positives are generated and choose in this way the optimum model. Significant associations are determined using strict Bonferroni corrections of P-values at alpha = 0.001, leading, for

ized PCoA and Manhattan diagrams can be carried out with the software R v.3.3.1 (R Core Team). Finally, candidate genes for habitat adaptation can be identified within the 1000 bp sections, flanking each SNP marker that is associated with a bioclimatic-based index by using the corresponding reference genome (e.g. [5]) and the PhytoMine and BioMart tools in Phytozome

**2.2. Accounting for genomic constrains by inspecting genome-wide patterns of variation**

In order to identify causes other than adaptation and selection that may be shaping the genomic landscape of signatures associated with habitat heterogeneity (i.e., genomic constrains and genetic drift), sliding window approaches (e.g., window size = 1 × 10<sup>6</sup> bps, step size = 200 kb) can be implemented to describe patterns of variation and overall divergence across the genome. For instance, SNP density, nucleotide diversity as measured by π [34],

in a usual dataset of ca. 23,000 SNP markers

) = 7.36. The construction of custom-

**2.1. Using genome-environment association scans to identify loci associated with** 

analyses between the SNP markers and bioclimatic-based indexes (e.g. [12, 38, 39]).

heterogeneity (Section 2.2).

**bioclimatic-based indexes**

example, to a significance threshold of 4.4 × 10−<sup>8</sup>

v.12 (phytozome.jgi.doe.gov).

(0.001 divided by the number of markers) or −log10(4.4 × 10−<sup>8</sup>

**Figure 2.** Multiple causes explain genome-environment associations. External processes, such as divergent selection, which is the main focus when assessing adaptation in wild relatives and landraces of crops, is only one of many possible causes. At the same time, the genomic background may be homogenized by gene flow. Similarly, background selection and genomic features in regions of reduced recombination rate and shared ancestral polymorphism (more prone to genetic drift due to their reduced effective population size) could induce hotspots of spurious genome-environment associations. Therefore, besides external processes driven by natural selection, both inherent properties of the genome and the demographic and evolutionary history of the crop influence the extent of the genome-environment associations. Modified from Ravinet, Faria [26].

corresponding to high values of nucleotide diversity (π) and Tajima's D and low scores of the Watterson's theta (θ) estimator [33]. While independent domestication events, extensive population structure, and population expansions after bottlenecks can produce the same patterns, these demographic processes also imprint genomes at a more genome-wide level.

Nonetheless, since genomic signatures associated with habitat heterogeneity can result from causes other than adaptation and selection [24, 25], for example, random genetic drift (**Figure 2**), and are also influenced by differences in ancestral variation and recombination in the genome [27–29], some further approaches need to be undertaken to clarify the truthful nature of the divergent regions. For instance, the origin of habitat-associated variants from novel or standing genetic variation leads to distinctively different patterns of genomic divergence [30–32]. One approach that can help to distinguish these underlying causes of divergence is comparing summary statistics (i.e., Tajima's D) from different genomic sections because demographic processes usually leave genome-wide signatures while selection tends to imprint more localized regions [33]. Specifically, habitat-mediated purifying selection is associated with localized low values of nucleotide diversity (π) [34] and Tajima's D [35] and high scores of the Watterson's theta (θ) estimator [36] because only low-frequency polymorphisms can avoid being eliminated by widespread directional selection. Although recent population bottlenecks tend to achieve the same reduction in nucleotide variation, this pattern is expected at a more genome-wide level. Similarly, local adaptation tends to homogenize haplotypes within the same niche, fix polymorphisms in different populations, and eliminate lowfrequency polymorphism. Consequently, few haplotypes with high frequency are retained,

118 Rediscovery of Landraces as a Resource for the Future

**Figure 2.** Multiple causes explain genome-environment associations. External processes, such as divergent selection, which is the main focus when assessing adaptation in wild relatives and landraces of crops, is only one of many possible causes. At the same time, the genomic background may be homogenized by gene flow. Similarly, background selection and genomic features in regions of reduced recombination rate and shared ancestral polymorphism (more prone to genetic drift due to their reduced effective population size) could induce hotspots of spurious genome-environment associations. Therefore, besides external processes driven by natural selection, both inherent properties of the genome and the demographic and evolutionary history of the crop influence the extent of the genome-environment associations.

Modified from Ravinet, Faria [26].

In the following two sub-sections, we explain how to implement genome-environment associations in order to infer the adaptability of wild relatives and landraces to their natural habitats (Section 2.1) and discuss ways to account for causes, other than adaptation and selection, that may be shaping the genomic landscape of signatures associated with habitat heterogeneity (Section 2.2).

## **2.1. Using genome-environment association scans to identify loci associated with bioclimatic-based indexes**

First of all, in order to account for possible demographic effects, subpopulation structure must be determined in geo-referenced landraces and wild accessions using principal coordinates analysis (PCoA) implemented in the software Trait Analysis by aSSociation, Evolution and Linkage, TASSEL v.5 [37]. The same dataset and software can be used to perform association analyses between the SNP markers and bioclimatic-based indexes (e.g. [12, 38, 39]).

As a rule of thumb, a total of ten generalized (GLM) and mixed (MLM) linear models should be compared [40]. Within each model family, five models are usually built as follows: (1) models with the gene pool identity and the first two PCoA axes scores as covariates; (2) models with the within-gene pool subpopulation identity (e.g. [41]) and the first two PCoA axes scores as covariates; (3) models with the first two PCoA axes scores as covariates; (4) models with the within-gene pool subpopulation identity (e.g. [41]) as covariates; and (5) models with the gene pool identity as covariates. All five MLMs usually use a centered IBS kinship matrix as a random effect to control for genomic background implementing the EMMA and P3D algorithms to reduce computing time [42]. QQ-plots of the P-values should be inspected to assess whether excessive numbers of false positives are generated and choose in this way the optimum model. Significant associations are determined using strict Bonferroni corrections of P-values at alpha = 0.001, leading, for example, to a significance threshold of 4.4 × 10−<sup>8</sup> in a usual dataset of ca. 23,000 SNP markers (0.001 divided by the number of markers) or −log10(4.4 × 10−<sup>8</sup> ) = 7.36. The construction of customized PCoA and Manhattan diagrams can be carried out with the software R v.3.3.1 (R Core Team).

Finally, candidate genes for habitat adaptation can be identified within the 1000 bp sections, flanking each SNP marker that is associated with a bioclimatic-based index by using the corresponding reference genome (e.g. [5]) and the PhytoMine and BioMart tools in Phytozome v.12 (phytozome.jgi.doe.gov).

#### **2.2. Accounting for genomic constrains by inspecting genome-wide patterns of variation**

In order to identify causes other than adaptation and selection that may be shaping the genomic landscape of signatures associated with habitat heterogeneity (i.e., genomic constrains and genetic drift), sliding window approaches (e.g., window size = 1 × 10<sup>6</sup> bps, step size = 200 kb) can be implemented to describe patterns of variation and overall divergence across the genome. For instance, SNP density, nucleotide diversity as measured by π [34], Watterson's theta (θ) estimator [36], and Tajima's D [35] can be computed using the software TASSEL v.5 [37] and customized R scripts. Results of all windowed analyses are usually plotted against window midpoints in millions of base pairs (Mb) in the software R v.3.3.1 (R Core Team). The centromeres can be marked to visualize the extent of the centromeric repeats and its correlation with overall patterns of diversity and divergence.

from the association analyses between those SNP markers and a bioclimatic-based drought index [12], indicated that GLM analyses likely had excessive rates of false positives whereas MLM models controlling for population structure and using a kinship matrix reduced more

Lessons from Common Bean on How Wild Relatives and Landraces Can Make Tropical Crops...

http://dx.doi.org/10.5772/intechopen.71669

121

In that particular case, the MLM model with the first two PCoA axes scores used as covariates was the best at controlling for false positives. This model yielded a total of 115 SNP markers associated with the bioclimatic-based drought index at a Bonferroni-corrected significance threshold of 7.36 –log10(P-value). These markers explained on average 51.3% ± 0.4 of the variation in the bioclimatic-based drought index. The 115 SNPs were clustered in 90 different regions, defined as overlapping 1000 bp sections that flanked associated markers (**Figure 3**).

Following the previous example, chromosomes Pv3 and Pv8 had the highest number of associated SNPs with 21 and 32 SNPs clustered in 16 and 21 different regions, respectively. Chromosomes Pv1, Pv2, Pv4, Pv5, Pv6, and Pv9 contained an intermediate number of associated SNPs with 11, 6, 11, 7, 12, and 9 SNPs clustered in 11, 6, 8, 6, 8, and 9 different regions, respectively. Chromosomes Pv7, Pv10, and Pv11 had the fewest number of associated SNPs with 3, 2, and 1 SNPs clustered in 3, 1, and 1 different regions, respectively. Chromosome Pv8 had more regions with at least 2 associated SNPs than any other chromosome, and these regions had more associated SNPs than in any other chromosome for a total of 5 regions with an average number of associated SNPs of 3.2. The single region that contained more associated SNPs was also situated in chromosome Pv8 with 6 SNPs explaining on average 51.1% ± 0.3 of the variation in the bioclimatic-based drought index. After chromosome Pv8, Pv3 was also outstanding, having 4 regions (with at least 2 associated SNPs) with an average number of associated SNPs of 2.5. Therefore, a total of 75 regions, comprising 99 SNP markers associated with the bioclimatic-based drought index, contained at least 1 gene for a total of 77 genes. Most genes were in chromosomes Pv1, Pv3, and Pv8 with 11, 14, and 16 genes. Only two regions, at chromosomes Pv1 and Pv8 and containing a total of seven different SNPs, spanned two or more genes. The one in Pv8 was the region with more associated SNPs (six in total). One of the two genes in this region encoded an Ankyrin repeat-containing protein, which was associated with osmotic regulation via the assembly of cation channels in the membranes [45]. Among other identified candidate genes, there was a phototropic-responsive

Associated SNPs and regions were widespread in all 11 common bean chromosomes.

**3.2. Rampant divergent selection: interpreting genomic signatures of adaptation** 

As a follow-up of the previous example, associated genomic windows were enriched for SNP density and positive Tajima's D scores. This conclusion was achieved after implementing a sliding window analysis to explore the patterns of genome-wide diversity (**Figure 3**). Marker density decayed drastically toward the centromeres. This decay in diversity proportional to the decay in the rate of recombination was first described in *D. melanogaster* and has been confirmed in many organisms since then. The correlation was initially understood as an effect of genetic hitchhiking, but background selection has been increasingly appreciated as a contributing factor

effectively the false positive rate.

NPH3 gene [46] in Pv3.

**in common bean beyond genomic constrains**

[28], perhaps in many cases the dominating one.

It is advisable to calculate bootstrap-based means and 95% confidence intervals around the mean for some summary statistics (i.e., SNP density, π, θ, and Tajima's D) when computed in sliding windows that contained or did not contain at least one marker that was associated with a bioclimatic-based index. For this, each summary statistic of windows containing and not containing associated SNPs should be randomly resampled with replacement (bootstrapping) across windows within grouping factors (associated vs. not associated). The overall mean is then stored for each grouping factor. This step should be iterated at least 1000 times using customized R scripts. Bootstrapping must be performed independently for each summary statistic in order to eliminate correlations among these.
