**Association Mapping in Plant Genomes**

Braulio J. Soto-Cerda¹,² and Sylvie Cloutier¹

*¹Cereal Research Center, Agriculture and Agri-Food Canada, Canada; ²Agriaquaculture Nutritional Genomic Center, CGNA (R10C1001), Genomics and Bioinformatics Unit, Chile*

#### **1. Introduction**

One of the many goals of plant geneticists and breeders is to explain phenotypic variation in terms of changes in DNA sequence (Myles et al., 2009). The development of molecular markers for the detection and exploitation of DNA polymorphisms in plant systems is one of the most significant developments in the field of molecular biology and biotechnology. Linkage mapping has been a key tool for identifying the genetic basis of quantitative traits in plants. However, linkage studies require suitable crosses, which are sometimes limited by low polymorphism or small population size. In addition, only two alleles per locus and few recombination events are considered to estimate the genetic distances between marker loci and to identify the causative genomic regions for quantitative trait loci (QTL), thereby limiting the mapping resolution. To circumvent these limitations, linkage disequilibrium (LD) mapping or association mapping (AM) has been used extensively to dissect human diseases (Slatkin, 2008). This approach has received increased attention during the last few years. AM has the potential to identify a single polymorphism within a gene that is responsible for phenotypic differences. AM involves searching for genotype-phenotype correlations among unrelated individuals. Its high resolution is accounted for by the historical recombination accumulated in natural populations and collections of landraces, breeding materials and varieties. By exploiting broader genetic diversity, AM offers three main advantages over linkage mapping: higher mapping resolution, a greater number of alleles and time saved in establishing a marker-trait association and applying it in a breeding program (Flint-Garcia et al., 2003). Although AM presents clear advantages over linkage mapping, the two approaches are often applied in conjunction, especially to validate the associations identified by AM, thus reducing spurious associations.

The inherent nature of AM brings its own limits such as the fact that biological and evolutionary factors affect LD distribution and mapping resolution. The strength of AM relies on the analysis of common variants, which explain at most 5%-10% of the heritable component of human diseases (Asimit & Zeggini, 2010). The role of rare variants in explaining the remaining heritable variation is becoming more important. New statistical models for AM are being developed to better consider rare variants because early methods allocated most of the statistical power to higher frequency alleles.

Since most of the traits important for environmental fitness and agricultural value are quantitative in nature (Yu & Buckler, 2006), there is tremendous interest in using AM to examine them. In this chapter, we introduce the concept of linkage disequilibrium, which plays a central role in association analysis. For this reason, it is critical to understand LD measurement, its variation across the genome and how it is affected by population structure and relatedness. Recent AM studies in plants, advantages and disadvantages of AM, and its integration with other mapping methods are also reviewed and discussed. An overview of the software currently available for AM and their main characteristics is presented. Future perspectives of AM in plants, application in other emerging research areas, potential usefulness for new cultivar development and for the conservation of adaptive genetic variation are outlined.

#### **2. Linkage disequilibrium and association mapping concepts**

The terms LD and AM have often been used interchangeably in the literature. However, they present subtle differences. According to Gupta et al. (2005), AM refers to the significant association of a marker locus with a phenotypic trait while LD refers to the non-random association between two markers or two genes/QTLs (Figure 1). Thus, AM is actually an application of LD. In other words, two markers in LD represent a non-random association between alleles, but do not necessarily correlate/associate with a particular phenotype, whereas association implies statistical significance and refers to the covariance of a marker and a phenotype of interest. Although it lies outside the scope of this section, we would like to also clarify the difference between linkage and LD because they too are commonly confused. Linkage refers to the correlated inheritance of loci through the physical connection on a chromosome, whereas LD refers to the correlation between alleles in a population (Flint-Garcia et al., 2003). Although tight linkage between alleles on the same chromosome generally translates into high LD, significant LD may also exist between distant loci, and even between loci located on different chromosomes. The latter, reviewed in depth below, is the result of other forces such as selection, mutation, mating system, population structure, etc.

Both QTL and AM approaches are therefore based on LD between molecular markers and functional loci. In QTL mapping, LD is generated by the mating design while in AM, LD is a reflection of the germplasm collection under study (Stich & Melchinger, 2010). In a mapping population, LD is influenced only by recombination in the absence of segregation distortion. In AM, LD may also be influenced by other forces such as those mentioned above as well as by recombination.

The concept of LD was first described by Jennings in 1917, and its quantification (*D*) was developed by Lewontin in 1964 (Abdurakhmonov & Abdukarimov, 2008). Put simply, the commonly used LD measure *D*, or *D′* (the standardized version of *D*), is the difference between the observed gametic frequencies of haplotypes and the gametic frequencies of haplotypes expected under linkage equilibrium.

$$D = P_{AB} - P_A P_B \tag{1}$$

where $P_{AB}$ is the frequency of gametes carrying alleles A and B at two loci, and $P_A$ and $P_B$ are the frequencies of alleles A and B, respectively, so that their product $P_A P_B$ is the haplotype frequency expected under linkage equilibrium. In the absence of other forces, recombination through random mating breaks down LD according to $D_t = D_0 (1 - r)^t$, where $D_t$ is the LD remaining between two loci after $t$ generations of random mating, $D_0$ is the original LD and $r$ is the recombination fraction between the two loci (Zhu et al., 2008).
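Equation 1 and the decay relationship can be illustrated with a short Python sketch. The function names and the frequencies used below are hypothetical and chosen only for illustration; they do not come from any data set discussed in this chapter.

```python
def ld_coefficient(p_ab: float, p_a: float, p_b: float) -> float:
    """LD coefficient D = P_AB - P_A * P_B (equation 1)."""
    return p_ab - p_a * p_b


def ld_after_generations(d0: float, r: float, t: int) -> float:
    """LD remaining after t generations of random mating: D_t = D_0 * (1 - r)^t."""
    return d0 * (1.0 - r) ** t


# Hypothetical frequencies: haplotype AB at 0.40, allele A at 0.50, allele B at 0.50.
d0 = ld_coefficient(0.40, 0.50, 0.50)

# Unlinked loci (r = 0.5) lose half of their LD every generation,
# whereas tightly linked loci (r = 0.01) retain most of it for many generations.
for r in (0.5, 0.1, 0.01):
    decay = [round(ld_after_generations(d0, r, t), 4) for t in (0, 1, 10, 100)]
    print(f"r = {r}: D_t at t = 0, 1, 10, 100 -> {decay}")
```

The loop makes the role of the recombination fraction explicit: the smaller *r* is, the longer the initial disequilibrium persists.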

Fig. 1. Principles of linkage disequilibrium and association mapping. a. Linkage disequilibrium. Locus 1 and Locus 2 present a non-random pattern of association between alleles A-G and T-C, which deviates from the frequencies expected under linkage equilibrium, but without any statistical correlation with a phenotype. b. Association mapping. Locus 1 and Locus 2 are in LD. Significant covariance with the seed colour phenotype is considered evidence of association.

Several statistics have been proposed for LD, and these measurements generally differ in how they are affected by marginal allele frequencies and sample sizes. Here, we introduce the two most widely used statistics for LD. *D′* (Lewontin, 1964) and *r*², the square of the correlation coefficient between two loci (Hill & Robertson, 1968), reflect different aspects of LD and perform differently under various conditions. *D′* only reflects the recombinational history and is therefore a more accurate statistic for estimating recombination differences, whereas *r*² summarizes both recombinational and mutational history (Flint-Garcia et al., 2003). For two biallelic loci, *D′* and *r*² are defined as follows:

$$D' = \frac{|D|}{D_{\max}} \tag{2}$$

$$D_{\max} = \min(P_A P_b,\; P_a P_B) \quad \text{if } D \ge 0;$$

$$D_{\max} = \min(P_A P_B,\; P_a P_b) \quad \text{if } D < 0$$

$$r^2 = \frac{D^2}{P_A P_a P_B P_b} \tag{3}$$

In equations 2 and 3, $P_a$ and $P_b$ are the frequencies of the alternative alleles a and b at the two loci. *D* is limited because its range is determined by allele frequencies. *D′* was developed to partially normalize *D* with respect to the maximum value possible for the allele frequencies and give it a range between 0 and 1 (Zhu et al., 2008). The *r*² statistic has an expectation of 1/(1 + 4*Nc*), where *N* is the effective population size and *c* is the recombination rate, and it also varies between 0 and 1 (Hill & Robertson, 1968).
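As a minimal sketch of equations 2 and 3, the Python functions below compute *D*, *D′* and *r*² from one haplotype frequency and two allele frequencies, together with the expected *r*² of 1/(1 + 4*Nc*). The function names, argument names and example values are assumptions made for illustration only.

```python
def ld_statistics(p_ab: float, p_a: float, p_b: float) -> tuple:
    """Return (D, D', r^2) for two biallelic loci.

    p_ab is the frequency of the A-B haplotype; p_a and p_b are the
    frequencies of alleles A and B at the first and second locus.
    """
    q_a, q_b = 1.0 - p_a, 1.0 - p_b  # frequencies of the alternative alleles a and b
    if min(p_a, q_a, p_b, q_b) <= 0.0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * q_b, q_a * p_b)
    else:
        d_max = min(p_a * p_b, q_a * q_b)
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * q_a * p_b * q_b)
    return d, d_prime, r2


def expected_r2(n_e: float, c: float) -> float:
    """Expected r^2 = 1 / (1 + 4 * N * c) for effective size N and recombination rate c."""
    return 1.0 / (1.0 + 4.0 * n_e * c)


# Hypothetical case: the rare allele B (frequency 0.1) occurs only together with A.
d, d_prime, r2 = ld_statistics(p_ab=0.1, p_a=0.5, p_b=0.1)
print(f"D = {d:.3f}, D' = {d_prime:.2f}, r2 = {r2:.3f}")
```

In this made-up case *D′* reaches its maximum of 1 because one of the four haplotypes is absent, while *r*² is only about 0.11 because the allele frequencies at the two loci differ. This shows why the two statistics can lead to different conclusions for the same pair of loci.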

Choosing the appropriate LD statistic depends on the objective of the study. Most studies on LD in animal populations used *D′* to measure population-wide LD of microsatellite data (Du et al., 2007). However, *D′* is inflated by small sample size and low allele frequencies; therefore, intermediate values of *D′* are unsafe for comparative analyses of different studies and should be verified with *r*² before being used for quantification of the extent of LD (Oraguzie et al., 2007). Although *r*² is still considered to be allele frequency dependent, the bias due to allele frequency is considerably smaller than in *D′* (Ardlie et al., 2002). Currently, most LD mapping studies in plants use *r*² for LD quantification because it also provides information about the correlation between markers and QTL of interest (Flint-Garcia et al., 2003; Gupta et al., 2005). Typically, *r*² values of 0.1 or 0.2 are considered the minimum thresholds for significant association between pairs of loci and are used to describe the maximum genetic or physical distance at which LD is significant (Zhu et al., 2008).

#### **3. Visualization and statistical significance of LD**

Since *D′* and *r*² are pairwise measurements between polymorphic loci, it is difficult to obtain summary statistics of LD across a region (Gupta et al., 2005). There are two common ways to visualize the extent of LD and the genomic regions or haplotype blocks found to be in significant LD. LD scatter plots are used to estimate the rate at which LD declines with genetic or physical distance (Figure 2a). An average genome-wide decay of LD can be estimated by plotting LD values, from a data set covering an entire genome, against distance. Alternatively, the extent of LD can be estimated for a particular region carrying a gene/QTL of interest previously identified by linkage mapping. These scatter plots are useful to determine the average distance up to which LD remains above the chosen significance threshold (commonly 0.5 for *D′* and 0.1 for *r*²), based on the curve of a nonlinear logarithmic trend drawn through the data points of the scatter plot (Breseghello & Sorrells, 2006). Disequilibrium matrices or LD heat maps are also very useful for visualizing the linear arrangement of LD between polymorphic sites within a short physical distance such as a gene, along an entire chromosome or across the whole genome (Figure 2b) (Flint-Garcia et al., 2003). LD heat maps are colour-coded triangular plots where the diagonal represents ordered loci and the pixels of different colour intensity depict the level of significant pairwise LD expressed as *D′* or *r*². Blocks of high intensity pixels afford an easy visualization of loci in significant LD. In this figure, the larger the blue blocks of haplotypes along the diagonal of the triangular plot, the higher the level and extent of LD between adjacent loci in the blocks, meaning that there has been either limited or no recombination since the LD block formation (Abdurakhmonov & Abdukarimov, 2008). These graphical representations enable us to determine the optimum number of markers to detect significant marker-trait associations and the resolution at which a QTL can be mapped. Because LD estimation based on *D′* or *r*² can be sensitive to marker density, highly saturated and representative linkage groups are ideal for LD calculations.
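The use of a logarithmic trend line to read an LD decay distance off a scatter plot can be sketched as follows. This is only one simple way to do it, assuming a trend of the form *r*² = a + b ln(distance) fitted by ordinary least squares; the function names and the pairwise values are made up for illustration and are not the flax data shown in Figure 2.

```python
import math


def fit_log_trend(distances, r2_values):
    """Least-squares fit of r2 = a + b * ln(distance); returns (a, b)."""
    xs = [math.log(d) for d in distances]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(r2_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, r2_values))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def distance_at_threshold(a, b, threshold=0.1):
    """Distance at which the fitted trend falls to the given r2 threshold."""
    return math.exp((threshold - a) / b)


# Made-up pairwise values (distance in cM, r2), for illustration only.
pairs = [(0.5, 0.45), (1, 0.38), (2, 0.30), (5, 0.21), (10, 0.14), (20, 0.08), (40, 0.03)]
a, b = fit_log_trend([d for d, _ in pairs], [y for _, y in pairs])
decay_distance = distance_at_threshold(a, b, threshold=0.1)
print(f"r2 = {a:.3f} + ({b:.3f}) * ln(d); trend reaches r2 = 0.1 near {decay_distance:.1f} cM")
```

The distance returned by `distance_at_threshold` gives a rough indication of the marker spacing needed to keep adjacent markers in significant LD under the chosen threshold.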

The statistical significance of LD is typically determined using a χ² test of a 2 × 2 contingency table. A *p*-value threshold of 0.05 is often used to declare lack of independence of alleles at two loci, thus suggesting association (Gupta et al., 2005). From a 2 × 2 contingency table, the probability (*P*) of independence of alleles at two loci is generally calculated through a Fisher's exact test (Fisher, 1935; as cited in Gupta et al., 2005). Statistically significant LD can also be calculated using a multifactorial permutation analysis to compare sites with more than two alleles such as microsatellite markers. These statistical methods are implemented in software such as PowerMarker (Liu & Muse, 2005) and TASSEL (Trait Analysis by aSSociation Evolution and Linkage) (Bradbury et al., 2007).
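The two tests mentioned above can be sketched in plain Python for a 2 × 2 table of haplotype counts. This is an illustrative implementation under stated assumptions, not the code used by PowerMarker or TASSEL: the χ² test is computed without a continuity correction, the Fisher's exact test is two-sided, and the counts are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 df, no continuity correction).

    a, b, c, d are the counts of haplotypes AB, Ab, aB and ab.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("each allele must be observed at least once")
    # This statistic equals n * r2, with r2 as defined in equation 3.
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square tail probability for 1 df
    return chi2, p_value


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test: sum of all tables no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x haplotypes in the AB cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical haplotype counts: AB = 40, Ab = 10, aB = 10, ab = 40.
chi2, p_chi2 = chi_square_2x2(40, 10, 10, 40)
p_fisher = fisher_exact_2x2(40, 10, 10, 40)
print(f"chi2 = {chi2:.1f}, p = {p_chi2:.2e}; Fisher's exact p = {p_fisher:.2e}")
```

For large samples the two tests generally agree; the exact test is the safer choice when some haplotype classes are rare, which is common for low-frequency alleles.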

Fig. 2. Visualization of linkage disequilibrium in flax (*Linum usitatissimum* L.). a. Scatter plot of LD decay (*r*²) against genetic distance (cM), representing a measure of an average genome-wide LD. b. Heatmap of LD variation between pairwise polymorphic loci of four linkage groups. Blocks in significant LD are highlighted by red triangles. LD distribution is heterogeneous within and between linkage groups.
