**5. Constructing a core collection to make use of genetic diversity**

The large number of accessions in a germplasm collection (7128 in our case) makes it difficult to choose the most promising ones for utilization. One feasible approach is to develop a core collection, which is intended to contain, with minimum redundancy, the maximum genetic diversity of a crop species and its wild relatives (Brown, 1989a, 1989b; Frankel and Brown, 1984). A core collection can enhance the use of germplasm collections in crop improvement programs and simplify their management. Choosing an appropriate sampling strategy is an important prerequisite for constructing a core collection of suitable size that adequately represents the genetic spectrum and captures as much as possible of the genetic diversity in the available collection. Our studies aimed to evaluate how sample size, clustering method, sampling method, and data type affect the construction of a core collection, and to identify an optimal strategy with respect to these factors.

Fig. 4. Modified Rogers' distance (MRD) between waxy and non-waxy rice across the genome. Dashed lines indicate the average MRD across the genome and dotted lines the average MRD for each chromosome. Vertical lines on the x axis indicate the genetic map positions of the SSR markers on the chromosome.

Three sampling strategies, three kinds of trait data, eight hierarchical clustering methods, and 15 sampling proportions were compared in order to choose the optimal construction strategy. Twelve evaluation parameters were used to assess the validity of sampling, and analysis of variance (ANOVA) and multiple comparisons were applied to compare the strategies.

The ANOVA showed significant differences among clustering methods, data types, sample sizes, and sampling methods (Table 2). There were also significant interaction effects between these factors, except for clustering method × sampling method and sample size × sampling method. These results indicate that the factors and their interactions affect the construction strategy and must be considered carefully.

For the sampling methods, preferred sampling plus multiple clustering with sampling on the degree of variation performed better than preferred sampling plus multiple clustering with random sampling, and completely random sampling was the worst. Among the eight clustering methods, clustering by the shortest distance gave the best genetic diversity index, average Shannon-Weaver index, percentage of phenotypes retained, and variance of phenotypic frequency. Among the three data types (qualitative trait data, quantitative trait data, and integrated qualitative and quantitative trait data), core collections constructed from the integrated data retained the greatest genetic diversity. As for the sampling rate, a rate of 3.4% to 24% was sufficient to retain the greatest genetic diversity of the initial population (Tables 3-6).
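The "shortest distance" criterion corresponds to what is usually called single-linkage clustering: the distance between two clusters is the smallest distance between any pair of their members. The following minimal sketch illustrates that criterion only, not the full construction procedure used in this study. The function name, the Euclidean distance, and the toy trait vectors are assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def single_linkage(points, n_clusters):
    """Agglomerative clustering with the shortest-distance (single-linkage) rule.

    points     -- list of numeric trait vectors (hypothetical, standardized data)
    n_clusters -- number of clusters at which merging stops
    Returns a list of clusters, each a sorted list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: cluster distance = distance of the closest pair.
            d = min(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # b > a, so position a is unaffected
    return clusters


# Toy example: two tight groups of accessions described by two traits.
accessions = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9)]
groups = single_linkage(accessions, 2)
```

In a core-collection workflow, accessions would then be sampled from within each cluster; how that sampling is weighted is a separate design choice that this sketch does not cover.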

Finally, a core collection was constructed using preferred sampling plus multiple clustering with sampling on the degree of variation, clustering by the shortest distance, and the integrated qualitative and quantitative data. This core collection contains 150 accessions drawn from the 2262 accessions of Ting's collection that have fully recorded data, accounting for 6.6% of that initial collection.
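Two of the evaluation parameters named above can be sketched in a few lines. The definitions below are the commonly used ones (Shannon-Weaver index H' = −Σ p_i ln p_i over phenotype classes, and the share of phenotype classes of the whole collection that survive in the core subset). The exact formulas used in this study are not given in the text, so the function names and the toy data are assumptions.

```python
from collections import Counter
from math import log


def shannon_weaver(phenotypes):
    """H' = -sum(p_i * ln p_i) over the phenotype classes of one trait."""
    counts = Counter(phenotypes)
    n = len(phenotypes)
    return -sum((c / n) * log(c / n) for c in counts.values())


def ratio_phenotype_retained(core, whole):
    """Share of the phenotype classes in `whole` that also occur in `core`."""
    return len(set(core) & set(whole)) / len(set(whole))


# Hypothetical scores of one qualitative trait (e.g. a colour class).
whole_collection = ["a", "a", "a", "b", "b", "c", "c", "d"]
core_subset = ["a", "b", "c"]

h_whole = shannon_weaver(whole_collection)
h_core = shannon_weaver(core_subset)
retained = ratio_phenotype_retained(core_subset, whole_collection)

# Sampling proportion reported in the text: 150 of 2262 accessions.
sampling_rate = 150 / 2262
```

In this toy case the subset loses the rare class "d", so the ratio of phenotype retained drops to 0.75 even though three of four classes are sampled evenly.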


Table 2. Analysis of variance for the Shannon-Weaver diversity index of different subsets

| Clustering method\* | Ratio of phenotype retained | Variance of phenotypic frequency | Average diversity index |
|---|---|---|---|
| 1 | 0.9983A | 0.1526C | 2.0142A |
| 2 | 0.9968BC | 0.1619B | 2.013A |
| 3 | 0.9973BC | 0.1630B | 1.9958C |
| 4 | 0.9977ABC | 0.1659A | 2.0032B |
| 5 | 0.9978BA | 0.1616B | 2.0138A |
| 6 | 0.9968C | 0.1620B | 2.0123A |
| 7 | 0.9970BC | 0.1621B | 2.0129A |
| 8 | 0.9972BC | 0.1617B | 2.0129A |
| 9 | 0.9512D | 0.1636B | 1.974D |

\* Clustering with 1. shortest distance, 2. complete linkage method, 3. median distance, 4. the centroid method, 5. average linkage, 6. the flexible-beta method, 7. the flexible method, 8. Ward's minimum variance method, and 9. completely random method.

\*\* Values followed by different letters differ significantly at the 0.01 level.

Table 3. Multiple comparison for different clustering methods\*\*

| Sampling method\* | Diversity index for qualitative trait | Average diversity index |
|---|---|---|
| 1 | 0.581A | 2.005B |
| 2 | 0.577A | 2.013A |
| 3 | 0.558B | 1.974C |

\* 1. preferred sampling plus multiple clustering with random sampling, 2. preferred sampling plus multiple clustering with sampling on the degree of variation, 3. completely random sampling.

\*\* Values followed by different letters differ significantly at the 0.01 level.

Table 4. Multiple comparison for different sampling methods\*\*

| Data type\* | Ratio of phenotype retained | Variance of phenotypic frequency | Average diversity index |
|---|---|---|---|
| 1 | 0.996A | 0.1509C | 2.019A |
| 2 | 0.9923B | 0.18A | 1.99B |
| 3 | 0.9958A | 0.1524B | 2.017A |

\* 1. integrated qualitative and quantitative trait data, 2. qualitative trait data, and 3. quantitative trait data.

\*\* Values followed by different letters differ significantly at the 0.01 level.

Table 5. Multiple comparison for different data types\*\*

\*\* Values followed by different letters differ significantly at the 0.01 level.

Table 6. Multiple comparison for different sample size\*\*

#### **6. Association mapping with the rice core collection**

Although crop germplasm resources harbour a large number of exotic genes, their rich genetic variation has not been fully explored and utilized because appropriate statistical methods have been lacking.

The conventional method for mining genes from germplasm is linkage mapping. Identifying QTLs by linkage mapping requires constructing one or several segregating populations by crossing parents (e.g., F2, doubled haploid, or backcross populations), in which the linkage mapping is then carried out. The accuracy of QTL mapping depends largely on the few parents selected, and only two or a few alleles from those parents can be detected. As a result, the abundant genetic variation stored in germplasm remains largely untapped. If conventional QTL linkage mapping were used to mine the genetic variation in a large germplasm population, diallel crosses among all studied accessions would be needed; such a mapping population is hard to develop and would demand far more time, cost, space, and analysis.

An alternative is association mapping based on linkage disequilibrium (LD) analysis, which might be an effective way to identify gene function (or a targeted high-resolution QTL) and has been successfully applied in human genetics to detect loci underlying simple as well as complex diseases (Corder et al., 1994; Kerem et al., 1989). This method exploits the LD between DNA polymorphisms and the genes underlying traits. LD refers to the non-random association of alleles at different loci, and its main cause in a population is genetic linkage between loci. It is therefore possible to detect QTLs by identifying LD between marker loci and potential QTLs: by screening many marker loci across the genome, or near candidate genes, the markers that are tightly linked to, and correlated with, QTLs can be found. Applying association mapping to plant breeding appears to be a promising way to overcome the limitations of conventional linkage mapping (Kraakman et al., 2004).
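To make the notion of LD concrete, the sketch below computes the squared allele-frequency correlation r² for the simplest case of two biallelic loci scored on inbred lines. It is an illustration under stated assumptions: the 0/1 allele coding, the function name, and the toy data are hypothetical, and multi-allelic SSR data such as those used in this study would need to be recoded or handled with a multi-allelic estimator.

```python
def ld_r2(haplotypes):
    """Squared allele-frequency correlation r^2 between two biallelic loci.

    haplotypes -- list of (a, b) pairs, one per inbred line, where a and b
                  are 0/1 allele codes at locus A and locus B.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n      # frequency of allele 1 at A
    p_b = sum(b for _, b in haplotypes) / n      # frequency of allele 1 at B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                         # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


# Complete association: allele 1 at A always occurs with allele 1 at B.
linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
# No association: every allele combination is equally frequent.
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]

r2_linked = ld_r2(linked)
r2_independent = ld_r2(independent)
```

A marker in strong LD with a causal locus (r² close to 1) carries nearly the same information as the locus itself, which is what allows a marker-trait association to stand in for the unobserved QTL.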

Furthermore, choosing appropriate germplasm that maximizes genetic diversity and the number of historical recombination and mutation events (and thus reduces LD) within and around the gene of interest is critical for the success of association analysis (Yan et al., 2011). As described above, a core collection is a subset of the original collection that contains the maximum genetic variability of the gene pool in a minimum number of samples. Association mapping with a core collection therefore helps to capture as much phenotypic variation as possible and combines the advantages of association mapping and core collections, and could thus be an effective way to mine and utilize the abundant genetic diversity in crop germplasm resources.

#### **6.1 Population structure and LD pattern**

Population structure is an important component of association mapping analysis because accounting for it can reduce both type I and type II errors in associations between molecular markers and traits of interest in an inbreeding species. Moreover, a low level of LD can make whole-genome scanning impractical because of the excessive number of markers required, and the resolution of association studies in a test sample depends on the structure of LD across the genome. Information about the population structure and the extent of LD within the population is therefore of fundamental importance for association mapping.

The rice core collection of 150 varieties was genotyped with 274 SSR markers. Based on the genotyping data, the STRUCTURE software was run to detect the number of subgroups within the core collection and to assign varieties to subgroups, using a membership probability of 0.80 as the threshold. To compare and confirm the STRUCTURE subgroups, an additional principal component analysis was performed.

STRUCTURE indicated that the entire population could be divided into two subgroups (SG 1 and SG 2) (Fig. 5). With a membership probability threshold of ≥ 0.80, 111 varieties were assigned to SG 1, 21 to SG 2, and the remaining 18 to an admixed group (AD) (Fig. 5). Principal component analysis confirmed this structure: the varieties from SG 1 and SG 2 formed two distinct clusters, and those from AD lay between them (Fig. 5). The varieties in SG 1 are mainly *indica* rice and those in SG 2 mainly *japonica* rice, whereas those in AD are intermediate (*indica*- or *japonica*-inclined) rice. Furthermore, varieties from the same cultivation zone clustered closely together.
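The assignment rule described above can be sketched as follows. The membership coefficients, variety names, and function name are hypothetical; only the 0.80 threshold and the subgroup labels come from the text.

```python
def assign_subgroup(q_row, threshold=0.80):
    """Assign a variety to the subgroup whose membership reaches the threshold.

    q_row -- membership coefficients of one variety, one value per subgroup.
    Returns "SG k" for the qualifying subgroup, or "AD" (admixed) if none
    reaches the threshold.
    """
    best = max(range(len(q_row)), key=lambda k: q_row[k])
    return f"SG {best + 1}" if q_row[best] >= threshold else "AD"


# Hypothetical membership coefficients for two subgroups (rows sum to 1).
q_matrix = {
    "variety_1": (0.97, 0.03),
    "variety_2": (0.12, 0.88),
    "variety_3": (0.55, 0.45),
}
assignments = {name: assign_subgroup(q) for name, q in q_matrix.items()}

# Reported group sizes: 111 (SG 1) + 21 (SG 2) + 18 (AD) = 150 varieties.
total_assigned = 111 + 21 + 18
```

A stricter threshold moves more varieties into the admixed group, so the choice of 0.80 trades subgroup purity against subgroup size.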

LD, measured as the squared correlation of allele frequencies (r²) between pairs of loci, was calculated for the core collection and for different germplasm types (Table 7). The average r² between linked loci (loci on the same chromosome) varied between 0.0188 and 0.1. Using the 95% quantile of r² between unlinked loci pairs as the threshold, 6.23% of linked loci pairs were in significant LD. Across germplasm types (*indica*, *japonica*, early-season, late-season, waxy, and non-waxy rice), the percentage of loci pairs in LD varied between 5.33% and 6.36%. LD (r²) was plotted against genetic map distance (cM) between linked loci pairs, and a nonlinear regression of r² on genetic map distance was fitted according to Heuertz et al. (2006) (Fig. 6). LD decayed with genetic distance, which indicates that linkage may be the main cause of LD. LD decayed to the threshold, i.e. the 95% quantile of r² between unlinked loci pairs, at 1.03 cM in the entire collection (Table 7). The cut-off decay distances for *indica*, *japonica*, early-season, late-season, waxy, and non-waxy rice were 0.89, 1.10, 1.00, 1.04, 0.87, and 1.01 cM, respectively, corresponding to about 200-500 kb of physical distance. These results indicate that the choice of the core collection could maximize the number of historical recombination and mutation events and thus reduce LD within and around the gene of interest, which is critical for the success of association analysis (Yan et al., 2011). Such a short LD decay distance suggests that fine mapping of desirable genes with a core collection could be possible. However, the low percentage of linked loci pairs in LD and the rapid decay of LD also indicate that the marker density for genome-wide association mapping should be much higher than in our study. Given the LD decay distance in the core collection and the roughly 1700 cM map length of the rice genome, genome-wide association mapping with such a core collection would, at least in theory, require more than 1700 markers. If higher power is needed, even more markers would be required.

Fig. 5. Principal component analysis of the rice core collection combined with STRUCTURE subgroup assignment. PC 1 and PC 2 refer to the first and second principal components, respectively. The numbers in parentheses give the proportion of variance explained by each principal component. Symbols indicate different types of rice, and colors indicate the subgroups from STRUCTURE. FJ: foreign *japonica*; IG: glutinous *indica*; J: early-season *japonica*; J II: late-season *japonica*; JG: glutinous *japonica*; P: early-season *indica* from the Pearl River region or south China; P II: late-season *indica* from the Pearl River region; R: early-season red grain rice; R II: late-season red grain rice; Y: early-season *indica* from the Yangtze River region; Y II: late-season *indica* from the Yangtze River region; N: unknown origin.

Table 7. Linkage disequilibrium (measured as r²) for linked loci, percentage of linked loci pairs in LD, and cut-off decay distance for the core collection and different germplasm types.

Fig. 6. Plot of linkage disequilibrium, measured as the squared correlation of allele frequencies (r²), against genetic map distance (cM) between linked loci pairs in the core collection. The red line is the nonlinear regression trend line of r² vs. genetic map distance. The dashed line indicates the 95% quantile of r² between unlinked loci pairs.
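The significance threshold for LD, the 95% quantile of r² between unlinked loci pairs, and the resulting share of linked pairs in LD can be sketched as below. The nearest-rank quantile rule is one of several conventions and is an assumption here, as are the function names and the r² values, which are invented for illustration.

```python
from math import ceil


def quantile_95(values):
    """95% quantile by the nearest-rank rule (one of several conventions)."""
    ordered = sorted(values)
    rank = ceil(95 * len(ordered) / 100)   # 1-based rank
    return ordered[rank - 1]


def share_in_ld(linked_r2, unlinked_r2):
    """Fraction of linked loci pairs whose r^2 exceeds the unlinked threshold."""
    threshold = quantile_95(unlinked_r2)
    return sum(r > threshold for r in linked_r2) / len(linked_r2)


# Hypothetical r^2 values, for illustration only.
unlinked = [i / 1000 for i in range(1, 101)]   # 0.001, 0.002, ..., 0.100
linked = [0.01, 0.02, 0.05, 0.12, 0.30]

threshold = quantile_95(unlinked)
fraction = share_in_ld(linked, unlinked)
```

Using unlinked pairs as the null distribution means the threshold absorbs background LD caused by population structure or drift, so only LD beyond that background is counted as significant.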

#### **6.2 Association mapping**

Mining elite genes from rice germplasm is important for the improvement of cultivated rice. Genome-wide association mapping was therefore carried out with the rice core collection using 274 SSR markers.

All 150 rice varieties were grown at the farm of South China Agricultural University, Guangzhou (23°16′N, 113°8′E), during the late season (July-November) in two consecutive years (2008 and 2009). Yield-related traits (such as grain weight, filled grains, and tillers per plant) were measured in both years. As an example, yield per panicle (g) was used for association mapping.

STRUCTURE was used to infer historical lineages showing clusters of similar genotypes and to obtain the Q matrices (Pritchard et al., 2000). The kinship matrix (K) was calculated with SPAGeDi (Hardy and Vekemans, 2002). Quantile-quantile plots of the log10-transformed p values were drawn in SAS from the observed p values of the marker-trait associations and the p values expected under the assumption of no association between markers and any trait. The association test for the yield trait was performed in TASSEL with a mixed linear model (MLM) incorporating the K and Q matrices.
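For orientation, a mixed linear model that accounts for both population structure (Q) and kinship (K) is commonly written in the general form below. This is the standard Q+K formulation, given here as an assumption about the model class; the exact parameterization fitted in TASSEL for this study is not stated in the text.

```latex
\[
  \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{S}\boldsymbol{\alpha}
             + \mathbf{Q}\mathbf{v} + \mathbf{Z}\mathbf{u} + \mathbf{e},
  \qquad
  \mathbf{u} \sim N\!\left(\mathbf{0},\, 2\mathbf{K}\sigma_g^{2}\right),
  \quad
  \mathbf{e} \sim N\!\left(\mathbf{0},\, \mathbf{I}\sigma_e^{2}\right)
\]
```

Here y is the vector of phenotypes (e.g. yield per panicle), β holds fixed effects other than marker and structure, α is the effect of the tested marker, v the effects of the STRUCTURE subgroups, u the polygenic background effects whose covariance is proportional to the kinship matrix K, and e the residual; X, S, Q, and Z are the corresponding incidence matrices.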


The QQ plot of observed vs. expected p values (Fig. 7) indicated that the MLM incorporating the K and Q matrices was suitable for association analysis of the yield trait. In total, 17 markers in 2008 and 15 markers in 2009 were significantly (P < 0.05) associated with the yield trait (Tables 8 and 9). Of these marker-phenotype associations, 12 in 2008 and 7 in 2009 were supported by previous studies (using either linkage mapping or association mapping). Moreover, two marker-phenotype associations were located at similar positions in both years (RM471 with RM218, and PSM188 with RM235). The genetic variation explained by individual markers ranged from 3.49% to 24.86% in 2008 and from 3.02% to 13.87% in 2009; the markers RM346 and PSM336 each explained more than 15%. It is worth noting that few marker-phenotype associations were common to both years, which suggests that the yield trait is easily influenced by the environment. This problem might be overcome by using yield data from multiple locations and years. The marker-phenotype associations could be further used in rice breeding through marker-assisted selection.

Fig. 7. Plots of observed vs. expected p values for MLM (Q+K) model for the yield trait.
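The coordinates of such a QQ plot can be derived as in the sketch below, shown on the usual −log10 scale. The plotting position (i − 0.5)/n for the expected p values is a common convention and an assumption here, as are the function name and the p values, which are invented for illustration.

```python
from math import log10


def qq_points(p_values):
    """Pair expected and observed -log10(p) values for a QQ plot.

    Under the null hypothesis of no association, p values are uniform on
    (0, 1), so the i-th smallest of n is expected near (i - 0.5) / n.
    """
    n = len(p_values)
    observed = sorted(p_values)
    return [(-log10((i - 0.5) / n), -log10(p))
            for i, p in enumerate(observed, start=1)]


# Hypothetical p values from marker-trait tests, for illustration only.
points = qq_points([0.50, 0.02, 0.90, 0.25, 0.70])
```

If the model controls structure and relatedness adequately, most points lie close to the diagonal, and only true associations depart from it at the upper end.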


§ The Bonferroni threshold (< 0.0036); ¥ supported by previous studies; R² represents the genetic variation explained by the marker.

Table 8. Association mapping results for yield per panicle in 2008 using the MLM.


§ The Bonferroni threshold (< 0.0036); ¥ supported by previous studies; R² represents the genetic variation explained by the marker.

Table 9. Association mapping results for yield per panicle in 2009 using the MLM.
