cotton. The CSIRO-derived sub-clover duplicated stunt7 viral promoter was deregulated in 2008 in a commercial cotton event (T304-40), in which it drives the *Cry1Ab* insect resistance gene, and forms part of Bayer CropScience's TwinLink product. A cotton rubisco small subunit promoter has also been shown to be expressed at high levels in green photosynthetic tissues throughout the development of cotton in the field and is hence a useful promoter for expressing transgenes in leaves [75].

**5. Molecular markers for germplasm characterisation**

Molecular markers are DNA tags or sequence differences (polymorphisms) that make it possible to quickly track specific DNA regions associated with a trait through a segregating population. They are widely used for basic research in plant biology and genetics and are now essential to the CSIRO cotton breeding program for selecting homozygous transgenic lines on a large scale, monitoring the purity of transgenic breeding lines, and stacking traits by marker-assisted selection. Upland cotton has limited intra-specific genetic polymorphism compared to other crops, as revealed by a number of molecular marker and genetic diversity analyses [76-78]. This is due to the relatively recent polyploidisation (estimated 1-2 mya) [79, 80] that created the species, as well as to domestication and selection. The low polymorphism has hindered the application of molecular markers to breeding, because obtaining large numbers of polymorphic markers for detailed mapping required inter-specific crosses of *G. hirsutum* to *G. barbadense*, which are largely unsuitable for cultivar improvement in breeding programs.

Two major technological breakthroughs have occurred in the marker field since ~2010 that have removed many limitations associated with marker technology and its application in cotton.

**1.** Next-generation sequencing (NGS) can sequence millions of times more DNA than previous methods [81]. This has opened up the possibility of discovering larger numbers of single nucleotide polymorphisms (SNPs, single DNA base changes) in cotton. SNPs are the most common form of difference between individuals or cultivars, and should be sufficiently abundant to discriminate between any two *G. hirsutum* lines. NGS has also enabled the completion of the genome sequence of the diploid cotton *Gossypium raimondii* [82], which is related to the D-genome present in Upland cotton. The genome sequence of a diploid A-genome cotton species is likely to be publicly available in the near future, and will make it easy to compare and align short sequence reads from NGS to find large numbers of widely distributed nucleotide differences between cultivars that can be used for mapping and breeding.

**2.** New high-throughput genotyping (HTG) platforms based on SNPs have been developed that can accurately call millions of SNPs in large populations [83, 84]. These technologies enable complete genome coverage of markers between *G. hirsutum* cultivars in a fraction of the time required by conventional marker approaches.

#### **5.1. SNP discovery in germplasm important for Australia**

To ensure that SNPs are informative for the breeding populations being developed in Australia, it is essential to find SNPs among the cotton cultivars that constitute the major germplasm sources of the elite cultivars we are developing. The most straightforward method for identifying SNPs, in the absence of the Upland cotton genome sequence, is to sequence expressed gene transcripts (RNA-seq) by isolating mRNA and converting it into cDNA for sequencing. This method has been used successfully for many other plants such as maize and wheat [85]. RNA-seq targets SNP discovery to genes that are actively transcribed and are therefore more likely to be associated with trait differences. The disadvantage is that expressed sequences are likely to have lower SNP frequencies than non-expressed regions, because they are constrained by the gene's function.

CSIRO RNA-seq data were generated on a set of 18 cultivars selected to represent significant genetic variation present within current Australian commercial cultivars, including old Australian and US cultivars as well as cultivars from China and India. Over 50 million reads (of ~90 bp) were obtained for each Upland cotton sample. We found that varietal SNPs could be identified most confidently when a sub-genome-specific SNP was also found in close proximity to the varietal SNP (Figure 2). This enabled representative sequences from both genomes in each cultivar to be identified and compared. From the 18 cultivars, ~38,000 varietal SNPs were identified. A selected subset of >1,500 of these putative SNPs was analysed using a combination of SNP platforms (GoldenGate and Sequenom), and these SNPs could be validated at a rate of >90% [Zhu, Q-H, pers comm.].
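The proximity criterion described above can be illustrated with a small sketch. It is not the CSIRO pipeline: the function name, the input format (SNP positions on a single reference contig) and the 90 bp window (chosen to mirror the ~90 bp read length) are all assumptions made for illustration.

```python
from bisect import bisect_left

def anchored_varietal_snps(varietal_positions, subgenome_positions, max_distance=90):
    """Return varietal SNP positions that lie within max_distance bp of a
    sub-genome-specific SNP on the same contig (hypothetical helper)."""
    anchors = sorted(subgenome_positions)
    anchored = []
    for pos in sorted(varietal_positions):
        i = bisect_left(anchors, pos)
        # Only the nearest anchor on each side needs to be checked.
        neighbours = anchors[max(i - 1, 0): i + 1]
        if any(abs(pos - anchor) <= max_distance for anchor in neighbours):
            anchored.append(pos)
    return anchored

# Made-up positions: only the SNP at 100 has a sub-genome-specific SNP nearby.
print(anchored_varietal_snps([100, 500, 1000], [150, 1200]))
```

In practice the window would be tied to whether both SNPs can be covered by the same read or read pair, which depends on the library and is not specified in the text.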


**Figure 2.** Stretch of DNA sequence from two cultivars showing the sequences for both the A and D genomes. SNPs with the best validation rates are those where a varietal SNP (in red) is in close proximity to a sub-genome-specific SNP (in blue), which allows determination of which genome the specific short read sequences are derived from.

Although RNA-seq data have allowed us to progress towards effectively using SNPs for genotyping in breeding projects, the protein-coding regions of genomes have a significantly lower level of DNA polymorphism than non-coding regions, so polymorphisms within genes between closely related cultivars will be less frequent and hence less useful. With the availability of the assembled *G. raimondii* genome, and possibly of the *G. arboreum* genome soon, to serve as a framework for short read sequence alignment, our SNP identification will in future be performed using genomic DNA sources.

#### **5.2. International cotton SNP consortium**

SNP Chips enable high-throughput parallel analysis (millions at a time), whereas older markers such as SSR markers tend to be assayed only one or a few at a time. SNP Chips therefore represent a significant improvement over older technologies for large-scale genotyping. Recently an international cotton consortium was formed to create a public 70,000-SNP Illumina Infinium array for cotton. This array was made available for purchase in late 2013 and contains ~50,000 intra-specific *G. hirsutum* SNPs, ~16,000 inter-specific SNPs predominantly from *G. barbadense* but also from *G. tomentosum* and *G. mustelinum*, and small numbers (~4,000) of SNPs from two diploids, *G. longicalyx* and *G. armourianum*. The publicly available SNPs were provided by a number of international groups including CSIRO, Texas A&M, University of California-Davis, Cotton Incorporated, Brigham Young University, United States Department of Agriculture-Agricultural Research Service, Centre de Coopération Internationale en Recherche Agronomique pour le Développement (CIRAD), Council of Scientific and Industrial Research-National Botanical Research Institute (CSIR-NBRI), and Dow AgroSciences. These arrays will provide unprecedented numbers of markers to be screened across cotton germplasm and will likely be the genotyping method of choice for a number of years.

#### **5.3. Genotype-by-sequencing**

SNP Chips represent a major advance in the genotyping of cotton. However, SNP Chip platforms still have limitations. Since cultivated cotton is an allotetraploid, it is often difficult for these platforms to differentiate between the two cotton sub-genomes. In our experience only ~34% of the polymorphic SNP markers act as co-dominant markers (able to discriminate between both homozygous alleles and their heterozygote class, see Figure 3A) on a GoldenGate or Fluidigm platform, whereas the majority act as dominant markers (the heterozygote class cannot be differentiated from one of the homozygote classes, see Figure 3B). Dominant markers reduce the amount of information that can be obtained from an individual, but they are still very useful for mapping (e.g., the commonly used AFLP markers are all dominant markers). In addition, SNP platforms can only interrogate the SNPs that have already been identified and placed on the chip. Once manufactured, SNP Chips are unlikely to be remade for several years, so any newly discovered SNPs cannot be included and assayed.
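A simplified calculation illustrates why genotype classes are harder to separate in an allotetraploid. It assumes, purely for illustration, that an assay measures both homeologous loci equally and that the second sub-genome is fixed for allele A. Neither assumption comes from the platforms named above, and the function name is hypothetical.

```python
from fractions import Fraction

def c_signal_fraction(target_c_alleles, homeolog_c_alleles=0):
    """Fraction of assay signal from allele C when both sub-genome copies
    are measured together (assumption: equal signal from each copy).

    Each homeologous locus contributes two alleles, so four in total.
    """
    return Fraction(target_c_alleles + homeolog_c_alleles, 4)

# Genotypes at the polymorphic locus: AA, AC, CC (0, 1 or 2 copies of C).
subgenome_specific = [Fraction(c, 2) for c in (0, 1, 2)]
both_subgenomes = [c_signal_fraction(c) for c in (0, 1, 2)]

print(subgenome_specific)
print(both_subgenomes)
```

Under these assumptions the three classes sit at C fractions of 0, 1/2 and 1 when only one sub-genome is measured, but at 0, 1/4 and 1/2 when both are measured, so neighbouring clusters lie closer together and a homozygote can resemble a heterozygote.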

We are investigating genotyping using NGS alone, which is called Genotype-by-Sequencing (GBS). GBS uses an amplified subset of the genome from individual lines or plants to identify base differences between them [86, 87]. Using bioinformatic analyses, the SNPs are found on the fly when the sets of sequences are compared, so no prior information about the genotypes is required. The SNPs found are then analysed as separate markers. The advantage of GBS is that SNPs are discovered in the very DNA fragments being compared, so informative SNPs linked to a trait are more likely to be identified. In addition, SNPs from the different sub-genomes can be easily separated based on the presence or absence of specific sub-genome-related SNPs (which are much more common than varietal SNPs), so most markers can be selected to be co-dominant (i.e., heterozygous alleles can be scored). This makes GBS especially suitable for polyploids such as cotton, and it is becoming the genotyping method of choice in, for example, wheat, which has three similar genomes (A, B and D) [86].
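The separation of reads by sub-genome before genotype calling can be sketched as follows. This is a minimal illustration, not a real GBS pipeline: the read representation (a pair of bases), the labels `At` and `Dt`, and the depth and allele-fraction thresholds are all assumptions.

```python
from collections import Counter

def call_genotypes(reads, subgenome_by_base, min_depth=4, min_allele_fraction=0.2):
    """Call a genotype per sub-genome at one varietal SNP.

    reads: iterable of (anchor_base, snp_base), where anchor_base is the base
        observed at a nearby sub-genome-specific SNP on the same read.
    subgenome_by_base: maps each anchor base to a sub-genome label.
    """
    counts = {}
    for anchor_base, snp_base in reads:
        subgenome = subgenome_by_base.get(anchor_base)
        if subgenome is None:
            continue  # read cannot be assigned to a sub-genome
        counts.setdefault(subgenome, Counter())[snp_base] += 1

    calls = {}
    for subgenome, allele_counts in counts.items():
        depth = sum(allele_counts.values())
        if depth < min_depth:
            calls[subgenome] = None  # too few reads for a call
            continue
        alleles = sorted(a for a, n in allele_counts.items()
                         if n / depth >= min_allele_fraction)
        # One supported allele means homozygous; two means heterozygous.
        calls[subgenome] = "/".join(alleles if len(alleles) > 1 else alleles * 2)
    return calls

# Made-up reads: anchor base T marks one sub-genome, G the other.
reads = [("T", "A")] * 5 + [("T", "C")] * 5 + [("G", "A")] * 8
print(call_genotypes(reads, {"T": "At", "G": "Dt"}))
```

Because each sub-genome is scored separately, the heterozygote in this example is called at one homeologous locus without interference from the other, which is the sense in which such markers behave co-dominantly.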

**Figure 3.** Typical *G. hirsutum* varietal SNP profiles using the Sequenom platform. Each spot represents a different cultivar being assessed for the presence of a specific SNP. A) Co-dominant example, a particular polymorphic nucleotide is scored either as homozygous A (Green) or C (Blue) or heterozygous for C and A (Yellow). B) Dominant example, a particular polymorphic nucleotide can only be scored either as homozygous A (Blue) or heterozygous CA (Yellow), homozygous C (blue) cannot be scored due to tetraploid nature [Zhu, Q-H, pers comm.].

#### **5.4. Genetic diversity analysis**

Possessing significant genetic diversity within a breeding program is extremely important for future crop improvement. Where detailed pedigree information is known, breeders can specifically select parents for crossing based on their trait package and degree of relatedness. However, detailed knowledge about the pedigrees of cultivars is often lacking, especially for imported cultivars, and this may hamper their use in cotton improvement. Molecular markers provide an alternative means of determining levels of relatedness and ancestry between cultivars. Many studies have used markers to determine diversity levels within cotton populations [88]; however, most have had too few markers to generate accurate estimates. The first major use of a large public cotton SNP chip and other high-throughput genotyping technologies by CSIRO will be to enable more accurate diversity estimates within breeding populations and seed repositories. Breeders should be able to quickly determine diversity estimates and phylogenies between cultivars, as well as identify shared genetic regions between members of the same pedigree. These data will enable more diverse germplasm, which may contain unique traits, to be effectively integrated into the breeding program.
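One simple way to turn SNP genotypes into pairwise relatedness estimates is an allele-sharing distance. The sketch below is illustrative only and is not necessarily the measure used by CSIRO. The genotype coding (0, 1 or 2 copies of the alternative allele, `None` for missing calls) and the cultivar names are assumptions.

```python
from itertools import combinations

def allele_sharing_distance(g1, g2):
    """Mean per-marker distance between two genotype vectors.

    Genotypes are coded 0, 1 or 2; markers missing in either line are skipped.
    Returns 0.0 for identical lines and 1.0 for opposite homozygotes throughout.
    """
    diffs = [abs(a - b) / 2 for a, b in zip(g1, g2)
             if a is not None and b is not None]
    if not diffs:
        return None  # no markers in common
    return sum(diffs) / len(diffs)

def distance_matrix(genotypes):
    """Pairwise distances for a dict of {cultivar_name: genotype_vector}."""
    return {(a, b): allele_sharing_distance(genotypes[a], genotypes[b])
            for a, b in combinations(sorted(genotypes), 2)}

# Made-up genotypes for three hypothetical cultivars at four SNPs.
example = {
    "cv1": [0, 0, 2, 2],
    "cv2": [0, 0, 2, 0],
    "cv3": [2, 2, 0, None],
}
for pair, distance in distance_matrix(example).items():
    print(pair, distance)
```

A matrix of this kind can feed standard clustering or tree-building methods. With only four markers the estimates are obviously crude, which mirrors the point above about insufficient marker numbers.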

#### **5.5. Linking SNPs to traits**

With improvements in SNP identification and the availability of a large-scale public cotton SNP chip, the availability of molecular markers will no longer be a limiting issue in cotton research. The main task will be linking markers to important traits to aid the introgression of those traits into elite varieties. Most of our initial focus will be on linking SNPs to disease resistance traits, as increased disease resistance has been responsible for a significant amount of our variety-based yield improvements. Common methods for linking markers to traits involve the creation of bi-parental crosses that segregate for the trait of interest, followed by selfing, genotyping and trait assessment in the F2 or F3 populations. Where possible, immortal genetic populations, in which each line is inbred and can be propagated simply by self-pollination, such as recombinant inbred populations or inbred backcross (IBC) populations, are used because phenotyping is much simpler (only homozygote classes are assessed) and repeatable. These immortal lines have been very helpful when assaying for disease resistance (Verticillium wilt, Fusarium wilt and black root rot), as disease phenotypes are often environmentally affected and difficult to score. The new high-throughput marker technology should greatly accelerate these traditional genetic studies in cotton. However, such populations require significant amounts of time and labour to create. With the availability of high-throughput SNP genotyping between *G. hirsutum* cultivars, other approaches consistent with conventional breeding strategies can now be routinely performed. One such method involves locating chromosomal segment substitutions derived by repeated backcrossing of the trait into an elite cultivar.

Backcrossing is used by breeders to transfer a limited number of desirable traits from one parent (possibly agronomically poor) to an elite parent. It involves multiple rounds of crossing back to a recurrent elite line, with each backcross reducing the amount of donor DNA present in the offspring. This methodology can be used to find markers linked to the trait(s) after as few as 3 cycles of backcrossing and then selfing, because individuals with the desired trait can now be compared using whole-genome SNP analysis to reveal which genomic regions are derived from the donor parent. The markers in the donor regions can be confirmed as being linked to the trait in a further cycle of backcrossing, selfing and selection. Once the marker linkage has been verified, marker-assisted backcrossing can eliminate the need for further trait phenotyping until the desired elite cultivar is obtained. We are using this system to find linked markers for disease and fibre traits and for introgression of traits from other cotton species.
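The reduction in donor DNA with each backcross, and the search for donor-derived regions, can be sketched as below. The expected donor fraction of (1/2)^(n+1) after n backcrosses is the standard expectation for regions not under selection. The marker format and the `donor`/`elite` labels are hypothetical, and a real analysis would also need to handle heterozygous and missing calls.

```python
from itertools import groupby

def expected_donor_fraction(n_backcrosses):
    """Expected donor genome fraction after n backcrosses (n = 0 is the F1)."""
    return 0.5 ** (n_backcrosses + 1)

def donor_segments(markers):
    """Find runs of consecutive donor-derived markers.

    markers: list of (chromosome, position, origin) sorted by chromosome and
        position, where origin is "donor" or "elite".
    Returns (chromosome, start, end, marker_count) for each donor run.
    """
    segments = []
    for (chrom, origin), run in groupby(markers, key=lambda m: (m[0], m[2])):
        if origin == "donor":
            run = list(run)
            segments.append((chrom, run[0][1], run[-1][1], len(run)))
    return segments

# Made-up marker calls for one backcross-derived individual.
calls = [
    ("A01", 100, "elite"), ("A01", 200, "donor"), ("A01", 300, "donor"),
    ("A01", 400, "elite"), ("D05", 50, "donor"),
]
print(expected_donor_fraction(3))
print(donor_segments(calls))
```

Donor segments shared by all individuals that carry the trait are the candidate regions, and counting donor markers per individual gives a simple way to rank plants by how much donor DNA they retain.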

**5.6. Genome wide association mapping**

cotton germplasm cultivars.

**5.7. Mining diversity in other cotton species**

Upland cotton by genetic modification.

Large scale genotyping of cotton at an economical price enables genotyping large diverse collections of *G. hirsutum* lines. To link SNP markers to important traits in cotton the most manageable route is to perform genome wide association studies on elite cultivars and their pedigrees, avoiding the time and energy required in the creation of specialised genetic populations examining small numbers of traits. Association genetic analysis is a method for linking specific markers with phenotypes using established populations of individuals, and has been extensively used in human/animal genetic analyses where defined genetic crosses are often difficult to generate [89]. The extent of genome-wide allelic association (linkage disequilibrium: LD) is the key starting point for association mapping. The extent of LD has been quantified and association mapping has been successfully applied for many plant species (e.g., [90]) including cotton [91, 92]. The application of LD-based association mapping for cotton will facilitate comprehensive utilization of existing genetic diversity conserved within

Australian Cotton Germplasm Resources http://dx.doi.org/10.5772/58414 21

CSIRO is interested in using association analysis on a large number of well defined cotton varieties to find markers linked to disease resistance, fibre quality and yield component traits. The bottleneck for this type of research is the difficulty in adequately phenotyping a large number of cultivars for agronomically important traits. To obtain an accurate measurement of the most important trait, yield, requires significant replication both within a field, across geographical regions and over multiple seasons. This restricts the numbers of varieties that can be tested lowering the power of the analyses. It is also possible to model genetic/environ‐ mental (GxE) interactions for specific traits using association analysis, but that requires a magnitude larger level of replication under different management conditions, as well as significant knowledge about the environmental conditions in which the traits were measured

(and therefore is unlikely to be undertaken by our research teams in the near future).

The cotton genus (*Gossypium*) contains ~50 species that are widely geographically distributed from arid to tropical regions and morphologically diverse ranging from herbaceous perennials to small trees. These species are usually divided into three gene pools with regard to their use for genetic improvement of Upland cotton though only a few sources have been extensively used. The primary gene pool consists of *G. barbadense, G. tomentosum*, *G. mustelinum* and *G. darwinii* which are tetraploid species with the same A and D genome complement that are sexually compatible with Upland cotton. The secondary gene pool is represented by diploids with A, D, B or F genomes that require synthetic tetraploid formation or a synthetic hexaploid bridging species for traits to be introgressed into Upland cotton [39, 44]. The tertiary gene pool consists of other diploid species with a completely different genome type such as C, E, G or K that show relatively poor or no recombination with the A or D-genome and thus traits from these species are likely to require isolation of the causal genes and transfer into cultivated

Once a representative A-genome diploid sequence is completed, and its information added to the already published D genome, it will be possible to use this information as a scaffold to

The ability to scan the whole genome quickly and identify regions from the donor plant makes this strategy practical in cotton and is limited only by reliability and robustness of the trait and the time it takes to cycle through multiple generations. An added advantage of this method to find markers is that the germplasm produced at the end of the process is already partly introgressed into an elite cultivar and can be a parent for more advanced crosses with other elite material. A SNP Chip can also be used to identify backcrossed individuals with the lowest levels of donor DNA at each backcrossing step, reducing the number of backcrosses required to produce an elite cultivar with the trait of interest. This is especially useful to separate the trait of interest from regions nearby that may carry genes that result in poor yield (linkage drag), such as Okra leaf trait.

#### **5.6. Genome wide association mapping**

**5.5. Linking SNPs to traits**

20 World Cotton Germplasm Resources

species.

drag), such as Okra leaf trait.

With improvements in SNP identification and the availability of large scale public cotton SNP chip, the availability of molecular makers will no longer be a limiting issue in cotton research and the main task will be linking markers to important traits that will aid in their introgression into elite varieties. Most of our initial focus will be to link SNPs to disease resistance traits, as increased disease resistance has been responsible for a significant amount of our variety based yield improvements. Common methods for linking markers to traits involve the creation of bi-parental crosses that segregate for the trait of interest, followed by selfing, genotyping and trait assessment in the F2 or F3 populations. Where possible immortal genetic populations, in which each line is inbred and can be propagated simply by self-pollination, such as recombi‐ nant inbred populations or inbred backcross (IBC) populations are used as phenotyping is much simpler (only looking at homozygote classes) and repeatable. These immortals lines have been very helpful when assaying for disease resistance (Verticillium wilt, Fusarium wilt and black root rot) as disease phenotypes are often environmentally affected and difficult to score. The new high-throughput marker technology should greatly accelerate these traditional genetic studies in cotton. However, such populations require significant amounts of time and labor invested in their creation. With the availability of high throughput SNP genotyping between *G. hirsutum* cultivars, other approaches consistent with conventional breeding strategies can now be routinely performed. One such method involves locating chromosomal segment substitutions derived by repeated backcrossing of the trait into an elite cultivar. 
Backcrossing is used by breeders to transfer a limited number of desirable traits from one parent (possibly poor agronomically) to an elite parent, and involves multiple rounds of crossing back to a recurring elite line; with each backcross reducing the amount of donor DNA present in the offspring. This methodology can be used to find markers linked to the trait(s) after as little as 3 cycles of backcrossing and then selfing, as individuals with the desired trait can now be compared using whole genome SNP analysis to reveal which genomic regions are associated with the donor parent. The markers in the donor regions can be confirmed as being linked to the trait in a further cycle of backcrossing, selfing and selection. Once the marker linkage has been verified, marker-assisted backcrossing can then eliminate the need for further trait phenotyping until the desired elite cultivar is obtained. We are using this system to find linked markers for disease and fibre traits and for introgression of traits from other cotton

The ability to scan the whole genome quickly and identify regions from the donor plant makes this strategy practical in cotton; it is limited only by the reliability and robustness of the trait and the time it takes to cycle through multiple generations. An added advantage of this method of finding markers is that the germplasm produced at the end of the process is already partly introgressed into an elite cultivar and can be a parent for more advanced crosses with other elite material. A SNP chip can also be used to identify backcrossed individuals with the lowest levels of donor DNA at each backcrossing step, reducing the number of backcrosses required to produce an elite cultivar with the trait of interest. This is especially useful for separating the trait of interest from nearby regions that may carry genes that result in poor yield (linkage drag).
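The dilution of donor DNA during backcrossing can be made concrete with a small calculation. The sketch below assumes no selection and unlinked loci, so that the expected donor fraction is 0.5 in the F1 and halves with each backcross; it then ranks hypothetical backcross plants by the share of SNPs that still carry a donor allele, which is the kind of comparison a SNP chip allows. The function names and the toy genotype strings are illustrative assumptions, not a description of the CSIRO pipeline.

```python
from __future__ import annotations


def expected_donor_fraction(n_backcrosses: int) -> float:
    """Expected donor genome fraction after n backcrosses to the recurrent parent.

    Assumes no selection: the F1 (n = 0) is 50% donor and each backcross halves it.
    """
    if n_backcrosses < 0:
        raise ValueError("n_backcrosses must be non-negative")
    return 0.5 ** (n_backcrosses + 1)


def donor_share(genotypes: str) -> float:
    """Share of SNPs that still carry a donor allele in one backcross plant.

    Each character is one SNP call (hypothetical coding):
    'R' = homozygous for the recurrent parent allele, 'H' = heterozygous.
    """
    if not genotypes:
        raise ValueError("no SNP calls supplied")
    return genotypes.count("H") / len(genotypes)


def rank_by_lowest_donor(plants: dict[str, str]) -> list[str]:
    """Order plant IDs from least to most residual donor DNA."""
    return sorted(plants, key=lambda plant_id: donor_share(plants[plant_id]))


if __name__ == "__main__":
    for n in range(1, 4):
        print(f"BC{n}: expected donor fraction {expected_donor_fraction(n):.4f}")

    # Toy SNP calls for three plants from the same backcross generation
    toy_plants = {
        "plant_a": "RRHHRRRH",
        "plant_b": "RRRRRRHR",
        "plant_c": "HHHRRHRR",
    }
    print(rank_by_lowest_donor(toy_plants))
```

In practice the ranking would be restricted to plants that also carry the donor segment linked to the trait, so that background donor DNA is reduced without losing the target region.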

Large scale genotyping of cotton at an economical price enables genotyping large diverse collections of *G. hirsutum* lines. To link SNP markers to important traits in cotton the most manageable route is to perform genome wide association studies on elite cultivars and their pedigrees, avoiding the time and energy required in the creation of specialised genetic populations examining small numbers of traits. Association genetic analysis is a method for linking specific markers with phenotypes using established populations of individuals, and has been extensively used in human/animal genetic analyses where defined genetic crosses are often difficult to generate [89]. The extent of genome-wide allelic association (linkage disequilibrium: LD) is the key starting point for association mapping. The extent of LD has been quantified and association mapping has been successfully applied for many plant species (e.g., [90]) including cotton [91, 92]. The application of LD-based association mapping for cotton will facilitate comprehensive utilization of existing genetic diversity conserved within cotton germplasm cultivars.
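Because the extent of LD is the starting point for association mapping, it may help to see how it is commonly quantified. The sketch below computes the standard pairwise r² statistic between two biallelic SNPs from phased haplotypes coded 0/1. The function name and toy haplotypes are assumptions for illustration; real analyses would use dedicated software and would also need to handle unphased genotypes and population structure.

```python
from __future__ import annotations


def ld_r_squared(haplotypes: list[tuple[int, int]]) -> float:
    """Pairwise linkage disequilibrium (r^2) between two biallelic SNPs.

    haplotypes: one (allele_at_snp1, allele_at_snp2) pair per phased
    haplotype, with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")

    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")

    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    return d * d / denominator


if __name__ == "__main__":
    linked = [(1, 1), (1, 1), (0, 0), (0, 0)]        # alleles always co-occur
    unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]      # alleles combine at random
    print(ld_r_squared(linked), ld_r_squared(unlinked))
```

Markers in high r² with a causal variant can act as proxies for it, which is why the rate at which LD decays along the chromosome determines how many markers an association study needs.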

CSIRO is interested in using association analysis on a large number of well-defined cotton varieties to find markers linked to disease resistance, fibre quality and yield component traits. The bottleneck for this type of research is the difficulty of adequately phenotyping a large number of cultivars for agronomically important traits. Obtaining an accurate measurement of the most important trait, yield, requires significant replication within a field, across geographical regions and over multiple seasons. This restricts the number of varieties that can be tested, lowering the power of the analyses. It is also possible to model genetic/environmental (GxE) interactions for specific traits using association analysis, but that requires an order of magnitude more replication under different management conditions, as well as significant knowledge about the environmental conditions in which the traits were measured (and is therefore unlikely to be undertaken by our research teams in the near future).

#### **5.7. Mining diversity in other cotton species**

The cotton genus (*Gossypium*) contains ~50 species that are widely geographically distributed from arid to tropical regions and morphologically diverse ranging from herbaceous perennials to small trees. These species are usually divided into three gene pools with regard to their use for genetic improvement of Upland cotton though only a few sources have been extensively used. The primary gene pool consists of *G. barbadense, G. tomentosum*, *G. mustelinum* and *G. darwinii* which are tetraploid species with the same A and D genome complement that are sexually compatible with Upland cotton. The secondary gene pool is represented by diploids with A, D, B or F genomes that require synthetic tetraploid formation or a synthetic hexaploid bridging species for traits to be introgressed into Upland cotton [39, 44]. The tertiary gene pool consists of other diploid species with a completely different genome type such as C, E, G or K that show relatively poor or no recombination with the A or D-genome and thus traits from these species are likely to require isolation of the causal genes and transfer into cultivated Upland cotton by genetic modification.

Once a representative A-genome diploid sequence is completed, and its information added to the already published D genome, it will be possible to use this information as a scaffold to determine the gene coding regions for different *Gossypium* species. The coding sequences from these genomes will identify new diversity within genes already associated with disease and stress that could be targets for introgression. Traditionally the work required to introgress traits from anything other than *G. hirsutum* or *G. barbadense* has been prohibitive. However the ability to identify important genes via re-sequencing of other *Gossypium* species means that markers for these traits can be found rapidly and used in marker assisted backcrossing strategies as mentioned above. Therefore, in future we will be able to access much more diverse germplasm for unique traits with potential targets for introgression including glandless seeds (in a glanded plant) and Fusarium wilt resistance that are present in indigenous Australian *Gossypium* species [47, 48].


Australian Cotton Germplasm Resources http://dx.doi.org/10.5772/58414 23


**6. Success stories of germplasm utilization in Australian cotton improvement**

#### **6.1. Disease resistance**

As mentioned previously, bacterial blight was the most important disease in Australian cotton when the modern industry first started, with estimated yield losses of up to 20%. Bacterial blight resistance was introduced to the CSIRO breeding program from Tamcot SP37, which supplied the B2, B3 and B7 genes [97]. The first blight resistant cultivar was released in 1985 and had rapid adoption. Importantly, this cultivar also had high yield, high gin turnout, wide adaptation and the okra leaf trait. This cultivar was selected from a large breeding population that

#### **5.8. Exploiting mutant collections of cotton**

As previously mentioned, induced mutations can produce novel variation that can be directly used for crop improvement and can aid in identifying which genes are important for specific traits. Now that a genome sequence is available for cotton, the major basic research challenge is to uncover the genes responsible for major agronomic traits from the more than 50,000 genes present in this species. The advent of NGS technology and the completed *G. raimondii* genome sequence have opened up practical ways of achieving this via a reverse genetics approach, in which a mutation in a specific gene sequence is first identified in a plant by sequencing, and the phenotype of this plant is then compared with that of the non-mutated (wild-type) plant to identify the physiological, morphological or agronomic consequences of the disruption of that gene and hence its possible function. Traditionally, mutated plants with interesting phenotypes are selected from large populations of mutagenised plants, and researchers work back to find where the mutation that caused the novel trait is located within the genome (the forward genetics approach). However, the space and land required to select for interesting mutants is substantial for a large plant like cotton, and the tetraploid nature of Upland cotton potentially makes finding valuable mutants more difficult (as for some traits both the A and D genome genes would need to be simultaneously mutated to result in a phenotype). The reverse genetics approach is to first obtain a large collection of plants carrying chemically or radiation induced mutations spread throughout the genome, which are then identified through NGS. Since NGS can rapidly re-sequence a genome as long as a reference genome exists, sequencing the randomly mutated plants will identify single base changes in gene sequences caused by chemical mutagenesis, similar to how SNPs are already identified between cultivars. Using specialised DNA capture technologies that isolate only DNA from expressed regions of the genome to reduce the sequence complexity [93, 94], this procedure would be cost effective. A large library of plants harbouring these known mutations would be made by subjecting cotton seeds to induced mutagenesis, growing the resulting plants (M1 generation) and allowing them to self-pollinate and set seed. The M2 seeds would then be planted, leaf material collected, the gene sequences of these plants sequenced, and the multiple mutations in each plant identified. By collecting seeds from the M2 plants, a database of the sequence changes would be built and associated with specific seed lots, so that mutations causing loss of function of any selected gene could be identified. The M3 seeds from that plant would be planted and, using markers specific for the mutation (a type of SNP marker), plants homozygous for the mutation would be selected [95]. The phenotype of these plants would then be compared to homozygous wild-type plants at this location from the same seed lot.
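The bookkeeping described above, in which sequence changes are tied to seed lots so that mutations in any chosen gene can be looked up, could be organised along the following lines. This is only a sketch: the record fields, effect labels, gene identifiers and seed-lot names are hypothetical and do not describe an existing database.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    seed_lot: str   # seed lot harvested from one sequenced M2 plant
    gene: str       # hypothetical gene identifier
    position: int   # 1-based position within the gene
    ref: str        # reference base
    alt: str        # mutated base
    effect: str     # e.g. "stop_gained", "missense", "synonymous"


# Effect labels treated as likely loss of function (an assumption of this sketch)
LOSS_OF_FUNCTION = {"stop_gained", "splice_site"}


class MutationIndex:
    """In-memory index from gene to the mutations found in that gene."""

    def __init__(self):
        self._by_gene = defaultdict(list)

    def add(self, mutation: Mutation) -> None:
        self._by_gene[mutation.gene].append(mutation)

    def seed_lots_with_knockout(self, gene: str) -> list:
        """Seed lots carrying a likely loss-of-function change in `gene`."""
        hits = self._by_gene.get(gene, [])
        return sorted({m.seed_lot for m in hits if m.effect in LOSS_OF_FUNCTION})


if __name__ == "__main__":
    index = MutationIndex()
    index.add(Mutation("lot_0012", "gene_A1", 421, "G", "A", "stop_gained"))
    index.add(Mutation("lot_0012", "gene_B7", 88, "C", "T", "synonymous"))
    index.add(Mutation("lot_0340", "gene_A1", 1015, "C", "T", "missense"))
    print(index.seed_lots_with_knockout("gene_A1"))
```

Indexing by gene rather than by plant reflects the reverse genetics workflow: the query starts from a gene of interest and returns the seed lots to plant out for marker-based selection of homozygous mutants.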

The availability of mutant lines in most or all of the cotton genes would provide the cotton community with the means of testing gene function in a large number of genes at an unprecedented rate. Lines with mutations in a particular gene would need to be found in both subgenomes, then crossed, and progeny selected that were mutated for both copies of the gene in order to categorically define its function. Once a gene has been confirmed as being associated with an important trait, it could be sequenced from many different cotton cultivars and related species to identify variants that may possess superior qualities for targeted introgression. Currently, large mutagenic populations of *G. arboreum* and *G. hirsutum* are being established at CSIRO in order to locate genes involved in fibre formation and disease and insect resistance.

#### **5.9. Seed genotyping**


22 World Cotton Germplasm Resources


NGS and high-throughput genotyping will enable large numbers of specific traits to be linked to specific markers, thus increasing the importance of marker-assisted selection in the breeding of cotton. However, stacking even a moderate number of traits quickly becomes impractical in the field, where extremely large populations would need to be grown to identify plants carrying the correct combination of markers. To overcome this problem, a number of companies have developed non-destructive seed-based methodologies to screen markers. 'Seed chippers' [96] use complicated robotics and liquid handling equipment to remove small portions of the embryo, extract DNA and perform genotyping. This technology was first demonstrated in maize and then soybean, and is beginning to be adapted to work in cotton. Seed-based screening will play an essential role in future transgenic trait stacking and marker-assisted breeding of disease and fibre traits in cotton, as it will allow the very small proportion of plants with the favorable allele combination to be identified. Selected seeds that contain the correct marker combinations will be planted in the field and selections for fibre and yield traits performed normally, with the knowledge that other desired traits are already present.
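The scale of the stacking problem is easy to estimate. Assuming, purely for illustration, an F2 population segregating at k unlinked loci with the favorable allele required in the homozygous state at each, the expected frequency of the target genotype is (1/4)^k, and the number of seeds to screen for a given chance of recovering at least one such plant follows from the binomial distribution. The function names, the F2 model and the 95% threshold are assumptions of this sketch.

```python
import math


def target_frequency(n_loci: int, per_locus: float = 0.25) -> float:
    """Expected frequency of plants with the desired genotype at every locus.

    Assumes independent (unlinked) loci; 0.25 is the F2 frequency of one
    homozygous class at a single locus.
    """
    if n_loci < 1:
        raise ValueError("n_loci must be at least 1")
    if not 0 < per_locus <= 1:
        raise ValueError("per_locus must be in (0, 1]")
    return per_locus ** n_loci


def seeds_to_screen(n_loci: int, confidence: float = 0.95, per_locus: float = 0.25) -> int:
    """Smallest number of seeds giving P(at least one target plant) >= confidence."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    p = target_frequency(n_loci, per_locus)
    if p >= 1:
        return 1
    # Solve 1 - (1 - p)^n >= confidence for n
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))


if __name__ == "__main__":
    for k in (1, 2, 3, 5, 8):
        print(f"{k} loci: frequency {target_frequency(k):.6f}, "
              f"seeds to screen {seeds_to_screen(k)}")
```

Because the required population grows geometrically with the number of loci, genotyping seed before planting shifts that burden from field plots to the laboratory.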
