**Integration of Next-generation Sequencing Technologies with Comparative Genomics in Cereals**

Thandeka N. Sikhakhane, Sandiswa Figlan, Learnmore Mwadzingeni, Rodomiro Ortiz and Toi J. Tsilo

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/61763

#### **Abstract**

Cereals are the major sources of calories worldwide. Their production should be high to achieve food security, despite the projected increase in global population. Genomics research may enhance cereal productivity. Genomics immensely benefits from robust next-generation sequencing (NGS) techniques, which produce vast amounts of sequence data in a time- and cost-efficient way. Research has demonstrated that gene sequences among closely related species that share common ancestry have remained well conserved over millions of years of evolution. Comparative genomics allows for comparison of genome sequences across different species, with the implication that genomes with large sizes can be investigated using closely related species with smaller genomes. This offers prospects of studying genes in a single species and, in turn, gaining information on their functions in other related species. Comparative genomics is expected to provide invaluable information on the control of gene function in complex cereal genomes, and also in designing molecular markers across related species. This chapter discusses advances in sequencing technologies, their application in cereal genomics and their potential contribution to the understanding of the relationships between the different cereal genomes and their phenotypes.

**Keywords:** Bioinformatics, Cereals, Comparative genomics, Next-generation sequencing, Synteny

#### **1. Introduction**

Significant limitations to cereal crop production and productivity pose a threat to global food security since these crops are the main sources of calories that support the ever-growing human population. Despite the significant progress that has been made in the improvement of edible yield through classical breeding techniques, the current rates of increase in grain yield in several major cereal crops are still too slow to keep up with the increasing demand of the growing population [1, 2]. This is likely to get worse under the projected climate change scenarios [3], as climate change also affects biotic stresses such as pests, diseases and weeds, and abiotic stresses including drought, extreme temperatures, salinity and nutrient deficiencies [4-6]. Although there are various strategies to cope with these constraints, Kole [7] suggested the use of genomics-assisted breeding as an effective and economic strategy.

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Although breeding resilient crops is a sustainable strategy, there are still several genomic constraints to genome-based selection and stress resistance improvement, particularly for multigenic traits. A poor understanding of the genetic basis and the regulatory mechanisms of responses to various stresses is among the major challenges for successful genetic manipulation through gene introgression, gene pyramiding, gene stacking or gene silencing. Additionally, more diagnostic genetic markers are necessary to improve the current limited success in marker application in both foreground and background selection. These challenges are related to the fact that genomes of some cereal crops are not yet fully sequenced and annotated, either because the crops have been under-researched or because the genomes are huge and structurally complex. For instance, the hexaploid wheat (*Triticum aestivum*) genome is the largest (about 17 billion nucleotides) among cultivated cereals, and is complicated by repetitive DNA sequences [8]. Furthermore, dissection of the genetic and regulatory mechanisms of host plant resistance is complicated because most traits of interest are multigenic and thus influenced by several genes with additive and nonadditive gene effects. Hence, tools that detect genetic variation at the genome sequence level allow all genes controlling particular traits to be investigated for various genetic applications to realize phenotypic gains from genetic manipulation.

Enhanced application of next-generation sequencing (NGS) techniques in cereal crops is revolutionizing and speeding up plant breeding. The advances that have been made so far in the use of NGS, particularly with the human genome in the field of medicine, and on various model crops through plant biotechnology, point to the following prospects in cereals and other crops. First, complete sequencing of small and less complex plant genomes is increasingly becoming possible as costs have dropped significantly and more sequences are being generated in a shorter time than before. Secondly, the genetic mechanisms of particular traits in huge and complex plant genomes can now be investigated using small and less complex genomes of related plants sharing conserved regions through comparative genomics. This will potentially identify genes or quantitative trait loci (QTL) and putative single nucleotide polymorphism (SNP) markers for genome-wide association mapping and annotation of genomes. This chapter discusses the advances made in improving sequencing technologies and how these advances can assist in generating complete sequences for the improvement of genome-aided selection. This will also assist in identifying the unique sequences responsible for the major differences existing among cereals.

#### **2. The need for high-throughput genome and transcriptome sequencing**

Since the discovery of the DNA molecule by Friedrich Miescher in 1869 [9], and the subsequent exposition of its double-helical structure by Watson and Crick in 1953, significant knowledge has been gained on the flow of genetic information. Understanding how this genetic information influences the phenotype (trait) of interest has, however, remained a challenge. This is mainly because the overall instruction contributing to the phenotype is not restricted to the coding region but is also influenced by some posttranscriptional modifications controlled by noncoding DNA [10-12]. Also, multigenic traits are influenced by complex interactions of alleles at different loci, having major or minor influence [13]. These, together with differential genotype-by-environment interactions, add to the structural and functional complexity of most cereal genomes, which are complicated by repetitive DNA sequences, transposable elements and polyploidy, as in the case of wheat and finger millet (*Eleusine coracana*) [8, 14]. Whole genome and transcriptome sequencing therefore becomes a necessity so that all the genomic and transcriptomic variation can be detected. NGS and various 'omic' technologies, including genomics, transcriptomics, proteomics, metabolomics and phenomics, offer prospects towards whole-genome annotation, particularly in cereals that have small and less complex genomes. This will simplify comparative genomics and evolutionary genetic research, which will enhance the manipulation and exploitation of important genes for cereal improvement.

NGS technologies are among the available tools that can produce complete sequences for diverse research at the DNA and RNA level within and across species. Firstly, this will make it easy to obtain the entire DNA sequence, including coding and noncoding regions. Secondly, this will simplify studies on the whole transcriptome, including RNAs involved in protein synthesis such as the messenger, ribosomal, signal recognition particle, transfer and transfer-messenger RNAs and other RNAs involved in posttranscriptional modifications, such as small RNAs [15]. Quantification of such transcripts through NGS under various stress conditions will make it possible to determine the levels of gene expression within and across different species more precisely.
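To make the idea of transcript quantification under stress more concrete, the following is a minimal sketch of how read counts from two conditions could be compared. It assumes a simple counts-per-million normalization and a log2 fold change with a pseudocount; the gene names, the counts and the function names are hypothetical illustrations, not data or methods from the studies cited here, and real analyses would use replicated samples and dedicated statistical tools.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each library sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(control, stressed, pseudocount=1.0):
    """Log2 ratio of normalized expression (stressed versus control).

    The pseudocount (an assumption of this sketch) avoids division by
    zero for genes with no reads in one condition.
    """
    cpm_control = counts_per_million(control)
    cpm_stressed = counts_per_million(stressed)
    return {
        gene: math.log2(
            (cpm_stressed[gene] + pseudocount) / (cpm_control[gene] + pseudocount)
        )
        for gene in control
    }


# Hypothetical read counts for two genes in control and drought-stressed plants.
control = {"geneA": 500, "geneB": 500}
drought = {"geneA": 1500, "geneB": 500}

lfc = log2_fold_change(control, drought)
for gene, value in sorted(lfc.items()):
    print(f"{gene}: log2 fold change = {value:+.2f}")
```

Note that in this toy example geneB appears down-regulated only because geneA takes a larger share of the reads in the stressed library, which is one reason why normalization choices matter when comparing expression across conditions or species.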

#### **3. Advances in sequencing technologies**


30 Plant Genomics


Since the pioneering of genome sequencing through technologies such as Sanger sequencing [16], significant advances have been made to resolve the limitations of the early technologies. This has seen the development of more sophisticated sequencing technologies that allow *de novo* genome sequencing, generating vast amounts of data in a short period at low cost. Table 1 summarizes the advances made in sequencing technology development, from the advent of chain termination sequencing [16], to prominent NGS technologies including Roche/454 sequencing [17], Illumina (Solexa) sequencing [18], sequencing by oligonucleotide ligation and detection (SOLiD) [19], the single-molecule sequencing pioneered by Helicos Biosciences [20] and Ion Torrent sequencing [21]. These technological advances are instrumental in whole-genome research and are expected to simplify comparative genomics within species and across distantly related cereals and grasses. Several modifications are available for each of these technologies and fine-tuned protocols are constantly being developed to address some of the current limitations.

Although NGS technologies have enormous prospective benefits, they come with their own limitations that need to be addressed to realize their full potential. Key among these drawbacks are the bioinformatic and computational challenges related to storage, image analysis, base calling and integration of the large amounts of data, which are generated at a rate of several terabytes per day. As a result, much of the sequence data being generated on a daily basis in cereal genomics cannot yet be transformed into information that is useful for the detection of important genomic variants within and among species or for identifying genes that are differentially expressed under particular stress conditions. Hence, investment in computational and high-throughput bioinformatic equipment and human resources, together with combining the various NGS technologies, will allow the data generated using different NGS techniques by various laboratories to be related and to build on each other. Unlike traditional marker technologies, NGS is currently dissociated from phenomics, yet it should be complementary to high-throughput phenotyping in order to relate sequence variations to traits of interest for progressive discoveries through genome-wide association mapping, particularly for multigenic traits like adaptation to drought in complex cereal genomes [22]. Additionally, NGS technologies are still associated with high error rates [23] and short read lengths that limit the accuracy of data analysis. This further complicates the detection and distinction of sequence variations, including the large numbers of duplications, deletions, inversions and chromosomal rearrangements that characterize cereal genomes.



**Table 1.** Evolution of next-generation sequencing technologies.

| Year | Technology (developer) | Sequencing chemistry | Read length | Throughput | Major limitations | References |
|---|---|---|---|---|---|---|
| 1977 | Sanger sequencing (Frederick Sanger and team) | Involves DNA polymerase-based selective amplicon termination of in vitro DNA replication by radioactively or fluorescently labeled di-deoxynucleotide triphosphates, followed by electrophoresis and UV or X-ray spectra detection of DNA sequences. | Up to 1000 bp | Up to 84 kb per run of about 3 hr | Since the technique relies on cloning vectors, there is potential for a mix-up of the target sequences with some DNA portions from the cloning vector. Additionally, it requires a lot of labor and space since multiplexing is not possible. | [16] |
| 2004 | Roche/454 (Life Sciences) | First NGS technique. This is a sequencing by synthesis (SBS) technique where DNA fragments attached to adapters annealed to beads are PCR amplified using adapter-specific primers. Addition of each dNTP is associated with the release of a pyrophosphate, which is converted to ATP energy used to produce an optical signal (light). The light allows reading of the beads to which the dNTP is added, hence deducing the sequence (pyrosequencing). | Up to 1,000 bp | 700 Mb of sequence data per 23 hr run | | [17] |


#### **4. Application of next-generation sequencing in cereal biotechnology**

Among the major cereals, the relatively small rice (*Oryza sativa*) genome (∼389 Mb) has long been fully sequenced by the International Rice Genome Sequencing Project [24]. Kawahara et al. [25] recently demonstrated, however, the robustness of NGS technologies by revising the rice genome using the Illumina and Roche 454 pyrosequencing platforms. Their study noted some errors in the initial assembly. This research provides evidence that high-quality and validated reference genomes can be produced for most cereals through resequencing using NGS technologies. Also, a recent genome-wide study of the hexaploid wheat genome (∼17 Gb) using the Roche/454 pyrosequencing technology revealed the capacity of NGS technologies to resequence huge and complex genomes and to identify SNPs for dissection of quantitative traits [26]. Similarly, Illumina sequencing was recently used to quantify the transposable element (TE) content in the complex maize (*Zea mays*) genome (∼2.3 Gb) [27] and to estimate the potential contribution of TEs to the genome size differences between the cultivated species and its close relative, *Zea luxurians* [28]. The latter study also reported high proportions of conserved TE families between the two species, revealing the potential of NGS technologies to enhance evolutionary and comparative genomic studies. Other major cereals whose genomes have been sequenced and are expected to further benefit from NGS technologies include barley (*Hordeum vulgare*) (∼5.1 Gb) [29] and sorghum (*Sorghum bicolor*) (∼730 Mb) [30].

Minor and under-researched cereals such as the allotetraploid finger millet (*Eleusine coracana*), which has a genome size of about 1.76 Gb [31], the diploid pearl millet (*Pennisetum glaucum*), with a genome size of about 4.6 Gb [32], and tef (*Eragrostis tef*), with a 714 to 733 Mb genome [33], have not received much benefit from NGS technologies. However, these crops are expected to benefit from the African Orphan Crops Consortium, which has the mandate to use the latest scientific equipment and techniques to sequence, assemble and annotate genomes of under-researched crops [34]. These minor crops are renowned for their adaptation to various biotic and abiotic stresses, particularly drought. Thus, sequencing or resequencing their genomes will potentially expose huge amounts of relevant genetic information for cereal improvement. NGS technologies will have great application in comparing genomic features of cereal crops through comparative genomic research.
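The contrast in genome sizes quoted in this section can be put on a common scale with a few lines of arithmetic. The sketch below expresses each approximate genome size as a multiple of the rice genome; the sizes are the rounded values given in the text above (tef is omitted because it is reported as a range), while the dictionary and function names are illustrative choices only.

```python
# Approximate genome sizes in Mb, as quoted in this section.
GENOME_SIZE_MB = {
    "rice": 389,
    "sorghum": 730,
    "finger millet": 1760,
    "maize": 2300,
    "pearl millet": 4600,
    "barley": 5100,
    "wheat": 17000,
}


def fold_vs_reference(sizes, reference="rice"):
    """Return each genome size as a multiple of the reference genome size."""
    ref_size = sizes[reference]
    return {species: size / ref_size for species, size in sizes.items()}


ratios = fold_vs_reference(GENOME_SIZE_MB)
for species, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{species:>13}: {ratio:5.1f} x rice")
```

On these figures the wheat genome is roughly 44 times the size of the rice genome, which illustrates why a small, well-characterized genome is attractive as an entry point for studying larger ones.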

#### **5. Comparative genomics in cereal crops**

Core questions unanswered with traditional cereal biotechnology approaches include: (1) What are the genetic foundations that underlie the similarities between different grass species or individuals within a species? (2) What are the genetic variations responsible for the detected phenotypic differences? Comparative genomics is the branch of biology in which DNA sequence information from genomes of different life forms is compared in an effort to directly answer these questions. It rests on two main ideas. Firstly, comprehensive analysis and comparison of whole genomes can uncover the essentially conserved and the important variable components of any set of genomes [35]. Secondly, differences in genome sequence (genotype) contribute to differences in genome function and therefore explain differences between phenotypic traits [36]. The application of comparative genomic information to various plants, including cereals, has, however, previously been a challenge because of the large genome sizes of most species, which are complicated by high rates of structural rearrangements mainly due to transposable elements, duplications and inversions [35], as listed in Table 2.


**Table 2.** Genome size, structure and genomic resources of major cereal species.


The application of comparative genomics for crop improvement has evolved over time. In the grass family, significant research provided remarkable and comprehensive datasets demonstrating a high degree of collinearity or synteny among genomes at chromosome (macro) and gene (micro) levels [37, 38]. *Synteny*, from the Greek *syn* (together with) and *taenia* (ribbon), refers to loci contained within the same chromosome. Collinearity, on the other hand, refers to some degree of conservation of gene order between chromosomes of different species or between nonhomologous chromosomes of a single species [39]. A large number of sequences within the grass family have remained considerably conserved at the genome level over millions of years of evolution, irrespective of the differences in ploidy level, chromosome number and haploid DNA content [37]. This conservation of gene content and order at the megabase level makes it easy to use species with small genome sizes such as *Arabidopsis* and rice as model species for studying similar gene contents in other related species. Their applications include allele discovery, positional cloning, and comparative studies in related species [40]. There is, however, limited synteny and gene homology between *Arabidopsis* and rice, but extensive collinearity between the latter and other grasses, suggesting that rice is an appropriate grass model species for cereal comparative genomics [41]. In this case, rice and purple false brome (*Brachypodium distachyon*) (genome size ~355 Mb), both of which are from the grass family, serve as functional model species for cereal comparative genomics owing to their small and fully sequenced genomes. Moreover, *Brachypodium* showed conservation of gene content and family structure with rice and sorghum [42]. A phylogenetic study carried out on seven grass species also revealed a close evolutionary relationship of *Brachypodium* with maize, barley and wheat based on 335 commonly shared sequences [43].
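The notion of collinearity as conserved gene order can be illustrated with a small sketch. It assumes that putative orthologous gene pairs between two chromosomes have already been identified and that each gene is represented by its rank (order index) along its chromosome; it then groups pairs into runs in which the order is preserved, either in the same or in the inverted orientation. The function name, the `max_gap` and `min_genes` parameters and the example indices are hypothetical, and this greedy scan is a deliberate simplification and not the algorithm used by any particular synteny tool or by the studies cited here.

```python
def collinear_blocks(pairs, max_gap=1, min_genes=3):
    """Group ortholog pairs into runs of conserved gene order.

    pairs     -- (index_a, index_b) gene-order indices of putative
                 orthologs on one chromosome of species A and one of B
    max_gap   -- number of intervening genes tolerated between
                 consecutive anchors on either chromosome
    min_genes -- minimum number of anchors for a run to be reported
    """
    blocks, current, direction = [], [], 0
    for pair in sorted(pairs):
        if current:
            step_a = pair[0] - current[-1][0]
            step_b = pair[1] - current[-1][1]
            orientation = 1 if step_b > 0 else -1
            near = 0 < step_a <= max_gap + 1 and 0 < abs(step_b) <= max_gap + 1
            if near and direction in (0, orientation):
                current.append(pair)
                direction = orientation
                continue
            # The run is broken: keep it if long enough, then start anew.
            if len(current) >= min_genes:
                blocks.append(current)
            current, direction = [], 0
        current.append(pair)
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


# Hypothetical anchors: one run in the same order, one inverted run,
# and one isolated pair that does not form a block.
anchors = [(0, 10), (1, 11), (2, 12), (3, 13),
           (5, 40), (6, 39), (7, 38),
           (9, 2)]

blocks = collinear_blocks(anchors)
for block in blocks:
    orientation = "same" if block[-1][1] > block[0][1] else "inverted"
    print(f"{len(block)} genes, {orientation} orientation: {block}")
```

In this toy input, the first four anchors form a block in the same orientation and the next three form an inverted block, mirroring the kind of rearrangement that breaks collinearity between otherwise syntenic regions.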

Microcollinearity has numerous interesting applications in cereal genome analysis including the transfer of genetic markers between species and the identification of candidate genes across species borders [44]. It is possible, due to such advances, to intensively study, decipher and understand the genetic makeup of the cereal genomes including those of rice, maize, wheat, barley and sorghum [30, 45-47]. Comparing the gene sequences of these cereal crops is the initial step towards understanding their morphological and functional similarities and differences. Comparative analysis research has been extended to the DNA sequence (micro) level, to allow the investigation of conservation of coding and noncoding regions as well as characterization of molecular mechanisms of genome evolution [38].

#### **6. Several examples of macro- and microcollinearity in cereal crops**

The advent of molecular markers and molecular mapping allowed researchers to conduct comparative mapping research, comparing the order and content of genes and markers along chromosomes of related species. The first large-scale restriction fragment length polymorphism (RFLP) mapping studies in several economically important crops covered the genomes of wheat, rice, maize, oat and barley. They are benchmarks for the discovery of collinearity in the grass family [44]. In the past, exploiting RFLPs to compare genomes was a valuable method because the markers made it possible to map, for the first time, a huge number of randomly distributed polymorphic loci in a single population and provided the foundation for efficient, whole-genome studies at the molecular level [48]. The application of RFLP technology in comparative genome analysis studies revealed that an extensive commonality in gene content and arrangement was a basic chromosomal property, thus prompting the idea that the genetic map could be used to tie all grasses into a single model system. This led to the construction of a consensus grass map based on 25 rice linkage blocks [37, 38]. The resolution of the genetic maps, however, proved to be very low, with an average of one marker every 5 to 10 centimorgans (cM), allowing the detection of only large rearrangements. The RFLP markers used to construct the maps were also low-copy, therefore limiting the detection of small deletions, inversions and whole or partial genome duplication events [49]. The use of RFLP markers for comparative mapping also made it difficult to assess orthologous (derived from a common ancestor by speciation) and paralogous (derived by duplication within one genome) relationships in gene families. Given these challenges associated with traditional genotyping, the NGS techniques discussed above are expected to advance comparative genomics because they provide actual DNA sequences that allow interspecies or intergeneric comparisons.

Traditional genome analyses have provided sufficient evidence that cereal genomes share conserved regions at either macro or micro levels. For example, a comparative genomics study on rice and maize indicated high levels of collinearity between the two genomes, with some chromosomes or their arms, accounting for at least 67% of the two genomes, having almost the same gene order and sequences [46]. Similarly, large proportions of conserved regions between rice and wheat chromosomes were identified, with major differences arising from chromosomal rearrangements [40, 45]. Conservation of about 24% of grass-specific gene orders has been reported in sorghum [30], including high collinearity with rice [50]. Thus, sorghum can also serve as a model species for cereal genomic studies due to its relatively small genome size and wide adaptability. High levels of microcollinearity have been demonstrated between chromosome 6 of rice and the telomeric regions of barley chromosome 1P, which further confirms the usefulness of mapping the small rice genome for map-based cloning of important genes in complex genomes [47]. Figures 1 and 2 illustrate the conservation of synteny and collinearity among different cereals by revealing the syntenic relationships between chromosomes of cereal crops. Furthermore, Figure 2B reveals that the 10 maize progenitor chromosomes and the 10 linkage groups of sorghum appear to be similar, thus exposing their evolutionary divergence from rice, which could be their common ancestor before speciation [51]. The study of such evolutionary relationships and of the changes that occurred after cereals diverged from their progenitors will be further enhanced through comparative genomics integrated with NGS and next-next or third-generation sequencing techniques, which can generate higher-resolution physical maps. Availability of updated genome sequences will expose the multiple breaks in collinearity occurring in the genome compositions due to structural rearrangements caused by transposable elements, inversions, deletions and duplications. The macro- and microcollinearities described in this section are reflected in the observed phenotypic similarities that exist among different cereal species.


**Figure 1.** Microsynteny conservation between sorghum and rice.

**Figure 2.** Conservation and changes in rice, maize, sorghum and wheat chromosomes during cereal speciation.

#### **7. Phenotypic commonality in cereals**

The conservation of synteny and collinearity of genes among cereals is reflected in their common phenotypic features, which are evidence that they share common ancestry, while their differences mainly stem from chromosomal rearrangements and polyploidization, as shown in Figure 2. Their morphological similarity (Figure 3) provides further evidence of shared ancestry. Based on phenotype alone, most cereals share similar rooting systems, leaf venation, flowering habits, tillering, inflorescences, physiological behavior such as vernalization requirements, and adaptation to biotic and abiotic stresses. For example, some cereals are hosts of common diseases, as in the case of maize streak virus (MSV), wheat streak mosaic virus (WSMV) and rusts [52, 53], while others are nonhosts, as in the case of rice to rusts. The differences in phenotype and genome structure among these species could be due to mutations, breaks in collinearity and loss of synteny that occurred in their genomes over millions of years. Such differences can be traced through comparative genomic analysis, particularly with the aid of high-throughput sequencing techniques. Likewise, the similarity in phenotype and genome structure could be due to common ancestry (Figures 2 and 3). Taken together, these observations indicate that some phenotypes, along with gene orders and sequences, have been conserved over millions of years.


**Figure 3.** Phenotypic commonality in cereals. Similar seedlings: (A1) rice seedlings, (B1) wheat seedlings, (C1) barley seedlings, (D1) rye seedlings, (E1) oat seedlings, (F1) pearl millet seedlings, (G1) finger millet seedlings, (H1) sorghum seedlings, (I1) tef seedlings, (J1) maize seedlings. Similar heads: (A2) rice heads, (B2) wheat heads, (C2) barley heads, (D2) rye heads, (E2) oat heads, (F2) pearl millet heads, (G2) finger millet heads, (H2) sorghum heads, (I2) tef heads, (J2) maize heads.

#### **8. Outlook**

Source: Wei [51].


Plant species have highly conserved regions at the DNA sequence level, whereas the bulk of the large genomes consists of repetitive DNA sequences, most of which are species-specific. Comparative genomics has opened new avenues for map-based positional cloning of genes encoding important traits in large and intricate genomes by investigating small and less complex genomes. In grasses, rice and *Brachypodium* have been identified as model species for such research since they have small and stable genomes. This, however, requires the integration of NGS techniques so that all the conserved and nonconserved regions can be fully sequenced and annotated with the aid of other "omic" technologies. Hence, the future of comparative genomics studies in cereals will largely rely on cost-effective sequencing technologies along with computational systems that handle large numbers of sequences, thus allowing effective sequence comparisons across species of interest. The substantial evidence for a common ancestry of cereals, based on genome and morphological structures, has led to the successful use of the genome sequence of one species to shed light on the function of that sequence in other related species. A wide adoption of this approach across different cereals will speed up gains and generate useful databases and datasets for effective cereal breeding. Furthermore, owing to increased sequencing throughput, researchers will be able to use other widely adapted cereals like sorghum and some of the under-researched cereals as models for sequencing genes and alleles responsible for unique traits such as wide adaptation to stress-prone environments. There is, however, a need to invest in advanced computational and bioinformatics tools to handle and analyze the huge datasets that will be generated through these technological advances.

#### **Acknowledgements**

The authors would like to thank Damien Shumbusha (Rwanda Agriculture Board), Andre Malan (Agricultural Research Council - Small Grain Institute), Hussein Shimelis (African Centre for Crop Improvement, University of KwaZulu-Natal), Caleb Souta (Seed Co. Limited, Zimbabwe), Lydia Ndinelao Horn (Ministry of Agriculture, Water and Forestry, Namibia), Cousin Musvosvi (Department of Crop Science, University of Zimbabwe) and Kingstone Mashingaidze (Agricultural Research Council - Grain Crops Institute) for providing seeds or images used to illustrate phenotypic commonalities in cereals.

#### **Author details**

Thandeka N. Sikhakhane1,2, Sandiswa Figlan1, Learnmore Mwadzingeni1, Rodomiro Ortiz3 and Toi J. Tsilo1,2,4\*

\*Address all correspondence to: tsilot@arc.agric.za

1 Agricultural Research Council, Small Grain Institute, Bethlehem, South Africa

2 Department of Life and Consumer Sciences, University of South Africa, Pretoria, South Africa

3 Department of Plant Breeding, Swedish University of Agricultural Sciences, Alnarp, Sweden

4 Department of Plant Production, University of Venda, Thohoyandou, South Africa

#### **References**



[14] Edwards D, J Batley, and RJ Snowdon. Accessing complex crop genomes with next-generation sequencing. Theoretical and Applied Genetics, 2013. 126(1): p. 1-11. DOI: 10.1007/s00122-012-1964-x

[15] Taft RJ, et al. Non-coding RNAs: regulators of disease. The Journal of Pathology, 2010. 220(2): p. 126-139. DOI: 10.1002/path.2638

[16] Sanger F, S Nicklen, and AR Coulson. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 1977. 74(12): p. 5463-5467.

[17] Roche. GS FLX+ System. Sanger-like read lengths - the power of next-gen throughput. Roche Diagnostics GmbH; Available from: http://454.com/downloads/GSFLXApplicationFlyer\_FINALv2.pdf [Accessed: 2015-09-05].

[18] Illumina. Next-Generation Sequencing 2015; Available from: http://www.illumina.com/ [Accessed: 2015-09-05].

[19] TFS. SOLiD® Next-Generation Sequencing 2015; Available from: https://www.lifetechnologies.com/za/en/home/life-science/sequencing/next-generation-sequencing/solid-next-generation-sequencing.html [Accessed: 2015-09-05].

[20] SeqLL. Helicos Genetic Analysis Aool 2014; Available from: http://seqll.com/ [Accessed: 210-09-05].

[21] TFS. Ion Personal Genome Machine® (PGM™) System 2015; Available from: http://www.lifetechnologies.com/order/catalog/product/4462921 [Accessed: 2015-09-05].

[22] Mwadzingeni L, et al. Breeding wheat for drought tolerance: Progress and technologies. Journal of Integrative Agriculture, 2015. DOI: 10.1016/s2095-3119(15)61102-9

[23] Harismendy O, et al. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome biol, 2009. 10(3): p. R32.

[24] IRGSP. The map-based sequence of the rice genome. Nature, 2005. 436(7052): p. 793-800. DOI: 10.1038/nature03895

[25] Kawahara Y, et al. Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice, 2013. 6(1): p. 4.

[26] Brenchley R, et al. Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature, 2012. 491(7426): p. 705-710. DOI: 10.1038/nature11650

[27] Schnable PS, et al. The B73 maize genome: complexity, diversity, and dynamics. Science, 2009. 326(5956): p. 1112-1115. DOI: 10.1126/science.1178534

[28] Tenaillon MI, et al. Genome size and transposable element content as determined by high-throughput sequencing in maize and Zea luxurians. Genome Biology and Evolution, 2011. 3: p. 219-229. DOI: 10.1093/gbe/evr008

[29] IBGSC. A physical, genetic and functional sequence assembly of the barley genome. Nature, 2012. 491(7426): p. 711-716. DOI: 10.1038/nature11543

[30] Paterson AH, et al. The Sorghum bicolor genome and the diversification of grasses. Nature, 2009. 457(7229): p. 551-556. DOI: 10.1038/nature07723


[46] Ahn S and SD Tanksley. Comparative linkage maps of the rice and maize genomes. Proceedings of the National Academy of Sciences, 1993. 90(17): p. 7980-7984. http://www.pnas.org/content/90/17/7980.abstract

[47] Kilian A, et al. Rice-barley synteny and its application to saturation mapping of the barley Rpg1 region. Nucleic Acids Research, 1995. 23(14): p. 2729-2733. DOI: 10.1093/nar/23.14.2729

[48] McCouch SR. Genomics and synteny. Plant Physiology, 2001. 125(1): p. 152-155.

[49] Paterson AH, et al. Comparative genomics of plant chromosomes. The Plant Cell, 2000. 12(9): p. 1523-1539.

[50] Bowers JE, et al. Comparative physical mapping links conservation of microsynteny to chromosome structure and recombination in grasses. Proceedings of the National Academy of Sciences of the United States of America, 2005. 102(37): p. 13206-13211. DOI: 10.1073/pnas.0502365102

[51] Wei F, et al. Physical and genetic structure of the maize genome reflects its complex evolutionary history. PLoS Genet, 2007. 3(7): p. e123. DOI: 10.1371/journal.pgen.0030123

[52] Cummins GB. The rust fungi of cereals, grasses and bamboos. 1971: Berlin, Springer-Verlag.

[53] Brakke M. Wheat streak mosaic virus. CMI/AAB descriptions of plant viruses, 1971. 48: p. 1-4.

## **Strategies for Sequence Assembly of Plant Genomes**

Stéphane Deschamps and Victor Llaca

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/61927

#### **Abstract**

The field of plant genome assembly has greatly benefited from the development and widespread adoption of next-generation DNA sequencing platforms. Very high sequencing throughputs and low costs per nucleotide have considerably reduced the technical and budgetary constraints associated with early assembly projects done primarily with a traditional Sanger-based approach. Those improvements led to a sharp increase in the number of plant genomes being sequenced, including large and complex genomes of economically important crops. Although next-generation DNA sequencing has considerably improved our understanding of the overall structure and dynamics of many plant genomes, severe limitations still remain because next-generation DNA sequencing reads typically are shorter than Sanger reads. In addition, the software tools used to *de novo* assemble sequences are not necessarily designed to optimize the use of short reads. These cause challenges, common to many plant species with large genome sizes, high repeat contents, polyploidy and genome-wide duplications. This chapter provides an overview of historical and current methods used to sequence and assemble plant genomes, along with new solutions offered by the emergence of technologies such as single molecule sequencing and optical mapping to address the limitations of current sequence assemblies.

**Keywords:** Sequencing, Plant, Genome, Assembly

#### **1. Introduction**

Genome sequencing, assembly and annotation have been major priorities in plant genetics research during the past 20 years. The release of a draft reference genome has typically constituted a major milestone and has proven invaluable for the analysis and characterization of genome architecture, genes and their expression, diversity and evolution [1–5]. The expansion of sequence information in a growing number of taxa has contributed to comparative studies and the implementation of molecular breeding and biotechnology approaches for crop improvement [6, 7]. The construction of the first plant genomes was made possible by applying considerable resources, coordination and effort to enabling automated Sanger-based sequencing technologies and computational algorithms. Starting in 2005, a series of technological revolutions in DNA sequencing, driven in large part by the goal of affordable personalized genome sequencing, radically changed the sequencing model. First, new technologies drastically increased throughput while reducing the cost and time of data collection. Additional technologies then enabled long single-molecule reads and algorithms that were more suitable for resolving complex genomes [8, 9].

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In addition to these advances, the genomics community has benefited from the development and implementation of complementary mapping technologies and methods that have facilitated the scaffolding of sequences and their integration with genetic maps. This review provides a historical and technical perspective of methods and technologies applied to genome reference assembly in plants as well as current advances and future directions.

### **2. The development of Sanger sequencing for** *de novo* **assembly of plant genomes**

The construction of reference genomes was initially enabled by technological advances in sequencing using the Sanger method [10]. During the 1980s and 1990s, the introduction of thermal cycle sequencing, single-tube reactions and fluorescence-tagged terminator chemistry [11] facilitated the development of high-capacity sequencing platforms. Additional improvements in parallelization, base quality assessment, read length and cost-effectiveness were achieved by the development of automatic base-calling and capillary electrophoresis [12, 13]. With no major modifications in recent years, automated high-throughput Sanger sequencing is performed by parallel reactions that include a mixture of the DNA template, primer, DNA polymerase, and deoxynucleotides (dNTPs). A proportion of dideoxynucleotide terminators (ddNTPs) is included in the reaction, each labelled with a different fluorescent dye. DNA molecules are extended from templates using a thermal cycling reaction and terminated by random incorporation of the labelled ddNTPs, which are detected by laser excitation of the fluorescent labels after capillary-based electrophoresis. The differences in dye excitation profiles are recorded and translated by a computer to generate the sequence. Primary analysis software then calls nucleotides from the raw sequences, assigning a corresponding quality score at each position [6, 14].
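A widely used convention for such per-base quality scores is the Phred scale, in which a score Q corresponds to an estimated base-calling error probability P through Q = -10 log10(P). The short sketch below converts between the two; the function names are hypothetical and chosen only for illustration.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score corresponding to an estimated error probability p."""
    return -10 * math.log10(p)


# Print the error probability implied by a few commonly quoted scores.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))
```

Under this scale, an error rate of 1 in 10,000 bases corresponds to a score of 40, and each 10-point increase means a 10-fold lower estimated error probability.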

The complete sequencing of the first bacterial genomes [15, 16] as well as the creation of initiatives aimed at sequencing the genomes of *Saccharomyces cerevisiae*, *Caenorhabditis elegans*, *Drosophila melanogaster* and *Homo sapiens* provided the technical and technological framework for the initial sequencing of genomes in plants [17–21]. These projects validated the idea of applying a scaled-up form of shotgun sequencing [22]. Shotgun sequencing relied on computer algorithms to enable *in silico* assembly of overlapping sequencing reads derived from randomly generated subclones. The development of software suites such as Phred, Phrap and Consed [23] allowed calling bases, setting individual base quality, assembling overlapping reads, assigning assembly quality scores, viewing final assemblies and extracting consensus sequences. Two major genomic shotgun sequencing strategies were defined at that time: (1) whole-genome shotgun sequencing (WGS) and (2) clone-by-clone, also referred to as BAC-by-BAC sequencing. In WGS, genomic DNA is randomly sheared and the ends of the cloned fragments are directly sequenced and assembled. This strategy is the simplest, and it was initially used for small bacterial and yeast genomes. Later, it was also used in *D. melanogaster* and in one of the two initiatives aimed at sequencing and assembling the reference human genome [19, 21]. Major improvements to *de novo* WGS assembly came from using strategies that relied on paired-end reads from multiple libraries with different average insert sizes and from the optimization of software with algorithms that use end-sequence distance information from these libraries.
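The idea of *in silico* assembly of overlapping reads can be sketched with a toy greedy overlap-merge procedure. This is a simplified illustration under strong assumptions (error-free reads, a single strand, exact suffix-prefix matches), not the algorithm implemented in Phrap or any other production assembler; the function names and the `min_len` parameter are hypothetical.

```python
from typing import List


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no exact overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Three toy reads sampled from the sequence ATGGCGTACCTGA.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Greedy merging of this kind is easily misled by repeats, since a repeated segment creates overlaps between reads that are not adjacent in the genome. That is one reason why paired-end distance information and clone-based strategies matter for repeat-rich plant genomes.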


The second Sanger sequence assembly strategy, clone-by-clone, was successfully deployed in projects aimed at complex eukaryotic genomes. In clone-by-clone genome assembly, shotgun sequencing is performed on libraries derived from individual large-insert genomic clones, selected in a minimum tiling path according to physical and genetic map information [24, 25]. The most common type of large-insert clone is the bacterial artificial chromosome (BAC), which can stably carry genomic inserts ranging from 100 to 300 kb and is relatively easy to maintain and purify. Accordingly, this method is usually referred to as BAC-by-BAC, although additional vector systems have been used in assembly projects, including yeast artificial chromosomes (YACs), P1 artificial chromosomes (PACs), transformation-competent artificial chromosomes (TACs), cosmids and fosmids. The two major genomic shotgun-sequencing approaches, WGS and BAC-by-BAC, had advantages and disadvantages when applied to Sanger-based sequencing platforms, depending on the genome of interest. The clone-by-clone approach benefited from working in small units, effectively reducing complexity and computational requirements. This approach minimized problems associated with the misassembly of highly repetitive DNA and therefore provided a better, more complete assembly in plants and other complex eukaryotic genomes. WGS projects were computationally intensive and were less effective at bridging repetitive regions in complex genomes, but benefited from considerably lower cost, time and logistics [14].
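Selecting a minimum tiling path can be viewed as an interval-covering problem once clones have been placed on a physical map. The sketch below uses a simple greedy rule: among the clones that start at or before the position covered so far, pick the one that extends farthest. The clone names, coordinates and the function `minimum_tiling_path` are hypothetical, and real projects also weigh fingerprint confidence and required overlap between neighbouring clones, which this example does not model.

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]  # (name, start, end) on the physical map


def minimum_tiling_path(clones: List[Clone],
                        region_start: int, region_end: int) -> List[Clone]:
    """Greedy selection of a small set of clones that covers a region."""
    path: List[Clone] = []
    covered_to = region_start
    while covered_to < region_end:
        # Clones that begin inside the covered part and extend beyond it.
        candidates = [c for c in clones
                      if c[1] <= covered_to and c[2] > covered_to]
        if not candidates:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        best = max(candidates, key=lambda c: c[2])
        path.append(best)
        covered_to = best[2]
    return path


# Hypothetical BAC placements (coordinates in kb) on a 400-kb region.
bacs = [("A", 0, 150), ("B", 50, 200), ("C", 120, 300),
        ("D", 140, 260), ("E", 280, 420), ("F", 290, 400)]
print([name for name, _, _ in minimum_tiling_path(bacs, 0, 400)])
```

For the toy placements above the greedy rule selects clones A, C and E, so only three of the six mapped clones would need to be shotgun sequenced to cover the region.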

The first completed reference plant genome was from the model system *Arabidopsis thaliana*, accession Columbia [26]. At that time, it was only the third multicellular eukaryotic genome to be published, after *C. elegans* and *D. melanogaster*. The nuclear genome of Arabidopsis is distributed in five chromosomes, and it is only approximately 4% the size of the human genome. The *A. thaliana* genome initiative used multiple types of available large-insert libraries including cosmids, BACs, PACs and TACs. Shotgun clones were constructed and then mapped by restriction fragment fingerprinting as well as by screening with hybridization or polymerase chain reaction (PCR) markers. End sequences for 47,788 BAC clones were further used to anchor clones, integrate contigs and help select a minimum tiling path. In total, 1,569 clones in a minimum tiling path were selected, sequenced bidirectionally and assembled at estimated error rates of less than 1 in 10,000 bases. Direct PCR products were used to close some gaps, and YACs allowed the characterization of telomere sequences. As initially published, the total length of sequenced regions was 115.4 Mb, in addition to an estimated 10 Mb of nonsequenced centromeric and rDNA repeat regions. Since the original publication, the Arabidopsis genome sequence reference has been subjected to several rounds of improvements, each time reducing gaps and extending the sequence towards the centromeric regions [27].

The second published plant genome reference was rice (*Oryza sativa*). While the rice genome, at approximately 390 Mb, is more than 2-fold the size of that of Arabidopsis, it is one of the smallest genomes of any major crop, less than 15% the size of the human genome. Like Arabidopsis, the rice genome was completed using a Sanger-only clone-by-clone approach [28] that required the initial construction, fingerprinting and physical mapping of a large number of random BACs and PACs. In total, 3,401 mapped clones in a minimum tiling path were selected from the physical map, randomly sheared and individually end-sequenced to approximately 10-fold coverage. Clone sequences were assembled, and gaps and low-quality regions were then resolved by targeted sequencing of PCR fragments, plasmids and fosmids.
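The meaning of a figure such as 10-fold coverage can be made concrete with a little arithmetic. The sketch below estimates how many reads are needed to reach a given average coverage of one clone, and uses the standard Lander-Waterman approximation that the fraction of bases left uncovered by randomly placed reads is about e^(-c) at coverage c. The 150-kb insert size and 700-bp read length are assumed values for illustration only and are not taken from the rice project.

```python
import math


def reads_for_coverage(target_length_bp: int, read_length_bp: int,
                       coverage: float) -> int:
    """Number of reads needed so that reads * read length = coverage * target."""
    return math.ceil(coverage * target_length_bp / read_length_bp)


def expected_uncovered_fraction(coverage: float) -> float:
    """Lander-Waterman estimate of the fraction of bases hit by no read."""
    return math.exp(-coverage)


# Assumed example: one 150-kb BAC insert sequenced with 700-bp reads.
print(reads_for_coverage(150_000, 700, 10))
print(expected_uncovered_fraction(10))
```

Under these assumptions a single clone needs roughly two thousand reads, and fewer than 1 base in 10,000 is expected to remain unsampled. In practice, cloning bias and repeats leave more gaps than this idealized estimate suggests, which is consistent with the need for targeted finishing.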

The draft reference genome of maize, one of the most important crops in the world, is considered the last major published plant genome project based primarily on a Sanger BAC-by-BAC strategy [29]. At 2.3 Gb and spanning 10 chromosomes, the nuclear genome of maize is considerably larger than those of rice and Arabidopsis, approximately 3/4 the size of the human genome. A set of 16,848 minimally overlapping BAC clones, derived from an integrated physical and genetic map, were selected and end-sequenced. The assembly was performed after adding data derived from cDNA sequences and sequences from subtractive libraries with methyl-filtered DNA and high C0t techniques, resulting in a whole-genome assembly (B73 RefGen\_v1) made of 2,048 Mb in 125,325 sequence contigs and 61,161 scaffolds [29]. Unlike the completed genomes of rice and Arabidopsis, most sequenced BACs in the first version of the maize draft genome are unfinished. Gaps and low-quality regions in BACs were not systematically closed by PCR sequencing or other targeted approaches. Therefore, while the BACs used in the minimum tiling path were mapped, the order and orientation of individual contigs within a single BAC could be incorrect. Subsequent versions of the genome have been improved by targeting gaps and adding alternative sequencing strategies described later in this review.

Finally, it is important to mention that a significant number of plant genome sequencing initiatives have used WGS strategies, which considerably reduce the time and cost associated with clone library construction, mapping and selection. Sanger WGS genome projects included those of poplar tree, grape, and papaya [30–32]. Later refinements to the process enabled the sequencing of *Brachypodium distachyon* [33] as well as the larger genomes of *Sorghum bicolor* (~730 Mb) [34] and soybean, an ancestral tetraploid (1.1 Gb) [35]. It should be noted that, as demonstrated by the maize genome project, the two Sanger shotgun assembly approaches, as well as later sequencing technologies, are not mutually exclusive and may be combined to increase quality and coverage.

The high cost and logistics of plant projects based on clone-by-clone Sanger sequencing required extensive funding, the creation of large collaborative consortia and several years of fingerprinting and sequencing work. The cost of the project by the Arabidopsis Genome Initiative has been estimated at US\$70 million [36]. The International Rice Genome Sequencing Project (IRGSP), which included groups from 11 different nations, took over 5 years to complete. During its early stages, IRGSP had estimated that the project would take 10 years and cost a staggering US\$200 million [37]. The maize draft genome was accomplished by multiple laboratories at an estimated cost of tens of millions of dollars in a joint NSF/DOE/USDA program. It is worth noting that, while the cost and time required to accomplish Sanger WGS projects are in fact lower than those based on a clone-by-clone approach, they are still considerable by today's standards. The sequencing of the 1.1-Gb soybean genome, the largest published plant genome based on a Sanger WGS approach, provides an example of such a cost. It was completed in less than two years, although it took a group of 18 institutions several million dollars to generate and assemble more than 15 million Sanger reads from multiple libraries with average insert sizes ranging from 3.3 kb to 135 kb [35].

Besides cost and time considerations, these early Sanger-only projects posed considerable technical challenges. Despite the extensive resources deployed towards the sequencing of the Arabidopsis and rice genomes, which are usually considered finished, these genomes, like those of the other projects mentioned in this review, all have representation gaps. A considerable number of gaps correspond to regions that are "unclonable" under the conditions used to prepare BAC and other genomic libraries. Although many of these regions correspond to tandem repeats such as telomeric sequences and other repetitive regions, they may also include gene space [29]. Moreover, the maximum length of quality Sanger reads, usually 800–900 bp, as well as technical issues associated with the sequencing of DNA stretches with strong secondary structures or extensive homopolymers, create conditions for additional sequencing gaps, even in regions with physical coverage.

Finally, most plant genomes are characterized by elevated proportions of highly repetitive DNA and by the presence of segmental duplications or full genome duplications due to polyploidization events [38], which can be problematic during assembly. The 1C genome content in maize, for example, is smaller than in humans but it consists of higher proportions and larger tracts of high-copy elements such as retrotransposable elements [29, 38]. At least some of the differences between the assembled and estimated genome sizes of the maize line B73 could be attributed to the assembly-based collapse of highly similar long terminal repeats (LTRs) at the ends of retrotransposons. It is important to note that all the Sanger-only initiatives corresponded to plant species with genomes that were considerably smaller than the average 5.8-Gb plant genome. Plant genomes span a considerably wider size range than mammalian genomes, and in some important crops (e.g. wheat), nuclear genomes can be more than 15 Gb long, well beyond the practical realm of Sanger sequencing. Although BAC-by-BAC approaches can reduce complexity by more than 10,000-fold, Sanger-based assembly remains difficult and prohibitively expensive in plant genomes of moderate or large size. The WGS approach is even more sensitive to the complexity of plant genomes, as it increases the potential for assembly artefacts due to haplotype and homeolog collapse in regions with high identity. Reductions in time and cost in WGS projects are achieved at the expense of assembly fidelity in repetitive regions and an expanded need for computational resources.

#### **3. Next-generation sequencing technologies applied to** *de novo* **assembly of plant genomes**

#### **3.1. Second-generation sequencing technologies**

As indicated above, successful whole-genome sequencing projects have been achieved with the use of Sanger technology. However, such projects require dealing with several complicating factors, including high costs and relatively long turnaround times to completion. The emergence of next-generation sequencing (NGS) technologies has changed this paradigm, both by reducing costs and increasing sequencing throughput, while at the same time introducing complexity related to the relatively short length of NGS reads. Several NGS technologies have emerged in the past 7 to 8 years [for reviews, see refs. 39–41]. All follow a relatively uniform approach to library construction and sequencing. To complete sequencing: (1) universal adapters are ligated at the ends of single DNA molecule templates; (2) adapter-ligated DNA templates are amplified via PCR to create clusters of identical copies and (3) clusters are loaded on sequencers and nucleotide incorporations occur in parallel on millions of clusters. These generate an amplified signal that is recognized by the platform and translated into a base call.

The most widely used NGS technology today is the one commercialized by Illumina [42], whose high-throughput instrument, the HiSeq4000, can produce up to 1.5 Tb of sequencing data in approximately 3.5 days. In the Illumina sequencing platform, sequencing templates generated during library construction are immobilized on a solid surface, and a "bridge PCR" approach allows for the localized amplification of millions of single DNA molecules, thus generating millions of clusters, each containing thousands of copies of the original DNA molecule [43]. Sequencing is then performed using a sequencing-by-synthesis approach where single-base extension allows the incorporation of a fluorescently labelled nucleotide (a blocking chemical moiety at the 3' hydroxyl end allows the incorporation of one base only). Once incorporated, the label is detected and the resulting signal subsequently translated into a base call. Finally, the fluorescent dye and the blocking 3' agent are cleaved, allowing the next single-base incorporation event to occur. Through the use of alternating cycles of base incorporation, image capture and dye cleavage, the Illumina sequencing technology can produce reads that are up to 300 bp in length. The relatively high error rate (~0.1% or 10 times higher than Sanger sequencing) [39] can be compensated for by very high sequencing coverage, which allows random errors at any given base position to be ignored below a certain frequency threshold. The relatively short length of Illumina reads can be explained by several noise factors that accumulate after each cycle, including phasing, where imperfect single-base incorporation and imperfect cleavage of the dye and 3' hydroxyl blocking moiety lead to the accumulation of copies of various lengths within a cluster, and to a progressive decrease in signal-to-noise ratio after each cycle [44].
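The idea of ignoring random errors below a frequency threshold can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by any particular base caller or variant caller: the function name `consensus_base`, the representation of a pileup as a plain string of bases, and the 20% threshold are all assumptions made for the example.

```python
from collections import Counter


def consensus_base(pileup, min_fraction=0.2):
    """Return the most frequent base at one position, plus any other
    bases that reach `min_fraction` of the depth.

    `pileup` is a string holding the base observed in each read that
    covers the position (a hypothetical, simplified input format).
    `min_fraction` is an illustrative threshold, not a recommended value.
    """
    if not pileup:
        raise ValueError("no reads cover this position")
    counts = Counter(pileup)
    depth = len(pileup)
    major, _ = counts.most_common(1)[0]
    # Minor bases below the threshold are treated as random errors.
    retained_minor = {
        base: n for base, n in counts.items()
        if base != major and n / depth >= min_fraction
    }
    return major, retained_minor


# 30x coverage with a single discordant read: the error is ignored.
print(consensus_base("A" * 29 + "C"))
# 30x coverage with a balanced second base: it is retained for review.
print(consensus_base("A" * 15 + "G" * 15))
```

With deep coverage, an isolated miscall contributes only a small fraction of the observations at a position and falls below the threshold, whereas a base seen in a large share of reads is kept as a candidate for genuine variation.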

#### **3.2. Third-generation sequencing technologies**

*De novo* assemblies of plant genomes have been performed with NGS reads only, either with reads generated on the Illumina platform alone or with Illumina reads combined with reads generated on the Roche 454 second-generation sequencing platform [45]. However, those assemblies are generally fragmented, with low N50 values and a high number of contigs, mostly because of the overall short read length, the complexity of the genome and the presence of conserved regions whose length exceeds that of NGS reads and which therefore cannot be extended during the *de novo* assembly process. The emergence of third-generation sequencing technologies [46, 47] has started to address some of the inherent limitations of sequencing and assembling large and complex plant genomes. Those technologies are characterized by the parallel sequencing of single molecules of DNA (rather than "clusters"), thus avoiding phasing issues. The resulting reads tend to be in the kb range, which makes it possible to span complex and conserved genomic regions, to generate longer contigs and to obtain relatively high-confidence assemblies of overlapping reads. However, single sequencing reads tend to exhibit relatively high error rates (~15%–25% on average). Deep sequencing coverage or repeated sequencing of the same DNA fragments is therefore required to offset the high number of sequencing errors [48, 49]. As of today, two companies have developed and commercialized third-generation sequencing technologies, namely, Pacific Biosciences [e.g., 50] (Menlo Park, CA) and Oxford Nanopore Technologies [e.g., 51] (Oxford, UK). The two companies use vastly different approaches to sequencing.

The Pacific Biosciences (PacBio) RS II system uses a sequencing-by-synthesis approach to offer reads of up to ~40 kb, where base incorporation is monitored in real time. Nanoscale holes, described as zero-mode waveguides (ZMWs), are located on a chip, and individual polymerases are covalently attached to the surface of each ZMW. Individual nucleotides with a fluorescent label attached to the phosphate chain are incorporated into the elongating strand, and the excited dye emits a signal that is captured before diffusion of the released pyrophosphate and translated into a specific base call. DNA fragments used as templates are ligated to "bell-shaped" adapters at both ends, which allows the same fragment to be sequenced through multiple passes and a more accurate consensus sequence to be built. The overall stability and activity of the polymerase remain limited by photodamage and the progressive dissociation of the polymerase/template complex from the surface of the ZMW. It is therefore expected that reads generated from smaller DNA fragments, which can be traversed more times, will exhibit higher consensus accuracy than reads from larger DNA fragments.

Oxford Nanopore Technologies released the MinION sequencing device in early access mode in 2014. Like the PacBio RS II system, the MinION delivers long reads in real time from single molecules of DNA. In this case, however, sequencing is performed by measuring the change in ionic current when a single DNA strand translocates through a protein nanopore located in an insulated membrane. The resulting signal is measured and translated into a base call. Because base detection does not depend on DNA synthesis by a polymerase, it is expected that read length will be driven mostly by the physical length of the DNA strand being sequenced. Library construction involves the ligation of two types of adapters to each DNA fragment: one "Y-shaped" adapter with a bound protein that unwinds the double-stranded DNA and facilitates the translocation of a single strand through the pore, and one "bell-shaped" adapter at the other end that allows the translocation, and sequencing, of both the sense and antisense strands. Sequencing reads are then generated by aligning base calls from the two strands and producing a higher quality consensus sequence.
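The N50 statistic used above to describe fragmented short-read assemblies can be computed directly from a list of contig lengths. The sketch below uses the common definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly); the function name and the toy contig lengths are illustrative assumptions, not values from any assembly discussed here.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    total assembly length.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Same total length (20 units), different contiguity.
fragmented = [1] * 20          # many short contigs
contiguous = [2, 3, 4, 5, 6]   # fewer, longer contigs
print(n50(fragmented), n50(contiguous))
```

Two assemblies with the same total length can therefore have very different N50 values, which is why the statistic is reported alongside the number of contigs or scaffolds.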

#### **3.3. Challenges in assembling plant genomes**

*De novo* assembly of genomes has closely followed the trends and improvements in sequencing technologies and the accompanying assembly software over the years [45]. The emergence of next-generation sequencing technologies has allowed a much larger number of plant genomes to be sequenced and assembled than would have been deemed possible with Sanger sequencing alone, mostly because of the costs and labor involved in such projects. However, the complexity of the majority of those genomes still makes it a challenge to resolve them with short reads alone [52, 53]. As a result, most plant genome assemblies are highly fragmented, with a large number of contigs and conserved regions of the genome in an unfinished state [54].

The presence of highly conserved repeats, often exceeding 10 kb in length, represents a major challenge in assembling plant genomes. The most common repeats in plants are class I long-terminal repeat (LTR) retrotransposons, and their proliferation within a genome often explains most of the structural variation between strains [55]. Their movement also results in genome expansion, with repeats representing, in some instances, more than 80–90% of the content of a particular genome [29]. Repeat expansion can also lead to very large genome sizes. While NGS technologies can generate enough raw data to cover an entire genome in a relatively cost-effective manner, assembling such a large amount of data often represents a major computational challenge. For example, the assembly of the loblolly pine genome (~22 Gb), the largest genome assembled to date, could be achieved only by using condensed sets and read pooling prior to assembly [56]. Assembling large and repeat-rich genomes can also be facilitated by using supplemental layers of information, such as the physical distance between "paired" reads (end-sequences generated at both ends of a particular DNA fragment) in mate-pair libraries.

Another challenge for *de novo* assembly of plant genomes is polyploidy [57]. Polyploidy is an important force in plant genome evolution: it is estimated that ~80% of all living plants are polyploids [58], while close to 100% of all plant lineages have a paleo-polyploidy event in their history. As a consequence, some plant species, including economically important crop species like soybean [35], have entire gene families consisting of highly similar paralogs. Those gene families are the direct result of paleo-polyploidization events where the merger of genomes has been followed by extensive structural rearrangements, including gene loss, and the modification of gene expression for paralogs within a particular gene family. The diploid genomes of progenitor species can be used to determine the origin and structure of contigs when assembling large polyploid genomes [59].

Finally, heterozygosity may represent another important challenge when assembling plant genomes. Outcrossing species like grape, for instance, exhibit up to 13% sequence divergence between alleles, and such variation will affect contig assembly when both alleles are sequenced in a whole-genome assembly project [31].
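The repeat problem described in this section can be made concrete with a toy example. In the sketch below, a made-up 34-bp sequence contains two identical copies of an 8-bp repeat with different flanks, and the word length `k` stands in for the length of sequence an assembler can use to decide how to extend a contig. The sequence, the function name and the use of simple k-mer successors (rather than a full assembly graph) are all assumptions made for illustration.

```python
from collections import defaultdict


def branching_kmers(genome, k):
    """Return the k-mers that are followed by more than one distinct base.

    Such k-mers are points where an assembler relying on k-length
    overlaps cannot tell which way to extend a contig.
    """
    successors = defaultdict(set)
    for i in range(len(genome) - k):
        successors[genome[i:i + k]].add(genome[i + k])
    return {kmer: bases for kmer, bases in successors.items() if len(bases) > 1}


# Hypothetical sequence: flank + repeat + spacer + repeat + flank.
repeat = "GGATTCCA"
genome = "ACGTAC" + repeat + "TGCAAG" + repeat + "CTAGTT"

# Words shorter than the repeat hit an ambiguous extension at its end.
print(branching_kmers(genome, 5))
# Words longer than the repeat span it together with unique flanking sequence.
print(branching_kmers(genome, 9))
```

When the word is shorter than the repeat, the last word inside the repeat has two possible continuations and the extension is ambiguous; once the word is longer than the repeat, every word includes unique flanking sequence and no ambiguity remains. The same logic, scaled to repeats of 10 kb or more, explains why short reads leave such regions unresolved.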

#### **3.4. Examples of plant genome assemblies**

According to Michael and Van Buren [45], over 100 plant genomes have been sequenced since 2000, of which 63% are genomes from various crop species. As indicated above, different Sanger sequencing strategies have been applied with varying degrees of success to several plant genomes. However, the most successful Sanger-based genome assemblies have been obtained from relatively small genomes (Arabidopsis, rice), while *de novo* assemblies for larger and more complex genomes, such as maize, remain partial and unfinished (manual improvements of the maize genome were limited to nonrepetitive regions only). In addition, due to the high costs and labor associated with such approaches, and the need for (in most cases) an international consortium to complete such projects, a vast majority of the most recent genomes have been sequenced using either a hybrid approach, complementing Sanger sequencing with NGS data, or using NGS data alone, from various NGS platforms. Such platforms include Illumina, 454/Roche, and more recently, Pacific Biosciences.

The domesticated tomato genome [60] represents an example of a Sanger/NGS hybrid genome assembly. A total of 30,800 BAC clones from three different BAC libraries were shotgun-sequenced and end-sequenced, generating a total of 3.3 Gb of Sanger reads. In addition, 454/Roche shotgun and mate-pair sequencing was performed, both on BAC pools and whole-genome DNA preparations, using different insert sizes and generating a total of 21 Gb of NGS data. The *de novo* assembly of Sanger and 454 data was performed using the Newbler assembly software [61] and other sequence assembly and alignment tools. Further scaffolding and polishing of the assembly were performed by integrating BAC end-sequence data and additional high-coverage Illumina and ABI/SOLiD data. Taken together, the *de novo* assembly resulted in 3,761 scaffolds totalling 782 Mb, with 95% of the assembled scaffold sequences present in 225 scaffolds. The predicted tomato genome size is approximately 900 Mb. The correctness and integrity of the assembly were validated through different means, including the alignment of clone end-sequences, publicly available tomato EST sequences, and BAC contigs from a sequence-based physical BAC map. Interestingly, comparison of the tomato, potato and grape genomes supported the existence of two successive whole-genome triplication events in common ancestors that added new gene family members that mediate important fruit functions, such as enzymes involved in ethylene biosynthesis (examples of whole-genome duplication or triplication events abound among plant genomes that have been sequenced to date).

Because of the relatively low costs involved, a large number of plant genomes have been sequenced and assembled using NGS technologies alone. This includes the assembly of the complex tetraploid genome of cultivated cotton (*Gossypium arboreum*) [62]. The tetraploid cultivated cotton genome has a size of approximately 1.7 Gb and is thought to have appeared 1–2 million years ago through interspecific hybridization between diploid A (*Gossypium arboreum*) and D (*Gossypium raimondii*) subgenome progenitors. A total of 371.5 Gb of shotgun Illumina data was generated with various insert sizes ranging from 180 bp to 40 kb and complemented with 33,454 BAC end sequences. The assembly was performed with SOAPdenovo [63], which resulted in 40,381 contigs, anchored and oriented in 7,914 scaffolds, ranging in length from 140 kb to 5.9 Mb, with 90% of the contigs included in 3,740 scaffolds.
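Taking the figures quoted above at face value, the raw sequencing depth of this project can be estimated as the total amount of sequence generated divided by the genome size. The short Python sketch below performs that arithmetic; the function name is an assumption, and the result is a raw depth before any read filtering, which the text does not describe.

```python
def sequencing_depth(total_bases_gb, genome_size_gb):
    """Average raw depth: total sequenced bases divided by genome size.

    Both arguments are in gigabases, so the ratio is dimensionless
    (conventionally written with an 'x' suffix).
    """
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


# 371.5 Gb of Illumina data over an approximately 1.7-Gb genome.
cotton_depth = sequencing_depth(371.5, 1.7)
print(f"{cotton_depth:.1f}x")
```

The resulting depth of roughly 218× illustrates how NGS-only projects rely on very deep coverage across libraries of several insert sizes, in contrast to the approximately 10-fold coverage used per clone in the Sanger clone-by-clone projects described earlier.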

An example of a smaller, relatively less complex genome assembly is that of the crop species *Brassica rapa* [64]. An estimated 72× sequencing coverage of the genome was generated, corresponding to Illumina shotgun paired-end data from NGS libraries with insert sizes ranging from 200 bp to 10 kb, and assembled using SOAPdenovo [63]. The resulting assembly was made of 14,207 contigs larger than 2 kb, further assembled into 794 scaffolds, totalling approximately 283.8 Mb and estimated to cover more than 98% of the gene space, based on alignments of 214,425 *B. rapa* public EST sequences and 52,712 unigenes from the BrGP database [65]. Further assessment of the integrity of the assembly was performed by aligning BAC clone Sanger sequences reported in previous studies.

While a large number of genomes have been sequenced with NGS technologies alone, the relatively short reads of the major NGS platforms used in those assembly projects, combined with the general complexity of most of those genomes, generally require the use of alternative methods to facilitate the assembly or confirm its integrity. These methods rely on the use of various types of NGS libraries, such as large-insert mate-pair libraries, or the use of Sanger-derived sequencing data such as EST or BAC-based shotgun reads. However, scaffolding of NGS contigs, which uses pairing information between NGS reads originating from the same DNA fragment, generally leaves unresolved gaps between contigs, often due to the presence of large repeat regions whose size exceeds the length and resolution of short NGS reads. As a result, significant portions of any given scaffold contain large stretches of unknown sequence, of unknown length. To address these issues and improve plant genome assemblies, researchers have developed a series of multifaceted solutions, combining alignment to known public data, such as ESTs or BAC ends, or, when available, reference genomes from related species, integration of physical and genetic map data, or new technologies. Some of these approaches are described in the next section.

#### **4. Complementary approaches to** *de novo* **assembly of plant genomes**

#### **4.1. Long-read assembly**

NGS assembly strategies based on short reads cannot resolve the long, identical transposable elements that are abundant in most plant genomes. The use of long reads is expected to address some of those shortcomings and improve the overall quality of *de novo* assembly by ordering contigs, closing gaps, and improving scaffolding. As a consequence, researchers have started to adopt the single-molecule long-read sequencing technology from Pacific Biosciences in plant genome assembly projects. Spinach, a diploid species with a genome size estimated at 989 Mb, is an example of such an effort. Van Deynze *et al*. [66] sequenced and assembled the spinach genome using large-fragment libraries of Pacific Biosciences sequence reads. They generated 60× coverage of the genome, with 20% of the reads larger than 20 kb. Data were assembled using PacBio's Hierarchical Genome Assembly Process (HGAP) [67], and the resulting long-read assembly exhibited a 63-fold improvement in contig size over an Illumina-only assembly derived from multiple Illumina libraries.
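The coverage figure quoted above relates sequencing yield to genome size in a simple way. As a rough sketch (the function name is illustrative, and the calculation ignores read filtering and uneven coverage), the yield implied by 60× coverage of a 989 Mb genome can be estimated as follows:

```python
def bases_needed(coverage: float, genome_size_bp: float) -> float:
    """Total sequenced bases implied by a mean coverage depth."""
    return coverage * genome_size_bp


# Figures taken from the spinach example: 60x coverage, genome estimated at 989 Mb.
total_bp = bases_needed(60, 989e6)
print(f"{total_bp / 1e9:.1f} Gb of long-read data")
```

This is only the mean relationship between depth and yield; how that yield is distributed across read lengths is a separate question.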

A distinct approach to long-read assembly, namely the Illumina TruSeq Synthetic Long-Read (SLR) strategy [68], is also expected to improve the quality of assemblies generated with short reads only. In SLR libraries, genomic DNA is fragmented to ~10 kb, and individual indexed Illumina libraries are generated in parallel from highly diluted pools of sheared DNA fragments. After Illumina sequencing and data deconvolution, the original ~10 kb fragments can be reassembled, effectively reducing the complexity of the assembly and generating very high-quality synthetic long reads that can subsequently be assembled together or used for haplotype resolution.
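The deconvolution step described above amounts to sorting reads back into the diluted pool they came from before any assembly is attempted. A minimal sketch follows, assuming each read is already paired with the index of its library; the index names and sequences are hypothetical toy data, and the per-pool assembly itself is not shown.

```python
from collections import defaultdict


def deconvolute_by_index(reads):
    """Group (index, sequence) pairs by library index.

    Each resulting group holds the short reads of one diluted pool, which
    would then be assembled separately into synthetic long reads.
    """
    pools = defaultdict(list)
    for index, seq in reads:
        pools[index].append(seq)
    return dict(pools)


# Hypothetical toy input: two indexed libraries.
reads = [("IDX01", "ACGT"), ("IDX02", "TTGA"), ("IDX01", "CGTA")]
pools = deconvolute_by_index(reads)
```

Because each pool contains only a small number of ~10 kb fragments, repeats that are ambiguous genome-wide are far less likely to collide within a single pool.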

The use of long reads in *de novo* assembly is bound to become more prevalent in the near future, reducing the number of scaffolds while increasing their average length. The use of PacBio reads in smaller genomes, such as microbial genomes, has already demonstrated that assemblies often yield contigs corresponding to individual chromosomes or plasmids present in the microbial cells. Likewise, it is likely that future plant studies will include such long reads, either alone or in combination with short-read NGS data, to improve assembly and coverage in questionable regions and to confirm the integrity of the assembly, much as Sanger data are used with current NGS assemblies.

#### **4.2. Genetic anchoring**

The emergence of NGS technologies rapidly led researchers to develop methods and assays for variant discovery in various plant genomes. Some studies have shown that single nucleotide polymorphisms (SNPs) can be discovered in parental inbred lines using next-generation sequencing [69]. Entire mapping populations have also been simultaneously sequenced and genotyped, in a process known as "genotyping-by-sequencing" (GBS) [70, 71], yielding extensive lists of markers segregating within the mapped population [72, 73]; these lists can be completed by using known reference maps or sequences to impute missing marker data from individual haplotypes. Various reduced-representation methods have been employed for NGS-derived SNP discovery in plant species where whole-genome shotgun sequencing remains too expensive for more than a few individuals [71]. These methods include restriction enzyme digestion-based assays with methyl-sensitive restriction endonucleases [74, 75] and sequence capture approaches [76], which sequence and map gene-rich portions of a genome and allow SNPs to be anchored in a relatively unambiguous manner.

More recently, ultradense linkage maps have been created by whole-genome sequencing of a genetic mapping population, and these maps have been used to anchor and order whole-genome sequencing contigs [77]. The approach uses a genetic linkage map as a framework: SNPs derived from the whole-genome sequencing assembly are integrated into a genetic framework built from low-coverage whole-genome sequencing data of a segregating population. The genetic positions of the sequence-derived SNPs can then be used to assign chromosomal locations to the contigs harboring them. This strategy was applied to a whole-genome assembly project in barley [78]. SNPs discovered by sequencing individuals from two mapping populations at low coverage (~1×) were placed into genetic maps that had been previously constructed through different means, including SNP array data and GBS, or that were built from the whole-genome shotgun sequencing data of the population. Their genetic positions were then used to assign chromosomal locations to approximately two-thirds of all whole-genome shotgun sequencing contigs and to integrate those contigs into the combined physical and genetic genome framework. While highly effective in plants, where mapping populations are often readily available, such an approach is limited by the overall recombination landscape and the resulting relationship between physical and genetic distance within a particular region of the genome [76]. Recombination events in plants often occur in distal regions of the chromosomes, and peri-centromeric regions may require very large mapping populations to improve their resolution. In addition, recent studies have suggested that structural features of the genome that differ between the two parents used to generate the mapping population, such as chromosomal inversions, translocations and duplications, may lead to errors and potentially confound genome assemblies.
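The anchoring logic described above, in which contigs inherit the map position of the SNPs they carry, can be illustrated with a small sketch. It assumes each SNP is given as a (contig, chromosome, position in cM) record; the majority-vote rule, the `min_agreement` threshold and the toy data are illustrative assumptions rather than the procedure of the cited studies.

```python
from collections import Counter, defaultdict
from statistics import median


def anchor_contigs(snps, min_agreement=0.6):
    """Assign each contig a chromosome and a genetic position.

    snps: iterable of (contig, chromosome, position_cM) records.
    A contig is anchored only if at least `min_agreement` of its SNPs
    map to the same chromosome; its position is the median cM there.
    """
    by_contig = defaultdict(list)
    for contig, chrom, cm in snps:
        by_contig[contig].append((chrom, cm))

    anchored = {}
    for contig, hits in by_contig.items():
        chrom, votes = Counter(c for c, _ in hits).most_common(1)[0]
        if votes / len(hits) >= min_agreement:
            anchored[contig] = (chrom, median(cm for c, cm in hits if c == chrom))
    return anchored


# Hypothetical toy data: ctg1 carries one SNP that maps to a conflicting chromosome.
snps = [
    ("ctg1", "1H", 10.0), ("ctg1", "1H", 12.0), ("ctg1", "2H", 50.0),
    ("ctg2", "1H", 3.5),
    ("ctg3", "2H", 40.0),
]
anchored = anchor_contigs(snps)
# Order contigs by chromosome, then by genetic position.
ordered = sorted(anchored, key=lambda c: anchored[c])
```

Contigs that share the same genetic position, as is common in low-recombination peri-centromeric regions, cannot be ordered relative to one another by this kind of rule, which reflects the limitation noted above.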

#### **4.3. BAC pool sequencing in gene-rich regions**

A large number of genome assemblies have been generated with the help of physical maps and a BAC-by-BAC sequencing approach. While laborious and costly, this approach remains relevant because it offers multiple advantages over whole-genome sequencing. In particular, reads from sequences that are conserved at multiple locations across the genome, and are therefore ambiguous in a whole-genome assembly, map exclusively to a defined portion of the genome when they are assembled within an individual clone. Lonardi *et al*. [80] proposed a modified version of clone sequencing that takes advantage of the massive sequencing capacity offered by NGS platforms. In that study, subsets of overlapping genome-tiling BAC clones were selected and pooled according to a multidimensional grid design, and each pool was sequenced on an Illumina HiSeq2000 instrument. The resulting paired-end reads were deconvoluted by determining, for each read, the intersection between the pool it originated from and the individual BAC clone(s) within that pool covering the corresponding portion of the genome, based on physical map information. Once deconvolution is achieved, reads can be assembled using an NGS assembler (Velvet) [81] to recreate the sequence of the original BAC clone. The approach was successfully tested on barley BAC clones selected on the basis of BAC-unigene associations described in that same study, suggesting that BAC pool sequencing can be used in conjunction with existing physical maps to complement or correct whole-genome sequencing assemblies, and that it is likely to yield higher-quality contig assemblies in gene-rich regions of complex plant genomes.
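The idea of identifying a clone from the pools in which its sequence appears can be shown with a deliberately simplified two-dimensional grid. This is only an illustration of the principle: the pooling design of the cited study is multidimensional and also uses physical map information, and the function names, pool labels and clone names below are hypothetical.

```python
def build_grid(bacs, n_cols):
    """Place BACs on a row/column grid.

    Returns {bac: frozenset({row_pool, col_pool})}, the pair of pools
    in which each clone is present.
    """
    signatures = {}
    for i, bac in enumerate(bacs):
        row, col = divmod(i, n_cols)
        signatures[bac] = frozenset({f"R{row}", f"C{col}"})
    return signatures


def assign_read(pools_seen, signatures):
    """Return the BACs whose pools are all among those where the read was observed."""
    seen = set(pools_seen)
    return sorted(bac for bac, sig in signatures.items() if sig <= seen)


# Hypothetical 2 x 2 grid of four clones.
signatures = build_grid(["BAC_A", "BAC_B", "BAC_C", "BAC_D"], n_cols=2)
unique_hit = assign_read({"R0", "C1"}, signatures)
shared_hit = assign_read({"R0", "R1", "C0"}, signatures)
```

A sequence seen in one row pool and one column pool points to a single clone, whereas a sequence seen in additional pools is consistent with several clones, as expected for the overlapping portions of a tiling path.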

#### **4.4. Optical mapping**

Optical mapping is a single-molecule approach that produces fingerprints using ordered restriction maps [82] or specific nick sites [83]. After enzymatic treatment and subsequent incorporation of fluorescent labels, the DNA molecules are stretched on a glass surface or in a nanochannel array and directly imaged to locate regions corresponding to the restriction sites or nick sites within the molecule. Distances between those sites are then inferred to produce an optical map of the DNA molecule. Two commercial platforms currently are available, namely, the Opgen Argus [84] and the BioNano Genomics Irys [85] systems. Using such techniques, very large DNA molecules, in the Mb range, can be interrogated for the presence and location of short recognition sites (whose sequence will vary with the enzyme being used to treat the DNA). Consensus optical maps then can be created by determining the overlap, under highly redundant conditions, between optical maps of single DNA molecules. Such consensus maps have to take into account the possibility of errors inherent to this type of technology, including star activity and false enzyme cuts, or the possibility of chimeric maps when joining, for example, optically mapped molecules containing paralogous genomic regions.
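To compare an optical map with sequence data, the same recognition sites can be located computationally in a contig, giving the ordered fragment lengths expected from that sequence. The sketch below is a minimal illustration under stated assumptions: it searches one strand only, treats the start of the recognition site as the cut position, and uses a toy sequence with the EcoRI site GAATTC; the function name is hypothetical.

```python
def in_silico_map(sequence, site):
    """Return the ordered fragment lengths between occurrences of a recognition site.

    Simplification: the cut is placed at the first base of each site,
    and only the given strand is searched.
    """
    seq = sequence.upper()
    site = site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    bounds = [0] + positions + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# Toy contig with two GAATTC sites.
fragments = in_silico_map("AAGAATTCTTTTGAATTCGG", "GAATTC")
```

An ordered list of this kind is what would be aligned against the measured distances between sites, with tolerance for the sizing and cutting errors mentioned above.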

Optical maps can be used for multiple applications, including comparative genomics and structural variation detection, as well as the development of optical map-guided genome assemblies, where the optical map is aligned and compared to *in silico* digested contig sequences. Optical map-guided genome assemblies can assist in building high-quality genome assemblies by providing evidence of the ordering of adjacent contigs and scaffolds, or by assessing the overall sequence accuracy of contigs and suggesting potential errors in an assembly, such as inversions, translocations or chimeric contig or scaffold sequences. The addition of optical maps to a genome assembly often results in a significant increase in the scaffold N50 value. For example, Hastie *et al*. [86] used the mapping of tiling BAC clones in a 2.1 Mb highly repetitive region of *Aegilops tauschii* (the D-genome donor of hexaploid wheat) to correct several misassemblies and improve the assembly from 75% to 95% complete. In another study [87], a high-resolution optical map, spanning 91% of the maize genome, was built, and used to characterize gaps within contigs, the maize genetic-physical (FPC) map and the reference pseudomolecules. Results also suggested that the placement of 12 FPC contigs on the maize genetic-physical map required re-evaluation.
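Since the scaffold N50 is the statistic quoted for such improvements, a short sketch of its calculation may be useful. N50 is taken here as the length of the shortest sequence in the smallest set of longest sequences that together contain at least half of the assembled bases; the scaffold lengths are hypothetical toy values.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy example: joining scaffolds with map evidence raises N50
# without changing the total assembled length.
before = [100, 80, 60, 40, 20]
after = [100 + 80, 60 + 40, 20]
n50_before = n50(before)
n50_after = n50(after)
```

The example shows why N50 rises when adjacent scaffolds are joined even though no new sequence is added, so the statistic should be read alongside measures of assembly correctness.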

#### **4.5. Long-range Hi-C interactions**

High-throughput chromosome conformation capture (Hi-C) is a method that uses cross-linking of DNA-binding proteins to DNA, followed by restriction digestion and self-ligation of protein-bound DNA fragments, to probe genome-wide three-dimensional chromatin interactions between chromosomal regions bound to the same proteins (such as enhancer and promoter regions) [88]. Because chromosomes occupy distinct three-dimensional spaces within the nucleus, interacting regions are statistically more likely to lie on the same chromosome than on different chromosomes. As a result, the vast majority of Hi-C read pairs, in which the two reads may be millions of bases apart on the same chromosome, can be used to determine which contigs belong together on the same chromosome, based on the Hi-C read pairs they share.

Burton *et al*. [89] evaluated the use of Hi-C datasets for long-range scaffolding of *de novo* whole-genome assemblies. The approach works in three steps: Hi-C reads are aligned to the contig sequences of the *de novo* assembly and the contigs are grouped by chromosome; contigs are ordered within each chromosome group using the higher Hi-C interaction densities expected between closely located contigs; and the ordered contigs are oriented using the location and orientation of Hi-C reads within each contig. Tested on existing human and mouse contig datasets generated from next-generation shotgun and mate-pair sequencing reads, the approach showed that the vast majority of contigs could be grouped (98.2% and 98% of all sequences in human and mouse, respectively) and ordered (94.4% and 86.7% of all grouped sequences in human and mouse, respectively) within individual chromosomes when combined with Hi-C sequencing reads. Similar studies, in which Hi-C datasets were used to complement *de novo* assemblies generated with next-generation sequencing reads, have been performed in human and mouse by Kaplan and Dekker [90] and Selvaraj *et al*. [91].
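The first of these steps, grouping contigs that share many Hi-C read pairs, can be sketched as follows. This is a simplified stand-in rather than the clustering used in the cited work: it counts read pairs linking two contigs and merges contigs connected by at least `min_links` pairs, without normalizing for contig length. The function name, threshold and toy data are assumptions.

```python
from collections import Counter


def group_contigs(read_pairs, min_links=2):
    """Group contigs linked by at least `min_links` Hi-C read pairs.

    read_pairs: iterable of (contig_of_read1, contig_of_read2).
    Returns a sorted list of sorted contig groups.
    """
    links = Counter()
    contigs = set()
    for a, b in read_pairs:
        contigs.update((a, b))
        if a != b:
            links[frozenset((a, b))] += 1

    # Union-find over contigs, joining well-supported links.
    parent = {c: c for c in contigs}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for pair, count in links.items():
        if count >= min_links:
            a, b = pair
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


# Toy data: strong links within two groups, one weak link between them.
pairs = ([("c1", "c2")] * 5 + [("c2", "c3")] * 3
         + [("c4", "c5")] * 4 + [("c3", "c4")] * 1)
groups = group_contigs(pairs)
```

The single weak link between c3 and c4 falls below the threshold, so the two groups stay separate, mirroring the expectation that contacts between chromosomes are much rarer than contacts within a chromosome.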

#### **4.6. Long-range scaffolding**

Two companies, namely, 10X Genomics [92] (Pleasanton, CA) and Dovetail Genomics [93] (Santa Cruz, CA), recently presented new ways to assemble short reads delivered by the Illumina technology. The GemCode instrument from 10X Genomics is a microfluidic device used to partition very long DNA molecules (typically 50 kb or more) into oil-based droplets and to prepare Illumina-compatible libraries in combination with "gel beads", each containing a unique 14-bp indexing barcode. Once sequencing is performed, in-house software deconvolutes the barcodes and determines where each sequenced subfragment originated on the original long DNA molecule. In contrast to the 10X Genomics method, the Dovetail Genomics approach does not necessarily require an instrument but requires a larger amount of starting material for sample preparation. Dovetail's approach essentially works by making a Hi-C library *in vitro* from chromatin-free purified DNA, thus recreating intramolecular interactions while reducing intermolecular ones. The resulting fragments can then be selected for mate-pair sets capturing long-range intramolecular interactions for genome scaffolding. While these strategies and technologies have not yet been applied to plant genome assemblies, they could potentially assist in grouping and ordering contigs and scaffolds from gene-rich regions of diploid plant genomes.

#### **5. Conclusion**

Reference genomes are now available for a significant number of plant species. The emergence of NGS technologies has made it possible to sequence genomes not only from economically important crop species but also from nonstandard model and special plants whose genomes otherwise might not have been sequenced, given the large funds, instrumentation and personnel that were required in the pre-NGS days. While great progress has been made, assembling such genomes remains challenging because of their inherent complexity and the relative absence of long-range connectivity, which is lost during DNA fragmentation and short-read sequencing. As a result, plant genome assemblies tend to be highly fragmented and focused essentially on unique "gene-rich" regions, while large fractions of the genomes, namely, complex repeat and conserved regions, remain unassembled. Researchers have come up with creative ways to address those shortcomings, including the use of mate-pair NGS libraries, the complementation of physical assemblies with genetic maps, and the use of new technologies for sequencing, physical mapping or scaffolding. It is hoped that the routine use of such novel approaches will help elucidate the biological aspects of genomes by allowing true comparative and structural analysis between species, strains, tissues or environments.

#### **Acknowledgements**

The authors would like to acknowledge Gregory May for his support and contribution to this book chapter.

#### **Author details**

Stéphane Deschamps\* and Victor Llaca

\*Address all correspondence to: stephane.deschamps@cgr.dupont.com

DuPont Pioneer, Wilmington, Delaware, USA

#### **References**


[12] Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998;8:175–185. DOI: 10.1101/gr.8.3.175

[13] Karger BL, Guttman A. DNA sequencing by CE. Electrophoresis. 2009;30:S196–202. DOI: 10.1002/elps.200900218

[14] Llaca V. Sequencing technologies and their use in plant biotechnology and breeding. In: Dr. Anjana Munshi, editor. DNA Sequencing - Methods and Applications. 2012. p. 35–60. DOI: 10.5772/37918

[15] Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, *et al*. Whole-genome random sequencing and assembly of *Haemophilus influenzae* Rd. Science. 1995;269:496–512. DOI: 10.1126/science.7542800

[16] Himmelreich R, Hilbert H, Plagens H, Pirkl E, Li BC, Herrmann R. Complete sequence analysis of the genome of the bacterium *Mycoplasma pneumoniae*. Nucleic Acids Res. 1996;24:4420–4449. DOI: 10.1093/nar/24.22.4420

[17] Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, *et al*. Life with 6000 genes. Science. 1996;274:563–567. DOI: 10.1126/science.274.5287.546

[18] *C. elegans* Sequencing Consortium. Genome sequence of the nematode *C. elegans*: a platform for investigating biology. Science. 1998;282:2012–2018. DOI: 10.1126/science.282.5396.2012

[19] Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, *et al*. The genome sequence of *Drosophila melanogaster*. Science. 2000;287:2185–2195. DOI: 10.1126/science.287.5461.2185

[20] International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. DOI: 10.1038/35057062

[21] Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, *et al*. The sequence of the human genome. Science. 2001;291:1304–1351. DOI: 10.1126/science.1058040

[22] Staden R. A strategy of DNA sequencing employing computer programs. Nucleic Acids Res. 1979;6:2601–2610. DOI: 10.1093/nar/6.7.2601

[23] Gordon D, Abajian C, Green P. Consed: a graphical tool for sequence finishing. Genome Res. 1998;8:195–202. DOI: 10.1101/gr.8.3.195

[24] Soderlund C, Longden I, Mott R. FPC: a system for building contigs from restriction fingerprinted clones. Comput. Appl. Biosci. 1997;13:523–535. DOI: 10.1093/bioinformatics/13.5.523

[25] Ding Y, Johnson MD, Chen WQ, Wong D, Chen YJ, Benson SC, *et al*. Five-color-based high-information-content fingerprinting of bacterial artificial chromosome clones using type IIS restriction endonucleases. Genomics. 2001;74:142–154. DOI: 10.1006/geno.2001.6547

[26] Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant *Arabidopsis thaliana*. Nature. 2000;408:796–815. DOI: 10.1038/35048692


[40] Ansorge WJ. Next-generation DNA sequencing techniques. N. Biotechnol. 2009;25:195–203. DOI: 10.1016/j.nbt.2008.12.009

[41] Niedringhaus TP, Milanova D, Kerby MB, Snyder MP, Barron AE. Landscape of next-generation sequencing technologies. Anal. Chem. 2011;83:4327–4341. DOI: 10.1021/ac2010857

[42] Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, *et al*. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–59. DOI: 10.1038/nature07517

[43] Fedurco M, Romieu A, Williams S, Lawrence I, Turcatti G. BTA, a novel reagent for DNA attachment on glass and efficient generation of solid-phase amplified DNA colonies. Nucleic Acids Res. 2006;34:e22. DOI: 10.1093/nar/gnj023

[44] Erlich Y, Mitra PP, de la Bastide M, McCombie WR, Hannon GJ. Alta-cyclic: a self-optimizing base caller for next-generation sequencing. Nat. Methods. 2008;5:679–682. DOI: 10.1038/nmeth.1230

[45] Michael TP, Van Buren R. Progress, challenges and the future of crop genomes. Curr. Opin. Plant Biol. 2015;24:71–81. DOI: 10.1016/j.pbi.2015.02.002

[46] Schadt EE, Turner S, Kasarskis A. A window into third-generation sequencing. Hum. Mol. Genet. 2010;19:R227–240. DOI: 10.1093/hmg/ddq416

[47] Feng Y, Zhang Y, Ying C, Wang D, Du C. Nanopore-based fourth-generation DNA sequencing technology. Genomics Proteomics Bioinformatics. 2015;13:4–16. DOI: 10.1016/j.gpb.2015.01.009

[48] Anton BP, Mongodin EF, Agrawal S, Fomenkov A, Byrd DR, Roberts RJ, *et al*. Complete genome sequence of ER2796, a DNA methyltransferase-deficient strain of *Escherichia coli* K-12. PLoS One. 2015;10:e0127446. DOI: 10.1371/journal.pone.0127446

[49] Loman NJ, Quick J, Simpson JT. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat. Methods. 2015;12:733–735. DOI: 10.1038/nmeth.3444

[50] Plant and animal whole genome sequencing [Internet]. 2015. Available from: http://www.pacb.com/applications/whole-genome-sequencing/plant-animal/ [Accessed 2015-10-14]

[51] DNA: nanopore sequencing [Internet]. 2015. Available from: https://nanoporetech.com/applications/dna-nanopore-sequencing [Accessed 2015-10-14]

[52] Bennetzen JL, Ma J, Devos KM. Mechanisms of recent genome size variation in flowering plants. Ann. Bot. 2005;95:127–132. DOI: 10.1093/aob/mci008

[53] Rostoks N, Park YJ, Ramakrishna W, Ma J, Druka A, Shiloff BA, *et al*. Genomic sequencing reveals gene content, genomic organization, and recombination relationships in barley. Funct. Integr. Genomics. 2002;2:51–59. DOI: 10.1007/s10142-002-0055-5

[54] Claros MG, Bautista R, Guererro-Fernandez D, Benzerki H, Seoane P, Fernandez-Pozo N. Why assembling plant genome sequences is so challenging. Biology (Basel). 2012;1:439–459. DOI: 10.3390/biology1020439


[67] Chin CS, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, *et al*. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods. 2013;10:563–569. DOI: 10.1038/nmeth.2474

[68] McCoy RC, Taylor RW, Blauwkamp TA, Kelley JL, Kertesz M, Pushkarev D, *et al*. Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly-repetitive transposable elements. PLoS One. 2014;9:e106689. DOI: 10.1371/journal.pone.0106689

[69] Deschamps S, Campbell M. Utilization of next-generation sequencing platforms in plant genomics and genetic variant discovery. Mol. Breed. 2010;25:553–570. DOI: 10.1007/s11032-009-9357-9

[70] Elshire RJ, Glaubitz JC, Suri Q, Poland JA, Kawamoto K, Buckler ES, *et al*. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One. 2011;6:e19379. DOI: 10.1371/journal.pone.0019379

[71] Deschamps S, Llaca V, May GD. Genotyping-by-sequencing in plants. Biology (Basel). 2012;1:460–483. DOI: 10.3390/biology1030460

[72] Huang X, Feng Q, Qian Q, Zhao Q, Wang L, Wang A, *et al*. High-throughput genotyping by whole-genome resequencing. Genome Res. 2009;19:1068–1076. DOI: 10.1101/gr.089516.108

[73] Trick M, Adamski NM, Mugford SG, Jiang CC, Febrer M, Uauy C. Combining SNP discovery from next-generation sequencing data with bulked segregant analysis (BSA) to fine-map genes in polyploid wheat. BMC Plant Biol. 2012;12. DOI: 10.1186/1471-2229-12-14

[74] Fellers JP. Genome filtering using methylation-sensitive restriction enzymes with six base pair recognition sites. The Plant Genome. 2008;1:146–152. DOI: 10.3835/plantgenome2008.05.0245

[75] Gore MA, Wright MH, Ersoz ES, Bouffard P, Szekeres ES, Jarvie TP, *et al*. Large-scale discovery of gene-enriched SNPs. The Plant Genome. 2009;2:121–133. DOI: 10.3835/plantgenome2009.01.0002

[76] Muraya MM, Schmutzer T, Ulpinnis C, Scholz U, Altmann T. Targeted sequencing reveals large-scale sequence polymorphism in maize candidate genes for biomass production and composition. PLoS One. 2015;10:e0132120. DOI: 10.1371/journal.pone.0132120

[77] Mascher M, Muehlbauer GJ, Rokhsar DS, Chapman J, Schmutz J, Barry K, *et al*. Anchoring and ordering NGS contig assemblies by population sequencing (POPSEQ). Plant J. 2013;76:718–727. DOI: 10.1111/tpj.12319

[78] Ariyadasa R, Mascher M, Nussbaumer T, Schulte D, Frenkel Z, Poursarebani N, *et al*. A sequence-ready physical map of barley anchored genetically by two million single-nucleotide polymorphisms. Plant Physiol. 2014;164:412–423. DOI: 10.1104/pp.113.228213

[79] Mascher M, Stein N. Genetic anchoring of whole-genome shotgun assemblies. Front. Genet. 2014;5:208. DOI: 10.3389/fgene.2014.00208

