nucleotides can be sequenced in parallel, yielding substantially more throughput and minimizing the need for the fragment-cloning methods that were used with Sanger sequencing [4]. Second-generation sequencing methods are characterized by the need to prepare amplified sequencing libraries before sequencing the amplified DNA clones, whereas third-generation single-molecule sequencing can be performed without the time-consuming and costly preparation of amplification libraries [5]. The parallelization of large numbers of sequencing reactions by NGS was achieved by miniaturizing the reactions and, in some cases, by developing microfluidics and improved detection systems [6]. The time needed to generate gigabase (Gb)-sized sequences by NGS was reduced from many years to only a few days or hours, with an accompanying massive price reduction. For example, as part of the Human Genome Project, the J. C. Venter genome [7] took almost 15 years to sequence at a cost of more than 1 million dollars using the Sanger method, whereas the J. D. Watson (1962 Nobel Prize winner) genome was sequenced by NGS using the 454 Genome Sequencer FLX at about the same 7.5× coverage within 2 months and for approximately one-hundredth of the price [8]. Sequencing a bacterial genome is now possible for about \$1000 (https://www.nanoporetech.com), and the large-scale whole-genome sequencing (WGS) of 2,636 Icelanders [9] has brought some of the aims of the 1000 Genomes Project [10] to abrupt fruition.

Rapid progress in NGS technology and the simultaneous development of bioinformatics tools have allowed both small and large research groups to generate *de novo* draft genome sequences for any organism of interest. Apart from WGS [11], these technologies can be used for whole-transcriptome shotgun sequencing (WTSS), also called RNA sequencing (RNA-seq) [12]; whole-exome sequencing (WES) [13]; targeted sequencing (TS) or candidate gene sequencing (CGS) [14–16]; and methylation sequencing (MeS) [17]. RNA-seq can be used to identify all transcriptional activities (coding and noncoding) or a select subset of targeted RNA transcripts within a given sample [12], and it provides a more precise and sensitive measurement of gene expression levels than microarrays in the analysis of many samples [18–21]. In contrast to WGS, WES provides coverage for more than 95% of human exons to investigate the protein-coding regions (CDS) of the genome and to identify coding variants or SNPs when WGS and WTSS are not practical or necessary [13]. Since the exome represents less than 2% of the human genome, it is a cost-effective alternative to WGS and RNA-seq in the study of human genetics and disease [13]. However, WGS may be preferred over WES because it provides more data with better uniformity of read coverage on disease-associated variants and reveals polymorphisms outside coding regions as well as genomic rearrangements [19, 22]. The analysis of the methylome by MeS complements WGS, WES, and CGS in determining active methylation sites and the epigenetic markers that regulate gene expression, epistructural base variations, imprinting, development, differentiation, disease, and the epigenetic state [23–30]. The impact of NGS technology is indeed egalitarian in that it allows both small and large research groups the possibility to provide answers and solutions to many different problems and questions in the fields of genetics and biology, including those in medicine, agriculture, forensic science, virology, microbiology, and marine and plant biology.

**2. First-generation sequencing: A brief history**

Twelve years after the publication of the Watson and Crick double-helix DNA structure in 1953 [31], the first natural polynucleotide sequence was reported [32]. It was the 77-nt yeast alanine tRNA with a proposed cloverleaf structure, although the anticodon (the three nucleotides that pair with the mRNA codon) was not yet identified in the sequence [32]. It took 7 years to prepare up to 1 g of the tRNA from commercial baker's yeast by countercurrent distribution before fragmenting the RNA into short oligonucleotides with various RNase enzymes and reconstructing and identifying the nucleotide residues using two-dimensional chromatography and spectrophotometric procedures [33]. At that time, scientists could sequence only a few base pairs per year, not nearly enough to sequence an entire gene. Nevertheless, despite the time-consuming and laborious nature of these very first sequencing methods, developed for tRNA and other oligonucleotides, a flurry of RNA and DNA sequencing over the next 10 years improved the procedures for sequencing fragmented DNA and provided new information on the sequences of more than 100 different tRNAs. These initial labor-intensive sequencing efforts also produced the first complete genome sequence, the 3,569-nucleotide-long RNA of bacteriophage MS2, as well as the lysozyme gene sequence of bacteriophage T4 DNA and the 24-bp lac operator sequence [33–36]. This eventually led to the Maxam and Gilbert chemical-degradation DNA sequencing method, which chemically cleaved specific bases of terminally labeled DNA fragments and separated them by electrophoresis [37]. New data on how to sequence bacteriophage DNA by specific primer-extension methods led Sanger et al. [1] to use primer-extension and chain-termination methods for sequencing polynucleotides longer than oligonucleotide lengths.
Subsequently, the new Sanger DNA chain-termination sequencing method [1], known simply as the Sanger sequencing method, prevailed over the Maxam and Gilbert chemical-degradation method [37] because of its greater simplicity and reliability and its use of fewer toxic chemicals and lower amounts of radioactivity. The first-generation automated DNA sequencers developed by Applied Biosystems (ABI) used the Sanger method with fluorescent dye-terminator reagents for single-reaction sequencing rather than the usual four separate reactions [34–36]. These sequencers were later improved by including computers to collect, store, and analyze the sequencing data [38]. The invention of PCR technology [39] and thermal cyclers, together with the use of heat-resistant enzymes such as Taq polymerase from *Thermus aquaticus*, between 1985 and 1990 enabled the generation of random or specific sequences for *de novo* sequencing, filling gaps, and resequencing particular regions of interest [35]. The discovery of reverse transcriptase in 1970 [40, 41] led to the development of RNA sequencing using cDNA reverse transcribed from RNA. In 1991, Adams et al. [42] initiated a systematic cDNA sequencing project using the Sanger method and the semiautomated 373A DNA sequencers to generate large batches of cDNA sequences with an average length of 397 bases, which they named "expressed sequence tags" (ESTs) and used as substrates and markers for RNA contig and transcriptome mapping. These improvements, together with the establishment of GenBank (http://www.ncbi.nlm.nih.gov/genbank) in 1982, resulted in the generation of hundreds of thousands of additional DNA sequences throughout the 1980s and 1990s [34–36] and right up to the beginning of the new millennium, with the publication of the first draft sequence of the human genome [43, 44].
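The chain-termination principle underlying all of this Sanger-era work can be illustrated with a small sketch. This is purely a toy model, not any historical implementation: each of the four ddNTP reactions is assumed to terminate a growing copy at every occurrence of its base, and sorting the resulting fragments by length, as electrophoresis does on a gel, reads the sequence back off their terminal bases.

```python
def sanger_read(template: str) -> str:
    """Toy model of dideoxy chain termination: each of the four ddNTP
    reactions yields fragments ending at every occurrence of its base;
    sorting all fragments by length (the gel) and reading the terminal
    base of each reconstructs the template sequence."""
    fragments = []
    for base in "ACGT":  # one termination reaction per base
        for pos, b in enumerate(template):
            if b == base:
                fragments.append(template[: pos + 1])  # terminated copy
    # Shortest fragment first, exactly as fragments migrate on a gel.
    return "".join(f[-1] for f in sorted(fragments, key=len))

print(sanger_read("GATTACA"))  # prints "GATTACA"
```

Every position of the template yields exactly one terminated fragment, so the length-ordered terminal bases spell out the original sequence.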

A sudden increase in the number of DNA and RNA sequences deposited in GenBank between 1992 and 2004 (http://www.ncbi.nlm.nih.gov/genbank/statistics) resulted mostly from three main initiatives: the development of automated sequencers and the emergence of service providers; the industrialization of sequencing and the establishment of sequencing centers and international consortia; and the continued development of computing hardware and software to store and analyze nucleotide sequences. The automated, industrialized approach based on random or shotgun sequencing was initiated by The Institute for Genomic Research (TIGR) in Rockville, Maryland, and resulted in the publication of 337 new human genes and 48 homologous genes from other organisms [42]. By 1999, the TIGR venture had generated 83 million nucleotides of cDNA sequence, 87,000 human cDNA sequences, and the complete genome sequences of two bacterial species, *Haemophilus influenzae* [45] and *Mycoplasma genitalium* [46]. This success was due in part to the development of the TIGR sequence assembler, an innovative computer program for assembling vast amounts of EST data [47]. By the end of 2001, automated sequencers such as the fully automated Prism 3700, with 96 capillaries that could produce 1.6 × 10⁵ bases of sequence data per day, together with sequencing centers and international consortia such as TIGR in the USA, the Sanger Centre in the United Kingdom, and RIKEN in Japan, had produced the complete genomic sequences of the bacteria *E. coli* and *Bacillus subtilis*, the yeast *Saccharomyces cerevisiae*, the nematode *C. elegans*, the fruit fly *Drosophila melanogaster*, the plant *Arabidopsis thaliana*, and the human genome (see references cited by Stein [48]). Although sequencing was still hugely expensive and time consuming, Sanger sequencing was by then the dominant method.
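The overlap-based idea behind shotgun assembly, which programs such as the TIGR assembler industrialized, can be sketched in miniature. The greedy toy below is entirely illustrative (it is not the TIGR algorithm, and it ignores sequencing errors, repeats, and reverse strands): it repeatedly merges the pair of reads with the longest suffix-prefix overlap until a single contig remains.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads: list[str]) -> str:
    """Greedy toy assembler: merge the best-overlapping pair of reads
    until one contig is left."""
    reads = reads[:]
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap length, i, j)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j and overlap(a, b) > best[0]:
                    best = (overlap(a, b), i, j)
        n, i, j = best
        merged = reads[i] + reads[j][n:]  # join, dropping the overlap
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads[0]

print(greedy_assemble(["AGCTTA", "CTTAGG", "TAGGAC"]))  # prints "AGCTTAGGAC"
```

Real assemblers of the era solved the same suffix-prefix overlap problem at vastly larger scale, with error-tolerant alignment and heuristics for repeats.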
Pundits now placed DNA sequencing in a postgenomic era and predicted functional genomics, SNPs, and transcript arrays as the future of biological investigation [49, 50]. Indeed, after the establishment of the first Affymetrix GeneChip microarrays in 1996, the following decade saw rapid growth in DNA array technology and its applications to gene expression studies in prokaryotes and eukaryotes [21, 51, 52]. Nevertheless, the output of genomic and RNA sequencing had neither finished nor slowed; new sequencing methods continued to emerge after 2005 to challenge the cost and supremacy of the Sanger dideoxy method [34–36]. These new methods became known as next-generation sequencing because they were designed to employ massively parallel strategies to produce large amounts of sequence from multiple samples at very high throughput and at a high degree of sequence coverage, compensating for the lower accuracy of individual reads compared with Sanger sequencing. These different approaches brought the cost of sequencing a human genome down from \$100 million in 2001 to less than \$10,000 in 2014 [53].
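The claim that deep coverage compensates for noisier individual reads can be made concrete with a toy calculation. Under two simplifying assumptions (per-base errors are independent across reads, and every error substitutes the same wrong base, which is pessimistic for a majority vote), the consensus error at a position falls steeply with read depth; the 1% per-read error and 15× depth below are illustrative numbers, not figures from the text.

```python
from math import comb

def consensus_error(per_read_error: float, depth: int) -> float:
    """Probability that a simple majority vote over `depth` independent
    reads calls the wrong base, assuming errors always agree on the same
    wrong base. Odd depth avoids ties."""
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error)**(depth - k)
        for k in range(depth // 2 + 1, depth + 1)  # majority of reads wrong
    )

# A single read with 1% error versus the majority-vote consensus at 15x:
print(f"single read error: {consensus_error(0.01, 1):.2e}")
print(f"15x consensus error: {consensus_error(0.01, 15):.2e}")
```

Even in this crude model, the consensus error at 15× is many orders of magnitude below the per-read error, which is why massively parallel methods could tolerate less accurate individual reads than Sanger sequencing.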
