Preface

*There is not a discovery in science, however revolutionary, however sparkling with insight that does not arise out of what went before.*

Isaac Asimov, *Adding a Dimension*.

This year, 2015, marks the 150th anniversary of the Moravian monk Gregor Mendel's seminal paper on pea-plant genetics. He discovered the statistical patterns of inheritance from one generation to the next and deduced that the basic units of inheritance are transmitted in pairs, one unit from each parent, which segregate and manifest in the offspring as dominant or recessive traits. Twenty-three years after Mendel's death in 1884, the English biologist William Bateson described the study of heredity as genetics, and a few years later the Danish botanist Wilhelm Johannsen named the gene as the physical and functional unit of inheritance. Soon after the rediscovery and understanding of Mendel's publication, his experimental observations became known in the biological sciences as Mendel's laws of heredity: (1) the law of segregation, (2) the law of independent assortment and (3) the law of dominance. Thus, by the early and mid-1900s, the science of genetics was well and truly born, to evolve, amplify and spread far beyond the pea plants and to influence all the life sciences for the next 100 years, all the way to the present age of next-generation genomics.

And what of the physical nature of the gene itself? It was only 62 years ago that Watson and Crick published their landmark paper 'Molecular Structure of Nucleic Acids' in *Nature* on 25 April 1953. Their opening paragraph was *'We wish to suggest a structure for the salt of deoxyribose nucleic acid (D.N.A.). This structure has novel features which are of considerable biological interest.'* They put forward their hypothesis of the DNA structure as a sequence of four nucleotides, as if these nucleotides were beads on two complementary helical strings or chains that coiled about each other in antiparallel and around the same axis. The bases on the inside of each helical strand always paired with those on the opposite strand in a complementary fashion, adenine with thymine and guanine with cytosine. Thus, *'the two chains are held together by the purine and pyrimidine bases'* by hydrogen bonds. Further, *'It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.'* The history of how Watson and Crick surreptitiously established their DNA model as a double helix by using the unpublished X-ray diffraction data of Rosalind Franklin is well established and has been variously dramatized in print, theatre, film and television. Two years prior to the Watson and Crick discovery, Erwin Chargaff and his colleagues had pointed out in a *Nature* paper that DNA was composed of equivalent amounts of nucleotides with A=T and G=C. Even earlier, in 1944, Oswald Avery and his colleagues, Colin MacLeod and Maclyn McCarty, had already identified DNA as the molecule of heredity in their published experimental work on bacterial transformation.
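The pairing rule described above can be made concrete with a short sketch. The Python below is an illustration only and not anything taken from the 1953 paper: the function names `reverse_complement` and `base_counts` and the example sequence are assumptions chosen for this example. It derives the antiparallel partner strand from a single strand and checks that, counted over both strands of the duplex, A=T and G=C.

```python
from collections import Counter

# Watson-Crick pairing: adenine with thymine, guanine with cytosine.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the antiparallel partner strand, written 5'->3'."""
    return "".join(PAIRS[base] for base in reversed(strand.upper()))


def base_counts(strand: str) -> Counter:
    """Count each base over both strands of the duplex."""
    return Counter(strand.upper() + reverse_complement(strand))


strand = "ATGGCGTAC"  # hypothetical example sequence
partner = reverse_complement(strand)
counts = base_counts(strand)

print(partner)
# Both comparisons hold for any input, because every A on one strand
# faces a T on the other, and every G faces a C.
print(counts["A"] == counts["T"], counts["G"] == counts["C"])
```

Because each strand fully determines its partner, either strand can serve as a template for rebuilding the other, which is the essence of the copying mechanism hinted at in the quotation.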

The impact of the Watson and Crick publication on the structure of DNA as a double helix was profound because it quickly led to the establishment of the Central Dogma: genetic information is transcribed from DNA to messenger RNA and translated to build a protein, and it cannot flow in the reverse direction from protein to RNA. The proof that triplets or codons of the DNA sequence code for the amino acids that are the building blocks of peptides and proteins soon followed in the late 1950s and throughout the 1960s with the works of Marshall Nirenberg, Heinrich Matthaei, Sydney Brenner, Seymour Benzer and others. Thus, DNA and RNA were strongly asserted to contain the genetic code, and the genetic information encoded within DNA was held to be universal to all forms of life. It was not until 1970 that Crick's Central Dogma was jolted by Howard Temin and David Baltimore and their colleagues with the discovery that the reverse transcriptase enzyme allows the flow of genetic information from RNA to DNA in RNA retroviruses such as human immunodeficiency virus and via certain cellular enzymes such as telomerase and a reverse transcriptase-like protein encoded by the RVT gene. Moreover, the sequencing of the human genome in 2001 revealed that at least half of our genome is made of fossils of past retrotransposon integrations, some of which have evolved to act as regulators and insulators in a complex regulatory process of transcription and translation. Only 2–3 % of our genome consists of loci that we call genes and that code for proteins. The rest of the genome, once referred to as 'junk DNA', appears to be regulatory, although it remains mostly *'dark matter'*, waiting to be fully deciphered. Nevertheless, Crick's fundamental insight that the sequence of the nucleotides in DNA is transcribed and translated into the synthesis of proteins via messenger RNA and amino acids carried by transfer RNA molecules with three-base anticodons remains the basis of modern genetics and genomics.
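As a minimal sketch of this flow of information from DNA through messenger RNA to protein, the Python below transcribes a coding-strand sequence and translates it codon by codon. The codon table is deliberately partial (only the codons used in the example, taken from the standard genetic code), and the names `transcribe` and `translate` and the example sequence are assumptions made for illustration.

```python
# Partial table from the standard genetic code; "*" marks a stop codon.
# Codons not listed here raise KeyError in this sketch.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "GCG": "A",  # alanine
    "UAC": "Y",  # tyrosine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop
}


def transcribe(coding_strand: str) -> str:
    """mRNA carries the coding-strand sequence with uracil in place of thymine."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read non-overlapping triplets until a stop codon or the end of the message."""
    peptide = []
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)


dna = "ATGGCGTACTGGTAA"  # hypothetical coding sequence
mrna = transcribe(dna)
print(mrna)             # AUGGCGUACUGGUAA
print(translate(mrna))  # MAYW
```

A full implementation would carry all 64 codons; the partial table keeps the sketch short while still showing the triplet reading frame and the role of the stop codon.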

The first published natural polynucleotide sequence was a yeast transfer RNA and not DNA. It was published in *Science* in 1965 by Robert W. Holley and his colleagues, 12 years after the Watson and Crick paper proposed the structure of DNA, and it had taken them many years to obtain a gram of RNA by complicated purification procedures so that they had sufficient amounts to identify its sequence by spectrophotometry and chromatography. A number of different laborious DNA and RNA sequencing procedures were developed during the late 1960s and early 1970s, mostly using two-dimensional chromatography and/or electrophoresis and hazardous chemicals and/or large doses of radioisotopes. The first genome to be sequenced, in 1976, was the viral RNA genome of the bacteriophage MS2, by Walter Fiers' group at the University of Ghent in Belgium, using an RNA sequencing technique based on RNA fragmentation and the separation of fragments by two-dimensional gel electrophoresis. The following year, Fred Sanger and his colleagues in the UK published the first DNA genome, the 5386 base-pair phage Phi-X174. In the same year, they also published two technical papers on a method for the rapid determination of DNA sequence that was safer, easier and more reliable than the other sequencing techniques such as the Maxam and Gilbert chemical method. The Sanger method was based on plasmid cloning of DNA fragments and DNA polymerase reactions using fluorescent-labelled or radiolabelled dideoxynucleotides to follow the sequencing reactions, which made it more easily amenable to automation. The coupling of the Sanger fluorescent DNA sequencing method with automated capillary electrophoresis led to the establishment of sequencing centres and factories with hundreds of DNA sequencing instruments operated in parallel by large numbers of personnel. This led to the domination of Sanger sequencing as the first-generation sequencing procedure and gold standard for 30 years after its inception, and it has had an enormous impact on our understanding of DNA and gene organization and of what happens at the genomic level in humans and various plants, animals and microorganisms. By 2001, a mosaic version of the whole human genome had been sequenced to 90 % completion, with announcements of the accomplishment by the 42nd US President Bill Clinton at the White House, universal fanfare and two major papers published in *Science* and *Nature* at about the same time. In addition to the human genome, sequencing groups had already published the full genomes of eukaryotic and prokaryotic viruses and bacterial plasmids, and partial genomes for a variety of different species.

Although the need for DNA sequencing and for generating sequences for analytical consumption was great during the era of Sanger sequencing, sequencing was still prohibitively expensive and far too slow for many laboratories to join the maturing field of genomics. This began to change dramatically by 2007 with the emergence of a number of different next-generation automated sequencing technologies, such as those developed by 454 Life Sciences, Solexa, Applied Biosystems and Helicos, that increased the number of sequencing reactions in miniaturized arrays, fibre-optic slides or flow cells and greatly reduced the cost of sequencing from millions to thousands of US dollars in only a few years. A single next-generation sequencing (NGS) run using any one of the new massively parallel sequencing platforms could generate more sequencing data than a hundred Sanger sequencing machines running simultaneously. Although the short read lengths, sequencing errors and the large volumes of data generated by NGS were at first seen as a problem with the technology, the lower costs, large capacity, high coverage of reasonably accurate sequencing information and the versatility of NGS for a wide range of applications soon began to win over the scientific community and funding agencies. The NGS market has grown exponentially over the last decade and promises to continue its enormous expansion with ongoing improvements and cost reductions provided by the manufacturers and service providers. It is envisaged that desktop sequencers for personal genomics, single investigators and small laboratory groups will be developed in the near future that will be no larger than portable hard drives stacked together and connected to laptop or desktop computers. However, with the sudden technical and economic ease of generating vast amounts of sequencing information comes the problem and burden of sequence data acquisition, storage, transmission and analysis. The bottleneck for genomics is no longer generating sequences; the holdup is now at the level of bioinformatics: storing, processing, analysing and interpreting the sequencing information.

Unsurprisingly, the developments in DNA sequencing technology progressed almost hand in hand with those in computing and information systems technology. When Apple released its first Macintosh personal computer, the Macintosh 128K, in 1984, the sequences available for analysis were from genes, genomic fragments, and genomes of plasmids and viruses. They were simple sequences that were analysed using primitive, pioneering computer software such as DNA Inspector, DNAStrider and, later, GeneJockey, MacVector, Sequencher and others. With the arrival of the World Wide Web on the open Internet and personal computer web browsers in the early 1990s, the sequences available for analysis were increasing in number and complexity, and they required more sophisticated algorithms, software and hardware with ever-increasing computation capacity and speed. The Human Genome Project was formally initiated in October 1990, and it required a 13-year international effort to complete the sequence of most of the 3 billion DNA nucleotides and annotate the estimated 20,000–24,000 human genes for further study. Today, NGS and genomics are seen much more as a science of Biological Information Systems and 'Big Data'. There has been an inundation of DNA and RNA sequences that heavily strains the parallel development of the computers needed to store and process the rapidly accumulating DNA and RNA data and to translate and manage the information systematically, efficiently and securely. For example, as of November 2014, half of the 7,597 prokaryote species in the NCBI RefSeq dataset were still uncharacterized, and there are estimates that zettabases (> 1 × 10²¹ bases) of sequence per year will need to be processed in a projected trillion-dollar industry by 2025, including the personal data of a million or more human genome sequences. Thus, much attention is now drawn towards solving the problems of computational greed and how to integrate, process, filter and secure astronomical amounts of genomic data.

What is next-generation sequencing (NGS)? In brief, NGS is a sequencing technology that is faster, cheaper and more versatile than the first-generation sequencing methods that preceded it, such as the Sanger sequencing method. NGS permits high-throughput sequencing of the whole genome (DNA-seq), exomes (exomic DNA-seq) or targeted genomic regions (targeted DNA-seq), genomic RNA or the transcriptome (RNA-seq), DNA methylation sites throughout the genome (Methyl-Seq), the genomic regions involved in protein–DNA interactions (ChIP-seq) and the three-dimensional genome structure (Hi-C) of any organism. However, NGS is not just a sequencing technology; it is also an information systems technology with enormous implications for humanity's future in various fields and aspects of life. It is the interrogation, collection and spread of biological information for our enlightenment and for the development of novel biological applications and innovations, both good and bad, in biotechnology, biodefense, the environment, ecosystems, agriculture, industry and human health.

Many of the advances, applications and challenges associated with NGS are dealt with comprehensively and insightfully in this book in the form of reviews and original studies by leading researchers providing expert and novel information and insights in their particular fields of interest. This is a book for scientists, clinicians, technicians, academics, specialists, graduate and postgraduate students and for all who are interested in DNA sequencing and bioinformatics across all fields of the life sciences. This book consists of 16 chapters presented in four sections. The first section, 'Genomics, Transcriptomics and Methylomics', contains five chapters starting with an overview of the basic tools and technological developments pertaining to NGS and 'omics', followed by examples of the application of NGS in the assembly of aquatic genomes, targeted NGS to genotype the polymorphisms of the MHC genomic region, NGS transcriptomic profiling and the computational analysis of methylome data. The three chapters in the second section, 'NGS of Microorganisms', cover the impact and progress of NGS techniques and the computational applications in the generation and analysis of NGS data for microorganisms, especially viruses and bacteria. The three chapters in the third section, 'NGS of Agricultural Plants', address the role of NGS in the study of the crop plants that sustain and feed humans and their livestock. The fourth and final section, 'NGS in Humanomics', consists of five chapters that focus on NGS in the analysis of ancestral haplotypes, ambiguities and quality measures in using NGS for genotyping polymorphic HLA genes, NGS in the diagnosis of inherited macrothrombocytopenias as a Mendelian disease using signature sequence markers, and NGS for the non-invasive detection of genetic diseases in the foetus using the pregnant mother's DNA, and it concludes with a chapter on the impact of RNA-seq data analysis on human gene annotation.

NGS is a vast and rapidly evolving area of science, and it is beyond the scope of this book to cover all the issues and topics related to this subject. The authors in this book are experts from various areas of NGS, structural and functional genomics, bioinformatics and complex data analysis who have devoted their time, despite their busy schedules, to write valuable and thought-provoking chapters in the few months allowed to them to meet the demanding deadlines. We thank them for their tireless dedication in overcoming the challenges and completing the book project on schedule. We welcome, savour and appreciate the information and knowledge imparted by these different authorities in their chapters presented in this book.

Last but not least, we thank the staff of InTech and the Publishing Process Manager Sandra Bakic for their valuable contribution to the editing and smooth publication of this book. We hope that it will become a valuable reference and a further inspiration for basic and practical research on the implementation of NGS technologies and bioinformatics and assist in elucidating all the wonders of the genetic code still waiting to be deciphered at the many different levels of biology, now and into the future.

#### **Jerzy K Kulski**

1 Department of Molecular Life Science, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine, Isehara, Japan

2 Centre for Forensic Science, The University of Western Australia, Nedlands, WA, Australia

**Genomics, Transcriptomics and Methylomics: Tools and Applications**
