and structural variations [14–16]. PCR-based candidate gene and whole-exome analysis are two widely used methods that can be performed with higher coverage and at much lower cost than whole-genome resequencing. Genotyping human HLA genes for clinical diagnosis or research by sequencing the entire gene [97, 128] or just the exons [129] is an example of targeted resequencing to identify polymorphisms that are important in tissue or cell matching for transplantation [130]. Exomics is targeted specifically towards coding genes and discovering exonic mutations responsible for rare Mendelian disorders such as hearing loss, intellectual disabilities, and movement disorders and for investigating common disorders such as heart disease, hypertension, diabetes, and cancer [13, 123, 125], and many others that are listed at the Online Mendelian Inheritance in Man (OMIM) database (Table 3, [49]). In contrast to WES, WGS can assess alterations in the coding genes and the regulatory and noncoding regions [123, 126], especially multiallelic copy number variations [127]. Cancer research has shown that it is important to target all types of somatic/germ-line genetic alterations, including nucleotide substitutions, small insertions and deletions (indels), CNVs, and chromosomal rearrangements in the noncoding regions [13, 15, 123]. WGS has been used to identify variants, indels, and multiple genes involved in rare and common diseases such as Charcot-Marie-Tooth neuropathy, dopa-responsive dystonia, acquired essential thrombocytosis, erythrocytosis, and many others [123, 126].

#### **8.2. Transcriptomics and RNA sequencing**

RNA-seq is the NGS method that sequences the transcriptome, that is, all the RNA transcript sets expressed by the genome in cells, tissues, and organs at different stages of an organism's life cycle [12, 18, 19, 20, 30]. High-throughput RNA sequencing using cDNA fragments was first employed in mammalian cells [131] and yeast [132], and now it is used for a wide range of organisms [133]. Without transcriptome data, the genome sequence alone is of limited use for understanding the intricacies of genome function in biology. RNA-seq provides technical reliability and sensitivity and unambiguous maps of the transcribed regions of the genome with high accuracy in quantitative expression levels, identification of tissue-specific transcript variants and isoforms (SNPs and mutations), transcription boundaries and splicing events, transcription factors, and small and large noncoding RNAs (ncRNA) involved in the regulation of gene expression [131–137].

At least 90% of the mammalian genome is actively transcribed to produce different classes of ncRNAs [135, 136], including ribosomal RNA (rRNA), transfer RNA (tRNA), microRNA (miRNA), small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), small interfering RNA (siRNA), PIWI-interacting RNA (piRNA), and large intergenic noncoding RNA (lincRNA) [138–141] and retrotransposons [142–146]. The known classes of functional ncRNAs consist largely of those supporting protein translation (ribosomal, transfer, and small nucleolar RNAs), transcript splicing (snRNAs) [137, 138], and miRNAs that target conserved binding sites of mRNAs to decrease their stability [139]. The new class of small piRNA was discovered to interact with PIWI regulatory proteins and RNA to silence transposons in the germ line and regulate gene expression in the soma [140]. The lincRNAs are expressed by a different class of actively transcribed RNA genes and they have diverse roles in processes such

26 Next Generation Sequencing - Advances, Applications and Challenges

#### **8.3. Methylomics and epigenomics**

Epigenomics is the study of heritable gene regulation that does not involve the DNA nucleotide coding sequence itself but acts on a genome-wide scale via DNA nucleotide methylation and posttranslational modifications of histones, the interaction between transcription factors and their targets, and nucleosome positioning [23–30]. Methylomics is the genome-wide analysis of DNA methylation and its effects on gene expression and heredity [28]. Methyl-seq uses NGS to analyze and map DNA cytosine methylation at single-base resolution, usually by employing bisulfite DNA sequencing [24, 25]. Treatment of genomic DNA with sodium bisulfite converts cytosines, but not methylcytosines, to uracils. After PCR amplification, the uracils are sequenced as thymines and appear as C-to-T differences from the reference sequence, much like SNPs, whereas the methylcytosines are still sequenced as cytosines. Bisulfite DNA sequencing is used widely for DNA methylation profiling in various organisms, as well as in humans for assessing disease genes [23, 27, 29].
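The bisulfite logic described above can be illustrated with a minimal Python sketch. It is not taken from any Methyl-seq pipeline: the function names, the toy reference sequence, and the choice of methylated position are all hypothetical, and real data would also require read alignment, handling of both strands, and correction for incomplete conversion.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment plus PCR on one DNA strand.

    Unmethylated C is read as T; a methylated C (0-based positions in
    methylated_positions) is protected and is still read as C.
    """
    methylated = set(methylated_positions)
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_methylation(reference, read):
    """Compare a converted read with the reference, base by base.

    Returns a dict mapping each reference C position to True (read
    shows C, so methylated) or False (read shows T, so unmethylated).
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must have equal length")
    calls = {}
    pairs = zip(reference.upper(), read.upper())
    for i, (ref_base, read_base) in enumerate(pairs):
        if ref_base == "C" and read_base in ("C", "T"):
            calls[i] = read_base == "C"
    return calls


reference = "ACGTCCGA"  # toy reference sequence (hypothetical)
read = bisulfite_convert(reference, methylated_positions=[1])
calls = call_methylation(reference, read)
print(read)   # ACGTTTGA
print(calls)  # {1: True, 4: False, 5: False}
```

Only the cytosine at position 1 survives conversion, so it is the only position called as methylated; the other two cytosines appear as thymines in the read.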

ChIP-seq is chromatin immunoprecipitation (ChIP) followed by NGS. It permits genome-wide profiling of DNA-binding proteins and histone and nucleosome modifications [30]. The ChIP-seq technology was partly adapted from microarray ChIP-chip technology and first implemented in 2007, and since then it has been used widely to analyze transcription factor binding sites, histone modifications, and chromatin-modifying complexes and sequences in a wide variety of organisms [154]. ChIP-seq provides higher resolution, less noise, and greater coverage than the array-based ChIP-chip method that was previously used, and therefore, it has become the preferred tool for studying gene regulation and epigenetic mechanisms. Two other NGS tools commonly used for epigenetic studies are Hi-C and ChIA-PET, which provide insights into the global 3D organization of eukaryotic genomes [30]. Hi-C utilizes NGS on cross-linked DNA fragments to identify the DNA regions such as promoters, enhancers, and insulators that come together to mediate their regulatory activities. ChIA-PET uses immunoprecipitation of cross-linked, interacting protein-DNA complexes and paired-end sequencing to reveal the interaction between enhancer and promoter regions located at intergenic distances away from each other but either on the same (*cis*) or different (*trans*) chromosomes [30]. de Wit and de Laat [155] provided an overview of the various derived chromosomal conformation capture (3C) methods, including 4C (chromosome conformation capture-on-chip) and 5C (chromosome conformation carbon copy) and their application in the study of chromatin interactions. Two epigenomic databases on the Internet, the NIH Roadmap Epigenomics Project and Blueprint (Table 3), catalogue the chemical modifications to the genome and how they activate gene expression in human tissues and cell types.
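To make the idea of Hi-C and ChIA-PET contact data more concrete, the sketch below bins paired-end contacts into fixed-size genomic windows and separates *cis* from *trans* contacts. It is a simplified illustration only: the input tuple layout, the bin size, and the toy read pairs are assumptions made for this example, and no normalization or statistical testing is attempted.

```python
from collections import Counter


def bin_contacts(pairs, bin_size):
    """Count contacts between genomic bins.

    pairs: iterable of (chrom1, pos1, chrom2, pos2), one per read pair.
    Returns a Counter keyed by an unordered pair of bins, where each
    bin is (chromosome, bin_index).
    """
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        a = (chrom1, pos1 // bin_size)
        b = (chrom2, pos2 // bin_size)
        # Sort so that (a, b) and (b, a) are counted as the same contact.
        counts[tuple(sorted((a, b)))] += 1
    return counts


def cis_trans_totals(counts):
    """Return (cis, trans) contact totals from a binned Counter."""
    cis = sum(n for (a, b), n in counts.items() if a[0] == b[0])
    trans = sum(n for (a, b), n in counts.items() if a[0] != b[0])
    return cis, trans


# Toy read pairs (hypothetical coordinates).
pairs = [
    ("chr1", 10_500, "chr1", 95_200),
    ("chr1", 12_000, "chr1", 98_900),
    ("chr1", 97_000, "chr1", 11_000),
    ("chr1", 15_000, "chr2", 40_000),
]
counts = bin_contacts(pairs, bin_size=10_000)
print(counts[(("chr1", 1), ("chr1", 9))])  # 3
print(cis_trans_totals(counts))            # (3, 1)
```

In this toy data set, three read pairs support the same long-range *cis* contact on chr1, which is the kind of repeated signal that would suggest a candidate enhancer-promoter interaction.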

#### **8.4. Proteomics, metabolomics, and systeomics**

Proteomics is the large-scale study of the structure, function, identification, and characterization of peptides and proteins [113, 156, 157]. This includes information on protein abundance, variations and polymorphisms, modifications, and their interactions and networks in cellular processes. As a first step, the sequence translation of open reading frames of genomes, exons, and transcripts using a codon table and one or more bioinformatics tools is the simplest way of constructing proteomic profiles from NGS data. However, this is not the only analytical protocol used in the domain of proteomics, and protein specialists often employ a variety of other hardware and software tools to build up an organism's peptide and protein profiles. Among these are the detection and analysis of proteins and their functions by two-dimensional polyacrylamide gels, liquid chromatography coupled with tandem mass spectrometry, affinity-tagged proteins, and yeast two-hybrid assays [156, 157]. A number of public databases for proteomics and protein-protein interactions are available on the Internet such as ExPASy and PRIDE (Table 3).
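The translation of an open reading frame with a codon table can be sketched in a few lines of Python. The example below builds the standard genetic code and translates the first ATG-initiated reading frame on the forward strand; the function name and the toy DNA sequence are hypothetical, and a real annotation tool would also scan all six reading frames, handle introns, and support alternative genetic codes.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_first_orf(dna):
    """Translate from the first ATG up to the first in-frame stop codon.

    Returns the protein as one-letter codes, or "" if no ATG is found.
    Codons containing unexpected characters are translated as "X".
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":  # stop codon ends the ORF
            break
        protein.append(aa)
    return "".join(protein)


print(translate_first_orf("CCATGGCTAAAGGTTGACC"))  # MAKG
```

The toy sequence contains the reading frame ATG GCT AAA GGT TGA, which translates to Met-Ala-Lys-Gly before the stop codon.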

Metabolomics is the study of an organism's total metabolic response to an environmental stimulus or a genetic modification [113]. The metabolomics of organisms are drawn indirectly from NGS data, mainly from the known functions of enzymes and proteins involved in metabolic and biochemical pathways. Metabolomics data also provide biochemical and physiological snapshots of processes that are obtained from cellular and tissue experimental studies using various technologies of separation (gas chromatography, high-performance liquid chromatography, and capillary electrophoresis) and detection (mass spectrometry, NMR spectrometry, ion mobility, and thin-layer chromatography) [158]. Metabolomics is an important part of functional genomics for determining the phenotypic effects of genetic manipulations such as gene deletions, insertions, and mutations. Nutrigenomics is an arm of metabolomics that links genomics, transcriptomics, proteomics, metabolomics, and microbiomics together in an examination of the effects of nutrition and energy metabolism on gene expression in relation to an organism's genotype [113, 159]. The use of constraint-based methods such as flux balance analysis to design models of metabolite flow in microbes has connected "omic" data to phenotypes in the science of fluxomics [160]. Some web-based metabolomic resources include FAME, AromaDeg, Metabolomexpress, and MetaboAnalyst (Table 3).
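For readers unfamiliar with constraint-based modeling, flux balance analysis is commonly written as a linear program. The formulation below is the generic textbook form rather than a model from the cited work, and the symbols are assumptions introduced here: $S$ is the stoichiometric matrix (metabolites by reactions), $v$ is the vector of reaction fluxes, $c$ selects the objective (often biomass production), and $v^{\min}$ and $v^{\max}$ are flux bounds.

$$
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S\,v = 0, \\
& v^{\min} \le v \le v^{\max}
\end{aligned}
$$

As a hypothetical toy case, consider three reactions in series: uptake of metabolite A ($v_1$), conversion of A to B ($v_2$), and use of B for biomass ($v_3$). The steady-state condition becomes

$$
S\,v =
\begin{pmatrix}
1 & -1 & 0 \\
0 & 1 & -1
\end{pmatrix}
\begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix}
= 0
\quad\Longrightarrow\quad v_1 = v_2 = v_3 ,
$$

so if uptake is bounded by $0 \le v_1 \le 10$ and the objective is $v_3$, the optimal biomass flux is $v_3 = 10$. The predicted phenotype (growth) is thus limited by the uptake constraint, which is the sense in which such models connect genome-derived reaction networks to phenotypes.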

Systeomics is the integration of genomics, proteomics, metabolomics, and phenomics into a single network system. It is a branch of biology that uses computational techniques to analyze and model how the components of a biological system such as cells or organisms interact with each other to produce the characteristics and behavior of that system [160–162]. Systeomics is a biology-based interdisciplinary field applied to biomedical and biological scientific research that focuses on complex interactions within biological systems using a holistic approach. For example, the U.S. Department of Energy's Genomic Science program uses microbial and plant genomic data, high-throughput analytical technologies, and modeling and simulation to develop a predictive understanding of biological systems behavior relevant to solving energy and environmental challenges (http://doegenomestolife.org). The U.S. Department of Energy Systems Biology Knowledgebase (KBase) is a software and data platform for systems biology mechanisms (Table 3) to assist with the prediction and design of biological functions of microbes and plants. KBase integrates data, tools, and their associated interfaces into one unified, scalable environment to allow users to upload their own data for analysis, to build models, and to share and publish their workflows and conclusions. Another example is the Kyoto Encyclopedia of Genes and Genomes (KEGG), which is a database resource to integrate high-level functions and utilities of biological systems from molecular-level information (Table 3). Other "omics" that contribute to the "omic" lexicon and biology are epidemiomics [163], physionomics [113], variomics [164], and phenomics [165–167]. In the case of phenomics, the European Genome-phenome Archive (EGA) provides accession numbers that refer to the relationship between genomics and phenotype/traits, such as the physical and biochemical traits of humans (Table 3). It integrates physical traits or phenotypes with genomics, transcriptomics, methylomics, proteomics, and metabolomics [166].

#### **8.5. Metagenomics and microbiomes**

Metagenomics, or beyond genomics, is the study of the total genomic content of a microbial community that bridges the three domains of life, Archaea, Bacteria, and Eukaryotes [100, 114–118, 168–179]. The total DNA and/or RNA is isolated from a microbial population without prior cultivation, sequenced, and compared with previously known sequences to identify known species or to discover previously unknown species. Some of the environments from which microbial communities are isolated and studied include aquatic and terrestrial environments, host-associated ecosystems, and various human-engineered systems such as those involved with food, water, and waste production, agriculture, animal husbandry, and various human and animal habitations [100, 115, 168, 169]. Hospitals are a worrying source of pathogenic microorganisms, especially those that develop resistance to commonly used medical antibiotics [115, 168]. Thus, NGS is an important growing application for epidemiological studies of various pathogens, such as mycobacteria, *S. aureus*, *E. coli*, cholera, influenza, HIV, Ebola virus, etc. [169–171]. The Earth Microbiome Project (http://www.earthmicrobiome.org) is an ambitious multidisciplinary attempt to analyze microbial communities across the globe using approximately 500,000 reconstructed microbial genomes.
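The step of comparing sequenced reads with previously known sequences can be sketched with a simple k-mer similarity measure. This is a deliberately simplified stand-in for the alignment-based and database-driven classifiers used in practice; the function names, the similarity threshold, the k-mer length, and the toy reference sequences are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of all overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify_read(read, references, k=4, min_similarity=0.3):
    """Assign a read to the most similar reference, or 'unclassified'.

    references: dict mapping a species label to a reference sequence.
    Returns (label, best_similarity).
    """
    read_kmers = kmers(read, k)
    best_name, best_score = None, 0.0
    for name, ref_seq in references.items():
        score = jaccard(read_kmers, kmers(ref_seq, k))
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_similarity:
        return "unclassified", best_score
    return best_name, best_score


# Toy reference "database" (hypothetical sequences and labels).
references = {
    "species_A": "ACGTACGTTAGCTAGC",
    "species_B": "GGGCCCGGGAAATTTCCC",
}
print(classify_read("ACGTACGTTAGC", references)[0])  # species_A
print(classify_read("TTTTTTTTTT", references)[0])    # unclassified
```

A read that shares no k-mers with any reference is reported as unclassified, which corresponds loosely to the candidate "previously unknown species" mentioned above.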

The earliest metagenomic studies targeted 16S rRNA genes to genotype and identify the different species within the environment before the first NGS microbial studies using the Roche pyrosequencing and Illumina platforms targeted mining sites and the surface waters of the gulfs, seas, and oceans [114, 169]. Many big projects and consortia for sequencing metagenomes have been launched in the past 10 years, such as the TerraGenome project for soils (Table 3) and the *Tara* Oceans project on the microbiome, eukaryotic plankton, and viromes of the global oceans [172–174].

Microbes colonize the human body (microbiome) in numbers that are estimated to outnumber human genes and somatic cells by more than 100-fold [175]. These microbes (viruses, prokaryotes, and eukaryotic microbes) occupy various anatomical habitats including gut, skin, vagina, and oral mucosa and are believed to markedly influence human physiology, nutrition, and health [175–177]. Advances in NGS and computing methods have enabled human microbiome studies such as the MetaHIT project and the Human Microbiome Project (HMP) (Table 3). In May 2015, the SRA, which was established by NCBI in 2008 to store raw sequence data from NGS technologies, held over 2,068 trillion open-access nucleotides, massively outgrowing GenBank, EMBL, and DDBJ for bacterial sequence storage. The genomic sequences continue to accumulate in other databases as well [114], such as 47,083 prokaryotic genomes projected for the Genomes Online Database (GOLD) [178] and 152,927 metagenomes for the MG-RAST server [179]. As of October 2014, the GOLD database contained 544 metagenomics studies associated with 6,726 metagenome samples, whereas MG-RAST held 150,039 metagenomic samples, of which 20,415 were publicly available (Table 3). Recently, Zelezniak et al. [180] gathered and modeled NGS 16S rRNA sequence data to map interspecies metabolic exchanges and resource competition based on the genomic potential encoded by the microbial communities. They analyzed more than 1,297 communities and 261 species in soil, water, and human gut samples and concluded that the interplay between competitive and cooperative strategies for resources and the ability to exchange metabolites, such as amino acids and sugars, shapes the composition of microbial communities.

#### **8.6. Comparative genomics, phylogenomics, and the phylomes of life**

Comparative genomics and phylogenomics via NGS and the phylome (complete collection of all gene phylogenies in a genome) provide powerful applications for classifying and understanding the differences and similarities of all life forms and for unraveling their evolutionary histories [100, 116, 176, 181–186]. The three basic domains of life, Bacteria, Archaea, and Eukarya, were first identified and classified phylogenetically on the basis of ribosomal RNA sequences [181]. Although Bacteria and Archaea are both placed into the kingdom of the Prokaryotes or Monera (lacking a membrane-bound nucleus, mitochondria, and chloroplasts but containing a cell wall), their separate rRNA sequence clusters clearly divided them into distinct domains [181]. The Eukarya (eukaryotes) have been subdivided into four basic kingdoms, Protista, Fungi (Mycota), Plantae (Metaphyta), and Animalia (Metazoa) [182]. However, on the basis of metagenomic and phylogenomic studies and NGS data, the classifications and nomenclatures of eukaryotes continue to be revised and organized into other supergroups such as Amoebozoa, Opisthokonta, Excavata, Archaeplastida (Plantae), SAR (Stramenopiles, Alveolata, and Rhizaria), and Incertae sedis [183, 184]. On the other hand, because viruses do not have rRNA genes, they have missed out on a life-domain classification [185, 186]. There is still a strong debate about whether viruses without rRNA genes should be classified as a separate life form (a fourth domain) or simply be viewed as exogenous parasites, infectious agents, and endogenous mobile elements that are dependent on and exist within the life forms of the three defined domains [185, 186]. Viruses impact greatly on all life forms, so they are a major interest for NGS applications and phylogenomics [34, 114, 174, 187–189], especially emerging viruses such as dengue, Ebola, Chikungunya, MERS, lyssavirus, and norovirus (http://viralzone.expasy.org), which are of great concern to human health [114, 171, 189].
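Classification on the basis of rRNA sequence similarity ultimately rests on pairwise distances between aligned sequences. The sketch below computes the proportion of differing sites (p-distance) and the Jukes-Cantor corrected distance, a standard simple substitution model, for a few toy sequences. The sequences and labels are invented for illustration, the alignment is assumed to be given, and real phylogenomic analyses use far longer alignments, richer models, and tree-building methods.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = [
        (x, y) for x, y in zip(a.upper(), b.upper())
        if x != "-" and y != "-"
    ]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(1 for x, y in sites if x != y) / len(sites)


def jukes_cantor(p):
    """Jukes-Cantor distance; infinite when p reaches saturation (0.75)."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy aligned rRNA fragments (hypothetical).
alignment = {
    "bacterium_1": "ACGTACGTAC",
    "bacterium_2": "ACGTACGTAT",
    "archaeon_1": "ACTTAGGTCC",
}
distances = {
    (n1, n2): jukes_cantor(p_distance(alignment[n1], alignment[n2]))
    for n1, n2 in combinations(alignment, 2)
}
closest = min(distances, key=distances.get)
print(closest)  # ('bacterium_1', 'bacterium_2')
```

Here the two toy bacterial sequences differ at one site in ten and therefore cluster together, while the toy archaeal sequence is more distant from both.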

NGS, phylogenomics, and taxonomy profiling during the past decade have greatly expanded our knowledge of the diversity of bacterial genomes from the same and different species [116, 190], with the discovery of many DNA sequence repeats and transposons that contribute to at least 10% of the genome and play an important role in immunity [100, 191]. Archaea and thermophiles have a large proportion of their genomes consisting of defense genes often localized in genomic islands as a consequence of horizontal gene transfers [191, 192]. For example, the family of clustered regularly interspaced short palindromic repeats (CRISPRs) and the CRISPR-associated proteins in the CRISPR-Associated System (CAS), which have an important role in the host's adaptive immunity to pathogens and as responders to environmental stress [192–194], have been translocated between different prokaryote strains and species [191, 192]. CAS includes 50 or more distinct gene families that show strong evidence of extensive plasticity and horizontal gene transfer to protect prokaryote cells against the replication of phage and plasmids that integrate into the CRISPR locus [193–195]. Moreover, the CRISPR/CAS systems have been developed as an "in vitro" genetic engineering tool to be transfected into the cells of various organisms to manipulate their genes [196], including the foreign defense system introduced into human cells against HIV-1 infection [197]. Other bacterial defense systems that have been studied or discovered by NGS include the toxin/anti-toxin, antigen, novel restriction-modification, and DNA phosphorothioation systems as well as those involved with infection-induced dormancy or programmed cell death [192]. Genomic sequencing also has revealed new bacterial microcompartments, protein structures, or organelles that are used in metabolic pathways [198], such as those involved in carbon fixation and metabolism of amino alcohols, ethanol, rhamnose, and fucose [199]. Bacterial genomes also provide sequences for phylogenetic and gene comparisons, taxonomic classification, transcriptomics, and methylomics and for the assessment of sequence diversity and variants for a better understanding of gene functions [100]. Although the classical operon structure predominates in bacteria and archaea, a variety of other transcription unit architectures have been elucidated [100]. More than 4,661 transcription units have been described with an average of 1.7 promoters per operon, and transcription factor binding sites have been determined for virtually all the transcription factors in *E. coli* [100]. DNA methylation was first discovered in bacterial restriction-modification systems with diverse functions in addition to cellular defense [200], and it is now seen as an evolutionarily conserved form of transcriptional repression and an ancestral form of defense against foreign DNA molecules and transposons and other mobile elements in all life forms [201].

Phylogenomics has been used to reevaluate the evolutionary affiliation between archaea and eukaryotes and to infer that the nuclear lineage in eukaryotes emerged from the archaeal radiation and most probably from the archaeal TACK superphylum [202]. Recently, Spang et al. [203] sequenced uncultivated metagenomes from a deep-sea vent and discovered novel archaeal genomes in the new phylum that they named "Lokiarchaeota." These novel archaea contain homologues of many eukaryotic proteins that function in the endomembrane system and in phagocytosis, including actin and related proteins, and Ras superfamily GTPases, suggesting that this newly discovered phylum is the missing link in eukaryogenesis. Although eukaryotes possess the membrane-enclosed mitochondrial organelle and prokaryotes do not, the eukaryotic mitochondria are believed to have evolved from a bacterial system, probably by endosymbiosis [204] involving an ancestor within the bacterial phylum Alphaproteobacteria [205]. Although mitochondrial phylogenomics suggests a monophyletic origin and assemblage, it is now evident that the mitochondria are genetic chimeras and functional mosaics with the bulk of the mitochondrial proteome originating during eukaryote evolution outside the Alphaproteobacteria and other bacterial phyla. It seems that the mitochondrial genome has expanded and contracted in various lineages during evolution with much of the original mitochondrial genetic information transferred to the nucleus [205]. Eukaryotic diploid cells appear to have evolved 2 billion years after haploid prokaryotes, and their evolution from proto-eukaryotic cells, such as the multinucleated *Giardia* organism [206], seems to have involved chromosomal crossing over from mitotic recombination to meiosis and to sexual reproduction where a set of chromosomes is inherited from each parent [207]. The genomes of diploid eukaryotes are usually larger than those in haploid prokaryotes probably because greater information complexity is needed by multicellular organisms to regulate and coordinate the multiple stages of their life cycles with the added requirement for more molecular regulatory systems to communicate and interact between multiple tissues and organs [206].

Eukaryotic genomes vary markedly in size and gene number and appear to be variable in their susceptibility to polyploidy (a doubling of the diploid sets of chromosomes), redundancy, duplication, and the persistent accumulation of interspersed repeats and mobile elements [208–210]. For example, the genomes of plants can range from the simplest like *Ostreococcus tauri* with a 12.6 Mb genome, containing fewer than 8,000 genes and minimal genome duplication [211], to the highly complex such as the canopy and pale-petal flowering plant *Paris japonica*, with a 150 Gb genome and eight sets of chromosomes derived by allopolyploidy and hybridization of four species [212]. The genomic size of *Paris japonica*, which has still to be fully sequenced, is 50 times larger than the human genome and extends the range of genome sizes to 2,400-fold across angiosperms and 66,000-fold across eukaryotes [212]. Genome duplication and polyploidy, both recent and ancient, have contributed to the considerable genomic complexity in eukaryotes, particularly in plants, amoeba, fungi, and vertebrates [208–223]. Following ancient polyploidization, most duplicated genes are deleted by intrachromosomal recombination, a process referred to as fractionation, and any remaining evidence for the polyploidy event is not easy to find by phylogenomic analysis [214]. Nevertheless, a phylogenomic comparison of gene duplications in a four-way comparison of paralogous regions in tunicate, fish, mouse, and human provided unmistakable evidence of two distinct genome duplication events (the 2R event) early in vertebrate evolution and before the divergence of fish and mammalian lineages [215], as was proposed by Ohno in 1970 [216]. Interestingly, polyploidy also can occur in humans during normal development and cancer [208, 209]. Fetal polyploidy in the form of triploidy (69,XXX chromosomes) and tetraploidy (92,XXXX chromosomes) is a rare and lethal event, resulting in spontaneous abortions or brief postpartum survival times [208], whereas polyploidy is common in stressed tissues and cells and in tumor development [208, 209]. On the other hand, comparative genomic studies have revealed that polyploidy is common in the evolutionary history of many different flowering plants [208, 214], for example, between different species of the allopolyploid tobacco plants, *Nicotiana* section Repandae [217]. In comparing the allotetraploid genomes of *Nicotiana repanda* and *Nicotiana nudicaulis*, it was assessed that the loss of low-copy sequences along with the loss of duplicate copies of genes and upstream regulators reflects genome diploidization, whereas genome size divergence between the allopolyploids is manifested through differential accumulation and/or deletion of high-copy-number sequences and transposable elements [217]. Diploidization and genome size change in *Nicotiana* allopolyploids are associated with differential dynamics of low- and high-copy sequences [218]. The induction of polyploidy is a common technique to overcome the sterility of a hybrid species during plant breeding; therefore, many agriculturally important plants such as the genus *Brassica* are polyploids [219–221]. Wheat, after millennia of hybridization and modification by humans, has strains that are diploid (2 sets of chromosomes), tetraploid (4 sets of chromosomes), and hexaploid (6 sets of chromosomes) [222, 223], whereas the invasive weed *Spartina anglica* has up to 12 sets of chromosomes [224].
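The genome size comparisons quoted above can be checked with simple arithmetic. The sketch below uses only the figures given in the text (150 Gb, 12.6 Mb, and the 50-fold, 2,400-fold, and 66,000-fold ratios); the "implied" sizes are back-calculated from those ratios for illustration and are not measurements reported in the source. The variable names are hypothetical.

```python
# Genome sizes in base pairs, as quoted in the text.
PARIS_JAPONICA_BP = 150_000_000_000   # 150 Gb
OSTREOCOCCUS_TAURI_BP = 12_600_000    # 12.6 Mb

# "50 times larger than the human genome" implies a human genome of:
human_implied_bp = PARIS_JAPONICA_BP / 50

# Fold difference between the two plant genomes named in the text.
fold_vs_ostreococcus = PARIS_JAPONICA_BP / OSTREOCOCCUS_TAURI_BP

# Smallest genomes implied by the quoted fold ranges (back-calculated).
smallest_angiosperm_implied_bp = PARIS_JAPONICA_BP / 2_400
smallest_eukaryote_implied_bp = PARIS_JAPONICA_BP / 66_000

print(f"implied human genome: {human_implied_bp / 1e9:.1f} Gb")
print(f"P. japonica vs O. tauri: {fold_vs_ostreococcus:,.0f}-fold")
print(f"implied smallest angiosperm: {smallest_angiosperm_implied_bp / 1e6:.1f} Mb")
print(f"implied smallest eukaryote: {smallest_eukaryote_implied_bp / 1e6:.2f} Mb")
```

Under these figures, the implied human genome is about 3 Gb, and *Paris japonica* is roughly 11,900 times larger than *Ostreococcus tauri*, which illustrates how wide the range of genome sizes is even within plants.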

tions [100]. Although the classical operon structure predominates in bacteria and archaea, a variety of other transcription unit architectures have been elucidated [100]. More than 4,661 transcription units have been described with an average of 1.7 promoters per operon, and transcription factor binding sites have been determined for virtually all the transcrip‐ tion factors in *E. coli* [100]. DNA methylation was first discovered in bacterial restrictionmodification systems with diverse functions in addition to cellular defense [200], and it is now seen as an evolutionarily conserved form of transcriptional repression and an ancestral form of defense against foreign DNA molecules and transposons and other mobile elements

Phylogenomics has been used to reevaluate the evolutionary affiliation between archaea and eukaryotes and to infer that the nuclear lineage in eukaryotes emerged from the archaeal radiation and most probably from the archaeal TACK superphylum [202]. Recently, Spang et al. [203] sequenced uncultivated metagenomes from a deep-sea vent and discov‐ ered novel archaeal genomes in the new phylum that they named "Lokiarchaeota." These novel archaea contain homologues of many eukaryotic proteins that function in the endomembrane system and in phagocytosis, including actin and related proteins, and Ras superfamily GTPases, suggesting that this newly discovered phylum is the missing link in eukaryogenesis. Although eukaryotes possess the membrane-enclosed mitochondrial organelle and prokaryotes do not, the eukaryotic mitochondria are believed to have evolved from a bacterial system, probably by endosymbiosis [204] involving an ancestor within the bacterial phylum Alphaproteobacteria [205]. Although mitochondrial phylogenomics suggests a monophyletic origin and assemblage, it is now evident that the mitochondria are genetic chimeras and functional mosaics with the bulk of the mitochondrial proteome originating during eukaryote evolution outside the Alphaproteobacteria and other bacteri‐ al phyla. It seems that the mitochondrial genome has expanded and contracted in various lineages during evolution with much of the original mitochondrial genetic information transferred to the nucleus [205]. Eukaryotic diploid cells appear to have evolved 2 billion years after haploid prokaryotes, and their evolution from proto-eukaryotic cells, such as the multinucleated *Giardia* organism [206], seems to have involved chromosomal crossing over from mitotic recombination to meiosis and to sexual reproduction where a set of chromosomes is inherited from each parent [207]. 
The genomes of diploid eukaryotes are usually larger than those in haploid prokaryotes probably because greater information complexity is needed by multicellular organisms to regulate and coordinate the multiple stages of their life cycles with the added requirement for more molecular regulatory systems

to communicate and interact between multiple tissues and organs [206].

Eukaryotic genomes vary markedly in size and gene number and appear to be variable in their susceptibility to polyploidy (a doubling of the diploid sets of chromosomes), redundan‐ cy, duplication, and the persistent accumulation of interspersed repeats and mobile elements [208–210]. For example, the genomes of plants can range from the simplest like *Ostreococ‐ cus tauri* with a 12.6 Mb genome, containing less than 8,000 genes and minimal genome duplication [211], to the highly complex such as the canopy and pale-petal flowering plant *Paris japonica,* with a 150 Gb genome and eight sets of chromosomes derived by allopolyploi‐

in all life forms [201].

A recent comparative genomic study has revealed how genomes change with speciation by examining the genomes of five cichlid fish species: one from an ancestral lineage in the Nile and four from the East African lakes Tanganyika, Malawi, and Victoria [225]. Compared to the ancestral Nile lineage, the East African cichlid genomes had many alterations in regulatory elements, accelerated evolution of protein-coding elements in genes for pigmentation, an excess of gene duplications, and other distinct features that affect gene expression associated with transposable element insertions and novel microRNA. Each species contains a reservoir of mutations different from the other species [225]. Much of the diversity between species evolves in a nonparallel manner, often rapidly, due to sexual selection and genetic conflicts between males and females and between different regions of the genome at a regulatory level rather than by the slower and weaker forces of classical natural selection [226].

The genes in most genomes range from newly derived genes to the ultraconserved or essential core of coding and noncoding genes [100, 227, 228]. Comparative genomics has resulted in the discovery of ultraconserved noncoding elements (UCNE) across different phyla, starting with 481 segments longer than 200 bp that are 100% conserved between orthologous regions of the human, rat, and mouse genomes and 95% to 99% conserved in chicken and dog genomes [229]. A more recent comparison of 28 vertebrate genomes identified millions of additional conserved elements with distinct types of functional elements including regulatory motifs present in the promoters and untranslated regions of coregulated genes, insulators that constrain domains of gene expression, and conserved secondary structures in RNAs and in developmental regulators [230]. A webpage at http://ultraconserved.org provides study protocols, computer software, and references dedicated to ultraconserved elements [229]. Also, there are at least two databases for the conserved noncoding elements and the genomic regulatory blocks (Table 3), the UCNEbase for human and chicken [231], and the UCbase 2.0 for the 481 UCNE that were longer than 200 bp and that were discovered in the genomes of mammals [229]. The UCNEbase suggests that the evolution of species depends more on innovation and change in regulatory sequences than in proteins [231]. Indeed, there are essential genes that are indispensable for the survival of an organism and therefore are considered a foundation of life. The database of essential genes (DEG) (Table 3) catalogues known essential genomic elements, such as protein-coding genes and noncoding RNAs, within the bacteria, archaea, and eukaryotes that constitute a minimal genome and are useful for annotating newly sequenced genomes [232].
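The idea of an ultraconserved element, a stretch of aligned sequence that is identical in every genome compared, can be illustrated with a minimal Python sketch. It assumes that the orthologous regions have already been aligned into rows of equal length with `-` marking gaps; the function name `conserved_runs`, the toy alignment, and the treatment of `N` as nonconserved are illustrative assumptions and do not reproduce the pipeline used in [229].

```python
from typing import List, Tuple


def conserved_runs(aligned: List[str], min_len: int = 200) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column intervals of the alignment in which
    every row carries the same unambiguous base, keeping only runs of at least
    min_len columns."""
    if not aligned:
        return []
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        raise ValueError("all aligned sequences must have the same length")

    runs: List[Tuple[int, int]] = []
    start = None  # first column of the run currently being extended
    for col in range(width):
        column = {row[col].upper() for row in aligned}
        # A column counts as conserved only if all rows agree on a real base.
        identical = len(column) == 1 and "-" not in column and "N" not in column
        if identical:
            if start is None:
                start = col
        else:
            if start is not None and col - start >= min_len:
                runs.append((start, col))
            start = None
    # Close a run that extends to the last column of the alignment.
    if start is not None and width - start >= min_len:
        runs.append((start, width))
    return runs


# Toy three-way alignment (hypothetical sequences, far shorter than real UCNE).
toy_alignment = [
    "ACGTACGT-AC",
    "ACGTACGTTAC",
    "ACGTTCGT-AC",
]
print(conserved_runs(toy_alignment, min_len=3))
```

With the toy alignment and `min_len=3`, the function reports the intervals (0, 4) and (5, 8); the two-column run at the end is discarded as too short. Applying the threshold described in the text would mean requiring runs longer than 200 columns in a whole-genome alignment.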

Phylomes provide the combined analysis of genome-wide collections of phylogenetic trees to aid in the inference of orthologous and paralogous relationships and the detection of evolutionary events such as whole-genome duplication (polyploidization), gene family expansion and contraction, horizontal gene transfer, recombination, inversion, and incomplete lineage sorting [233, 234]. The online PhylomeDB v4 database was created as a phylogenomic repository and is useful for preliminary phylogenetic data analysis of genomes of interest from various phyla as well as for annotating newly derived genomic sequences [234]. As an example, Fig. 1 shows the PhylomeDB analysis of the duplications of the RLTPR gene, a gene that was first discovered in humans in 2004 [235]. The PhylomeDB analysis shows that the RLTPR gene has two paralogs, LRRC16B and LRRC16, which were generated by two separate duplication events at least prior to the divergence of mice and humans (Fig. 1). The functions of RLTPR are not well characterized, but its distinct functional domains suggest that it may multitask in protein-protein interactions, as recently demonstrated in the development of regulatory T cells in mice [236]. The analytical approach to find orthologous and paralogous relationships with maximum genomic coverage for the RLTPR gene is both gene-centric and genome-wide in PhylomeDB. Also of particular interest are the well-conserved genomic mechanisms of innate immunity, such as Apolipoprotein B Editing Catalytic subunit proteins 3 (APOBEC3s) in mammals that mutate and inactivate viral genomes [237]. Other phylogenetic databases that complement PhylomeDB in a comparative analysis are Ensembl Compara GeneTrees, TreeFam, PANTHER, PhyloFacts FATCAT, and the HOGENOM database (Table 3).

**Figure 1.** RLTPR gene tree shows the RLTPR gene orthologs and paralogs in 10 vertebrate species. The human gene RLTPR (NCBI Gene ID: 146206), first reported in 2004 [235], was used as the search query for the Phylome tree at http://phylomedb.org with the phylome data settings of AS seed in (Qf0) mouse phylome (2) and JTT (lk:-27586.1). The tree shows the speciation events (blue squares) and three duplication events (red squares) at the nodes with the first duplication event arising early in vertebrate evolution before the divergence of fish and mammalian lineages [215].
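One common heuristic for labeling the nodes of a gene tree as speciation or duplication events is the species-overlap rule: a node is called a duplication when the sets of species found below its child branches share at least one species, and a speciation otherwise. The Python sketch below applies that rule to a small nested-tuple tree. The leaf names, the `species_gene` naming convention, and the toy topology are hypothetical and are not taken from Fig. 1, and whether this rule matches the exact settings used to draw Fig. 1 is an assumption.

```python
from typing import FrozenSet, List, Set, Tuple, Union

# A tree is either a leaf name "species_gene" or a tuple of subtrees.
Tree = Union[str, tuple]
Event = Tuple[str, FrozenSet[str]]


def species_of(leaf: str) -> str:
    """Extract the species prefix from a leaf named 'species_gene'."""
    return leaf.split("_", 1)[0]


def classify(tree: Tree, events: List[Event]) -> Set[str]:
    """Walk the tree in post-order, append one (event, species set) record per
    internal node to `events`, and return the species found below `tree`."""
    if isinstance(tree, str):
        return {species_of(tree)}
    seen: Set[str] = set()
    overlap = False
    for child in tree:
        child_species = classify(child, events)
        if seen & child_species:
            overlap = True  # the same species occurs under two child branches
        seen |= child_species
    events.append(("duplication" if overlap else "speciation", frozenset(seen)))
    return seen


# Hypothetical toy topology with three gene copies in two species.
toy_tree = (
    (("human_RLTPR", "mouse_RLTPR"), ("human_LRRC16B", "mouse_LRRC16B")),
    ("human_LRRC16", "mouse_LRRC16"),
)
events: List[Event] = []
classify(toy_tree, events)
for kind, species in events:
    print(kind, sorted(species))
```

For this toy tree, the rule labels the three human-mouse splits as speciation nodes and the two deeper nodes as duplications, which is the pattern expected when two duplication events precede the divergence of mice and humans.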

#### **8.7. Mobilomics and Horizontal Gene Transfer (HGT)**

The science of mobile genetic elements (mobilomics) developed long before the advent of genomics and NGS [238]. The 1983 Nobel Prize winner Barbara McClintock first reported the existence of mobile elements as jumping genes in maize in the late 1940s [239]. The discovery of new classes and families of DNA transposons and autonomous and nonautonomous retrotransposons continued slowly for the next five decades until the first online repeat element screening webserver CENSOR and database REPBASE (Table 3) were established by Jerzy Jurka and his colleagues between 1992 and 1996 [240, 241]. Since then, RepeatMasker (Table 3) and other tools such as Mobster [242], Red [243], and Visual TE [244] have followed on to help define the mobilome, the totality of mobile genetic elements in a particular genome. A list and description of some of the families, types, and classes of transposons and retrotransposons in prokaryotes and eukaryotes can be found in the following reviews [238, 245–251]. A recent survey of repeats and mobile elements that affect genomic stability has elucidated how some bacteria can control the mobilome through postsegregation killing systems [192–195, 247]. Different classes of transposable elements (TEs) are found in the genomes of different eukaryotes, where they contribute at least 50% of the human genome [237] and up to 90% of the maize genome [252]. In humans, there are solitary Long Terminal Repeats (LTRs) and LTR retrotransposons (endogenous retroviruses, ERVs) that are characterized by the presence of LTRs at both ends; Long Interspersed Nuclear Elements (LINEs) like L1 that represent families of non-LTR TEs about 6 kb in length and encode two proteins, a nucleic acid chaperone and a reverse transcriptase/nuclease for retrotransposition; Nonautonomous Miniature Inverted-Repeat Transposable Elements (MITEs); Mammalian-wide Interspersed Repeats (MIRs), an ancient family of tRNA-derived SINEs exapted as enhancers and regulatory sequences; and Short Interspersed Nuclear Elements (SINEs) like Alu that are usually less than 300 bp and need a helper transposon element like L1 for transposition [245]. Most ERVs, SINEs, and LINEs in the human genome are now remnants of past insertions and are no longer capable of actively "jumping" like functional TEs [238, 245, 248]. Indeed, many of the ancient TE relics have undergone exaptation and developed new functions, such as transcript repeat elements, within regulatory gene networks to generate lineage-specific adaptation [145, 249].

The importance of widespread HGT in creating genomic diversity in microbes has been highlighted by the many comparative genomic studies using metagenome data [191]. Comparative genomic analysis of different strains of *E. coli* revealed that up to 30% of genes in pathogenic strains were acquired by HGT often creating duplication events and modifying metabolic networks by adding operons that encode two or more enzymes [253]. Comparative genomics of photosynthetic prokaryotes revealed that they have evolved as complex mosaics via multiple HGT events [254]. Similarly, photosynthetic gene clusters and gene clusters that encode various toxins, resistance genes, metabolic genes, and components of secretion systems appear to be the products of HGT [247, 253–255]. Indeed, many HGT events probably were mediated by genomic mobile elements, such as bacteriophages, plasmids, viruses, transposable elements, and toxin/antitoxin systems that are persistent in all life forms [191, 228, 246, 255, 256].

Before the new millennium, transposons and repeat elements were largely viewed as junk and as parasites that created an unnecessary burden on the genome. Comparative genomics and online databases dedicated to transposons and repeat elements such as SINEs, LINEs, and ERVs, however, began to change this picture in the 1990s, and it soon became evident that these elements were drivers of evolutionary innovation. Many integrated transposons mutate with time to interact with the host transcriptional machinery and therefore provide a useful substrate for evolution of novel regulatory elements [145, 228, 255–258]. Moreover, some of the ancient integrated retrotransposons appear to have been involved in advantageous segmental genomic duplications such as in the major histocompatibility complex region [259–261], and others have dispersed regulatory controls to provide coordinated regulation across the genome [257, 258].

#### **8.8. Agrigenomics**

Agrigenomics or agricultural genomics can be defined as the research and development activities that translate NGS and genomics technology into a better understanding of plant biology and into crop improvements. During the past decade, NGS has had an enormous impact on developing fundamental genome resources to directly address many of today's concerns in agriculture and agronomics. Since the publication in 2000 of the first plant genome, *Arabidopsis thaliana*, 54 new plant genomes were published by 2013 [221] followed by at least another 6 plant genomes including the hexaploid bread wheat genome [223]. In reviewing the first 55 plant genomes, Michael and Jackson [221] concluded that, although these genomes have provided a glimpse at the gene number, types, and numbers of repeats and genomic growth, contraction, and rearrangement, we are only just at the beginning of defining the functional aspects of plant genomes "and various other 'omics' data layered on genomes."
