**7. Bioinformatics: DNA and RNA data analysis and storage**

quality scores tend to overestimate the base accuracy. A key consideration for generating high-quality, unbiased, and interpretable data from next-generation sequencing studies is to achieve sufficient sequence depth and coverage for statistical certainty. Low sequencing depth can contribute to high error rates stemming from base calling and mapping errors, which in turn can affect the statistical significance for identifying true genotypes, nucleotide variants, and single nucleotide polymorphisms. Increased depth of coverage can help sequence alignment mapping to differentiate between true variants and errors, although it might not resolve errors due to assembly gaps. Good sequence library preparation is paramount to producing good sequence depth and coverage, and a number of different library methods are available to achieve this goal depending on the NGS application [55]. Sims et al. [92] critically reviewed the guidelines and precedents for optimal sequencing depth and coverage in regard to sequencing genomes, exomes, transcriptomes, methylomes, and epigenomes by chromatin immunoprecipitation and sequencing and/or chromosome conformation capture.

No single study has compared the performance of all the available NGS platforms simultaneously using the same control genomic sequences. However, a comparison of three bench-top sequencers, the Roche GS Junior, the Illumina MiSeq, and the Ion PGM, revealed large differences in cost, sequence capacity, and performance outcomes of genome depth, stability of coverage and read lengths, and quality for sequencing bacterial genomes [54, 93]. Most sequencing errors arose with indel polymorphisms, GC-rich regions, and homopolymeric regions. Overall, the two laboratories concluded that all the machines had certain limitations that needed to be taken into account when designing sequencing experiments [54, 93]. In a comparison of bacterial genome sequencing between PacBio, Ion Torrent, and three Illumina machines (MiSeq, GAIIx, and HiSeq 2000), the sequencers all provided high accuracy for GC-rich, neutral, and moderately AT-rich genomes [94]. The main exception was the poor coverage in the extremely AT-rich genome of *Plasmodium falciparum* with a single 316 chip for the Ion Torrent PGM, which resulted in no coverage for 30% of the genome. In this study, PacBio generated the longest reads but produced the least accurate SNP detection and the highest error rate of 13%, compared to 1.78% for Ion Torrent and less than 0.04% for the Illumina platforms. In a different comparison, the whole-genome sequencing platforms Illumina's HiSeq 2000, Life Technologies' SOLiD 4 and 5500xl SOLiD, and Complete Genomics' sequencing system were evaluated for their ability to call SNVs and to evenly cover the genome and specific genomic regions [95]. The authors concluded that all the platforms had their shortfalls, with a pronounced GC bias in GC-rich regions and false-positive rates, and that the best solution is to integrate the sequencing data from the four different platforms because this combines the strengths of the different technologies.

In an analysis of CREBBP exons, three different NGS platforms performed comparably well for targeted exome sequencing, with 99.8% of Roche 454, 98.1% of Illumina MiSeq, and 90.7% of Ion Torrent PGM sequence reads aligning to a reference sequence [96]. However, the Illumina MiSeq data showed the highest substitution error rate, whereas the PGM data revealed the highest indel error rate. On the other hand, there was little difference between the Roche GS Junior and the Ion PGM platforms for "in phase" sequence genotyping of HLA loci, and either platform could be used with excellent results [16]. In this case, the lower cost of reagents and a slightly quicker turnaround time favored the Ion PGM platform [97].
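The arithmetic behind these quality and coverage figures is simple enough to sketch. The following Python snippet (an illustrative sketch, not taken from any of the studies cited) converts Phred quality scores into base-call error probabilities and estimates mean depth and the expected uncovered fraction under the standard Lander–Waterman model; the read count, read length, and genome size are invented example values.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q into the probability P that the
    base call is wrong: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def expected_depth(read_count: int, read_length: int, genome_size: int) -> float:
    """Mean coverage depth c = N * L / G (Lander-Waterman)."""
    return read_count * read_length / genome_size

def fraction_uncovered(depth: float) -> float:
    """Poisson approximation: the expected fraction of genome positions
    receiving zero reads is e ** (-depth)."""
    return math.exp(-depth)

# A Q30 base has a 1-in-1000 chance of being miscalled.
print(phred_to_error_prob(30))  # 0.001

# 20 million 150 bp reads over a 100 Mb genome give 30x mean depth,
# leaving a vanishingly small expected fraction of uncovered positions.
c = expected_depth(20_000_000, 150, 100_000_000)
print(c, fraction_uncovered(c))
```

Note that, as the text above observes, these idealized figures assume random sampling; real depth is uneven across GC-rich, AT-rich, and repetitive regions.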

16 Next Generation Sequencing - Advances, Applications and Challenges

Bioinformatics is a major rate-limiting step for NGS technology with respect to overcoming the growing challenges of storage, analysis, and interpretation of NGS data [98–100]. There are at least four tiers of nucleotide sequence analysis to consider when using the NGS platforms [98–104]. The first is the generation of sequence reads using the software integrated within the sequencing instruments, which converts the raw signals into base calls to produce short nucleotide sequence reads with associated quality scores. The second is the alignment and assembly of contigs and scaffolds and variant detection. The third is annotation, data integration, and visualization of the assembled sequence. The fourth is the amalgamation of all the data from the different NGS platforms into a single, coherent bioinformatic output with accessible links and tools for general and particular biological or forensic interest. The Internet addresses for sourcing the bioinformatics tools and databases for NGS data analysis, from the original raw sequencing data through to functional biology, can be obtained from the following references [99–104] and Table 3.

The raw sequencing signals produced by the manufacturer's sequencing machine or system are converted into nucleotide bases of short read data (base calling) with base quality scoring, using the system's FASTQ format or the native raw data file formats (Illumina, SFF, HDF5, CG, or SOLiD). Storage of raw signal (image) and sequencing data as short read archives in the FASTQ format or native raw data file formats is a problem in regard to computing resources for many research sequencing laboratories and commercial service providers. Thus, the conversion of FASTQ files to the more compact Sequence Alignment Map (SAM) format and its compressed Binary Alignment Map (BAM) format is recommended because these are easier to read and process in later bioinformatics analysis [99, 102]. The safe storage of the original raw sequences remains important for bioinformatics analysis and corrections because they are the source of the initial sequencing errors that are either filtered out or left within the final assembled sequence. Quality checks are necessary to remove reads with low Phred scores and sequence errors, as well as sequences such as primers, vectors, adapters, tags, and tails that were introduced experimentally during the preparation of the sequencing libraries [101]. Errors or biases associated with raw reads from the Illumina, Roche, and SOLiD platforms are mainly fluorophore-dependent errors, whereas the non-fluorophore platforms such as Ion Torrent produce their own unique errors and biases [99, 101]. Therefore, many different signal and image detection programs and base calling algorithms still need to be developed and tested in an attempt to improve the accuracy of base calling [101].
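A minimal illustration of this first-tier cleanup is sketched below, assuming the Phred+33 ASCII quality encoding used by current Illumina FASTQ files. The Q20 threshold is an arbitrary example value, and real quality-control tools also trim adapters and low-quality read tails rather than simply discarding whole reads.

```python
from statistics import mean

def parse_fastq(lines):
    """Yield (read_id, sequence, quality_string) from FASTQ text lines
    (four lines per record: header, sequence, '+', qualities)."""
    it = iter(lines)
    for header in it:
        seq, _plus, qual = next(it), next(it), next(it)
        yield header.rstrip()[1:], seq.rstrip(), qual.rstrip()

def mean_phred(qual: str) -> float:
    """Mean Phred score of a read, assuming Phred+33 encoding:
    ASCII code minus 33 gives the quality score."""
    return mean(ord(c) - 33 for c in qual)

def quality_filter(records, min_mean_q=20):
    """Keep only reads whose mean base quality reaches the threshold."""
    return [r for r in records if mean_phred(r[2]) >= min_mean_q]

fastq = ["@read1", "ACGT", "+", "IIII",   # 'I' = Phred 40: high quality
         "@read2", "ACGT", "+", "####"]   # '#' = Phred 2: junk
kept = quality_filter(parse_fastq(fastq))
print([r[0] for r in kept])  # ['read1']
```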
The raw sequence data (a mixture of raw files and other metadata) from the NGS technologies can be submitted to the NCBI Sequence Read Archive database for DNA studies and to Gene Expression Omnibus and ArrayExpress for mRNA-seq or ChIP-Seq studies in order to receive a database accession number and to reference the raw sequence data in scientific publications [105]. The Sequence Read Archive (SRA) at NCBI also provides a free, downloadable SRA toolkit to read the raw data files from the different NGS platforms and to convert between file formats (Table 3). Archive files in the SRA format (.sra) are converted into the FASTQ or SAM/BAM formats for input to downstream analysis, using software programs (Table 3) to undertake the second-tier analysis of sequence alignment (spliced and genomic), assembly, and variant detection.

The requirement for sequence alignment and variant detection at the second tier of bioinformatics depends on the complexity of the NGS project. Small sequence reads from small genomes (e.g., viruses) are less complex and easier to compute, align, and assemble than the many more reads generated from the large genomes of mammals or higher plants. The transfer of the pre-edited DNA data in the correct format to alignment and variant detection software is generally straightforward, and there are many free and commercial software packages available to perform these tasks [99–104]. As is often the case, a single package does not suit all analytical requirements, so a degree of interchange and testing may be needed to find the best solutions, along with appropriate and informative controls for standardization and normalization. Schlotterer et al. [104] have reviewed programs for genotype and SNP calling. ANGSD is a multithreaded program suite developed recently to perform association mapping, population genetic analyses (population structure measures, allele frequencies for cases and controls, admixture, and neutrality tests), SNP discovery, and genotype calling using the raw sequence data and genotype likelihoods in NGS data of human DNA samples for the 1000 Genomes Project [106].

The alignment of sequences to provide long assemblies (contigs and/or scaffolds) may take two different paths. One is comparative mapping of short reads aligned to reference sequences, and the other is *de novo* assembly of overlapping reads [101]. The accuracy of *de novo* assembly can be confirmed or improved by integrating it with comparative alignment mapping to reference genomic sequences. Sequence assemblers may employ different graph construction algorithms and preprocessing and postprocessing filter computations to flag, correct, or eliminate sequencing errors, with no single computational solution. Some genome assemblers forgo the preprocessing filter step, and assemblers differ in their ease of use; in the accuracy, efficiency, and quality of assembly; in their ability to fill gaps and to differentiate between error-driven variants and true variants or SNPs; and in the detection and elimination of repeats and sequencing errors [99]. According to El-Metwally et al. [99], an ideal assembler should be built as a set of layers with clearly defined inputs and output messages to facilitate the development of innovative, interactive, independent assemblers using the SAM/BAM file formats and the FASTG language (http://fastg.sourceforge.net) for the next-generation environment. Another way to improve the quality of sequencing and assembly is to adopt a hybrid approach by using two or more different sequencing platforms and assembly software packages. The software package *anytag*, which fills the gaps between paired-end reads to generate near-error-free contigs of up to 190 kb, appears to be a fivefold improvement over existing *de novo* genome assemblers such as *soap* and *Newbler* [107].
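The overlap-based path can be illustrated with a toy greedy assembler. This is a deliberately naive sketch with invented reads: production assemblers build overlap or de Bruijn graphs, correct sequencing errors, and handle repeats, none of which this toy attempts.

```python
def overlap_len(a: str, b: str, min_olap: int = 3) -> int:
    """Length of the longest suffix of a that is a prefix of b,
    ignoring overlaps shorter than min_olap."""
    for k in range(min(len(a), len(b)), min_olap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_olap=3):
    """Repeatedly merge the pair of reads with the largest overlap
    until no pair overlaps; returns the resulting contigs."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap_len(a, b, min_olap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_i is None:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for n, r in enumerate(reads)
                 if n not in (best_i, best_j)] + [merged]
    return reads

contigs = greedy_assemble(["ATTAGACCTG", "AGACCTGCCG", "TGCCGGAAT"])
print(contigs)  # ['ATTAGACCTGCCGGAAT']
```

Greedy merging is quadratic in the number of reads and chokes on repeats, which is precisely why the graph-based assemblers discussed above exist.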


In a recent Assemblathon evaluation of the most commonly used *de novo* genome assemblers on the genomes of three vertebrate species (snake, bird, and fish), the authors recommended not trusting the results of any single assembly, nor placing too much faith in a single metric of quality or accuracy, but instead choosing an assembler that excels in the area of interest and can be expected to provide sufficient coverage, continuity, and error-free bases [108]. End users were reminded that the use of assembly tools is not straightforward and that they should first gain considerable familiarity with the computing hardware and software and become aware of the "ease of installation, use, and management" of each assembly tool. Many problems inherent to *de novo* genome assembly remain, particularly in recognizing and evaluating highly heterozygous and repetitive regions, segmental duplications, and sequencing errors and gaps. This is complicated further by the different read lengths, read counts, and error profiles produced by the different NGS technologies. In addition, most assembled genomic sequences in publicly accessible databases are at the level of, or below, a standard draft (the minimum standard for submission to public databases) rather than a "high-quality draft" assembly that is completed to at least 90% of the expected genome size.
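Continuity metrics such as N50, which Assemblathon-style evaluations report alongside coverage and error measures, are straightforward to compute. The sketch below uses invented contig lengths purely for illustration.

```python
def n50(contig_lengths):
    """N50: the contig length L such that contigs of length >= L
    together cover at least half of the total assembly span."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Two assemblies with the same 300 bp total span but different contiguity:
print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
print(n50([300]))                          # 300
```

As the Assemblathon authors caution, a high N50 alone says nothing about correctness; a misassembled chimeric contig inflates N50 while degrading accuracy.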

The third tier of bioinformatics is to annotate, transcribe, and translate the genomic sequences to a higher informatics level, such as defining gene exon coding sequences (CDS) and noncoding (5′ noncoding, introns, and 3′ terminal end) untranslated regions (UTRs), alternate transcript isoforms, signal peptides, repeat elements, and other nontranscribed regions such as viral integration sites and chromosomal common fragile sites [103]. Genomic sequences of prokaryotes are roughly a thousand times smaller and less complex than those of eukaryotes and are consequently easier to assemble and annotate. A typical methodology for prokaryote annotation, suggested by the National Pathogen Data Resource for annotating 1000 genomes, is to first submit the genomic sequence to the Rapid Annotation using Subsystem Technology (RAST) server at the Argonne National Laboratory and receive back the protein-encoding genes (CDS), the RNA-encoding genes (tRNAs and rRNAs), and identified subsystems such as metabolic pathways, complex structures, and phenotypes (Table 3). This initial annotation should then be reanalyzed in detail to find discrepancies between the sequence and the translation, using any other public or commercial genomic tools to fix miscalled genes and variants, frameshifts, insertion sequences, and pseudogenes. The public web server CRISPRfinder detects and annotates the bacterial CRISPRs and tandem repeat sequences that may impact genes and pseudogenes (Table 3). After the reanalysis and final fixes, the annotated and curated genome should be rerun through RAST to update the subsystems output. Other useful web-based microbial annotation servers can be accessed at MicroScope, BASys, and NCBI's Prokaryotic Genome Annotation Pipeline (PGAP), with additional software provided by Prokka (Table 3). A typical prokaryotic genome annotation process is outlined at NCBI (http://www.ncbi.nlm.nih.gov/genome/annotation\_prok/process/).
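At its core, calling a prokaryotic CDS means finding open reading frames. The toy scanner below (an assumption-laden sketch, not how RAST or PGAP actually work) checks only the three forward-strand frames for an ATG followed by an in-frame stop codon; real annotators also scan the reverse strand, use alternative start codons, and weigh homology and gene-model evidence.

```python
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 3):
    """Naive forward-strand ORF scan. Returns (start, end) pairs where
    end is the index just past the stop codon; min_codons counts the
    codons from ATG up to (excluding) the stop."""
    orfs = []
    seq = seq.upper()
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == START:
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOPS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        i = j  # resume scanning after this ORF
                        break
            i += 3
    return orfs

#      ATG AAA CCC TAA -> one ORF spanning indices 2..14
print(find_orfs("GG" + "ATGAAACCCTAA" + "GG"))  # [(2, 14)]
```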

Eukaryote genome annotation is more complex and challenging than prokaryote genome annotation. In an overview of the available tools and best practices for eukaryotic genome annotation, Yandell and Ence [103] pointed to five basic categories of annotation software: (1) *ab initio* and evidence-driven gene predictors; (2) EST, protein, and RNA-seq aligners and assemblers; (3) choosers and combiners; (4) genome annotation pipelines; and (5) genome browsers for curation. A typical eukaryotic genome annotation pipeline is outlined by NCBI at http://www.ncbi.nlm.nih.gov/genome/annotation\_euk/process/. The essential first step for eukaryote genome annotation and gene determination is to identify and mask repeat elements (microsatellites, retrotransposons, and transposons) using RepeatMasker, Censor, or WindowMasker (Table 3). Without this initial masking step, the repeats would seed millions of spurious BLAST alignments, create incorrect gene annotations, and corrupt the genome annotation with artifacts and false metadata. After masking, the annotation pipeline includes the following steps: transcript, RNA-seq read, and protein/domain alignments; guided/*ab initio* gene model predictions; curated genomic sequence alignments; selection of the best evidence-based models; gene naming and locus typing; assignment of GeneIDs; and annotation of small RNAs. In addition, there are special considerations such as the annotation of multiple assemblies and updated assemblies before the annotated products can obtain an Annotation Release number and a release date for availability in various NCBI resources, including the databases for nucleotides, proteins, BLAST, Gene, Map Viewer, and FTP sites. Other websites and tools considered important for eukaryote annotation are BUSCO for assessing the "core" eukaryote genes, Babelomics for the functional analysis of transcriptomic and genomic data, the PASA and MAKER tools for updating annotations with RNA-seq data, and other data and information (Table 3). The annotated and mapped data can then be integrated, visualized, and presented at the fourth tier of bioinformatics with genome browsers such as those displayed at UCSC, Ensembl, JBrowse, and Web Apollo (Table 3), and others such as Genome Maps [109].
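The masking step itself can be illustrated in miniature. RepeatMasker and its peers conventionally soft-mask repeats by lowercasing them so that downstream aligners can skip seeding in those regions; the toy below lowercases exact occurrences of motifs from a made-up repeat library, whereas the real tools align against curated repeat libraries such as Repbase.

```python
def soft_mask(seq: str, repeat_library) -> str:
    """Lowercase every occurrence of a known repeat motif (soft-masking),
    leaving the rest of the sequence in uppercase."""
    masked = list(seq)
    for motif in repeat_library:
        start = 0
        while (hit := seq.find(motif, start)) != -1:
            for k in range(hit, hit + len(motif)):
                masked[k] = masked[k].lower()
            start = hit + 1  # keep scanning for overlapping hits
    return "".join(masked)

# A toy library with one microsatellite-like motif.
print(soft_mask("TTACACACACGGA", ["ACACACAC"]))  # 'TTacacacacGGA'
```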
The new Ensembl 2015 release provides an up-to-date genomic interpretation system with annotations, query tools, and access methods for chordates and key model organisms [110].

Gene Ontology is a bioinformatics initiative that provides (a) defined terms representing gene product properties and pathways covering biological domains such as cellular components, molecular functions, and biological processes, with their various subcategories, and (b) functional annotation tools to find functions for large gene lists. It sits somewhere between the third tier (annotation) and the fourth tier of bioinformatic analyses and structures. The first major Gene Ontology (GO) project was founded in 1998 to address the need for consistent descriptions of gene products across different databases. GO is a collaborative effort that started between three model organism databases, FlyBase (*Drosophila*), the *Saccharomyces* Genome Database (SGD), and the Mouse Genome Database (MGD), but now incorporates many databases for plant, animal, and microbial genomes. The GO Contributors page lists all member organizations (http://geneontology.org/page/go-consortium-contributors-list). Some other ontology providers among many are the Open Biological and Biomedical Ontologies (OBO), Reactome, DAVID, and the KEGG Pathway database (Table 3).
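The functional annotation tools mentioned above typically ask whether a GO term is over-represented in a gene list, most often with a hypergeometric (one-sided Fisher) test. A minimal stdlib-only sketch follows; the gene counts are invented, and real tools such as DAVID additionally correct for multiple testing across thousands of terms.

```python
from math import comb

def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= k): the chance of seeing
    at least k term-annotated genes in a list of n drawn from a universe
    of N genes, K of which carry the term."""
    return sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(n, K) + 1)
    ) / comb(N, n)

# 5 of 10 list genes carry the term, versus 50 of 1000 genome-wide:
# far more than the ~0.5 expected by chance.
p = hypergeom_pvalue(k=5, n=10, K=50, N=1000)
print(f"p = {p:.2e}")  # strongly over-represented (p << 0.01)
```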

NGS manufacturers provide their own unique software for the first-tier analysis to process the raw acquisition data and produce read files that contain high-quality consensus reads for the draft assemblies. However, only a few have attempted to integrate all three tiers of nucleotide sequence analysis into a fourth tier that is an easily accessible, single integrated package. Illumina provides the BaseSpace genomics cloud-computing program for integrated data storage and analysis (Table 3). This cloud storage and analysis program permits instrument integration with sequence analysis viewing and access to a wide range of software applications to align, assemble, and analyze reads and variants for RNA and DNA. These apply to various workflows, including basic analyses for prokaryotic and eukaryotic genomics and transcriptomics and for metagenomics, as well as more specialist interests such as the detection and analysis of tumor variants, haplotype analysis, pathways and networks, forensic profiles, and many others too numerous to list here. In comparison, Ion Torrent provides storage devices and servers, with the Torrent Suite Software (Table 3) driven through a web browser on computers attached to the sequencing instruments. The manufacturer's software can be used to preprocess the DNA sequencing read data before transferring the pre-edited data to other analytical software systems that are either provided by the manufacturer (vendor) or obtained elsewhere. The National Center for Biotechnology Information (NCBI) is an example of a fourth-tier bioinformatics provider (Table 3) that is a free, one-stop shop for DNA and RNA sequence data, analysis, and information. There are direct links at NCBI to 65 accessible databases, 35 download sites (for databases, tools, and utilities), 17 public submission portals, and 60 computing tools for sequence and data analysis, reports, and tutorials.
In addition, NCBI is a resource for books and journals through its online library and the PubMed webpage.
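Programmatic access to these NCBI resources goes through the Entrez E-utilities web API. The sketch below only constructs an `efetch` request URL rather than performing the download, so it runs offline; the accession shown, phiX174 (NC_001422.1, a common Illumina sequencing control), is simply a familiar example.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def efetch_url(db: str, accession: str, rettype: str = "fasta") -> str:
    """Build an NCBI E-utilities efetch request URL; pass it to any
    HTTP client (or paste it into a browser) to retrieve the record."""
    query = urlencode({
        "db": db,            # e.g. 'nuccore' for nucleotide records
        "id": accession,
        "rettype": rettype,  # 'fasta', 'gb', ...
        "retmode": "text",
    })
    return f"{EUTILS}/efetch.fcgi?{query}"

# Fetch the phiX174 genome as FASTA.
print(efetch_url("nuccore", "NC_001422.1"))
```

NCBI asks heavy users to register an API key and throttle requests; the toolkits and portals in Table 3 wrap this same API.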





| Program | Website |
|---|---|
| *Prokaryote annotation web servers* | |
| CRISPRfinder | http://crispr.u-psud.fr/Server/CRISPRfinder.php |
| MicroScope | https://www.genoscope.cns.fr/agc/microscope/home/index.php |
| BASys | https://www.basys.ca |
| PGAP | http://www.ncbi.nlm.nih.gov/genome/annotation\_prok/ |
| Prokka | http://www.vicbioinformatics.com/software.prokka.shtml |
| Mreps | http://bioinfo.lifl.fr/mreps/mreps.php |
| *Eukaryote annotation web servers* | |
| NCBI pipeline | http://www.ncbi.nlm.nih.gov/genome/annotation\_euk/process/ |
| RepeatMasker | http://www.repeatmasker.org/ |
| Censor | http://www.girinst.org/censor/ |
| WindowMasker | http://nebc.nerc.ac.uk/bioinformatics/docs/windowmasker.html |
| CEGMA tool | http://korflab.ucdavis.edu/datasets/cegma/ |
| BUSCO | http://busco.ezlab.org |
| PASA | http://pasapipeline.github.io |
| MAKER | http://www.yandell-lab.org/software/maker.html |
| Babelomics | http://www.babelomics.org |
| *Archives and databases* | |
| GenBank | http://www.ncbi.nlm.nih.gov/genbank/ |
| DDBJ | http://www.ddbj.nig.ac.jp |
| EMBL | http://www.embl.org |
| SRA | http://www.ncbi.nlm.nih.gov/sra |
| dbSNP | http://www.ncbi.nlm.nih.gov/projects/SNP/snp\_summary.cgi |
| dbGaP | http://www.ncbi.nlm.nih.gov/gap |
| OMIM | http://www.ncbi.nlm.nih.gov/omim |
| COSMIC | http://cancer.sanger.ac.uk/cosmic |
| REPBASE | http://www.girinst.org |
| Complete Genomics data | http://www.completegenomics.com/public-data/ |
| ENCODE | https://www.encodeproject.org |
| GTEx | http://www.gtexportal.org |
| FANTOM | http://fantom.gsc.riken.jp |
| Roadmap epigenomics | http://www.roadmapepigenomics.org |
| Blueprint epigenomics | http://www.blueprint-epigenome.eu |
| RegulomeDB | http://regulomedb.org |

**Table 3.** Useful websites for NGS tools, browsers, portals, providers, and online databases.
