### **2.1 The sampling process and library construction for metagenomic analysis**

Metagenomic analysis is a multistep process in which sampling is crucial for all downstream applications. Sample collection, preparation, and storage should be handled carefully to prevent lysis and decomposition of the sample components. Multiple freeze–thaw cycles may alter the microbial community profile under investigation [10]. In addition, a suitable DNA extraction protocol should be adopted to cope with the different chemical and physical characteristics of each sample. For instance, soils contain many substances, such as humic and fulvic acids, that are co-extracted with the genomic DNA and may inhibit downstream experiments [11]. Therefore, different extraction methods usually need to be compared and optimized for each sample type [12, 13, 14, 15].

The extracted DNA is used to construct the DNA library. This is usually achieved by ligating specific adaptors to one or both ends of the DNA fragments [16]. The adaptors allow samples to be pooled and each fragment to be traced back to its original sample. DNA should be handled carefully at this stage to avoid chemical, physical, or enzymatic damage to the DNA molecules [17]. A DNA library is usually constructed through one of two approaches. The first, called mate-pair, yields libraries with long fragment inserts. The second, called paired-end, yields libraries with short fragment inserts. In both approaches, the DNA is fragmented into sizes that allow cloning, and the resulting fragments are cloned into a suitable cloning vector. The fragment size determines which vector is suitable: small DNA fragments are usually cloned into plasmid vectors, fragments up to 40 kbp are cloned into cosmid or fosmid vectors, and bacterial artificial chromosome (BAC) vectors are usually used for inserts larger than 40 kbp [18]. Finally, free adaptors, adaptor dimers, and any other artifacts must be removed to avoid noisy sequencing data [17].
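The role of adaptors in tracing pooled fragments back to their samples can be illustrated with a minimal demultiplexing sketch. It assumes that each adaptor contributes a fixed-length barcode at the 5′ end of the read; the barcode table, the sample names, and the function `demultiplex` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample table (assumption for illustration only).
BARCODES = {"ACGT": "soil_A", "TGCA": "soil_B"}


def demultiplex(reads, barcodes=BARCODES):
    """Group pooled reads by their 5' barcode and strip the barcode."""
    barcode_len = len(next(iter(barcodes)))
    by_sample = defaultdict(list)
    for read in reads:
        prefix, insert = read[:barcode_len], read[barcode_len:]
        sample = barcodes.get(prefix)
        if sample is None:
            # Reads with an unknown barcode are kept whole, not discarded.
            by_sample["unassigned"].append(read)
        else:
            by_sample[sample].append(insert)
    return dict(by_sample)


pooled = ["ACGTGGGTTTAA", "TGCACCCAAATT", "ACGTAAAA", "NNNNGGGG"]
result = demultiplex(pooled)
print(result)
```

Reads whose barcode is not recognized are collected under `unassigned` so that they can be inspected, since they may stem from adaptor artifacts or sequencing errors.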

### **2.2 Sequencing approaches**

During the 1970s, the first-generation sequencing techniques, the chain-termination [19] and chemical sequencing [20] approaches, were developed. Of the two, the Sanger chain-termination method ultimately prevailed and found widespread application because it is simpler and more amenable to being scaled up [21]. In brief, Sanger sequencing depends on the incubation of a specific primer and the template DNA in the presence of DNA polymerase. The reaction is carried out by adding a mixture of deoxyribonucleotide triphosphates (dNTPs) and chain-terminating dideoxyribonucleotide triphosphates (ddNTPs), one of which is labeled with phosphorus-32. The resulting pool of DNA fragments shares the same 5′ residue but ends in different ddNTP residues at the 3′ end (**Figure 1**). This pool of DNA fragments is then fractionated by denaturing polyacrylamide gel electrophoresis, giving a band pattern. In this way, DNA decoding can be achieved

*High-Throughput Sequencing and Metagenomic Data Analysis DOI: http://dx.doi.org/10.5772/intechopen.89944*

#### **Figure 1.**

*Sanger DNA sequencing. (1) The gene to be decoded is amplified by PCR. (2) The sequencing process is performed by the addition of modified 2′,3′-dideoxynucleotides (ddNTPs) to the nascent chain. The modified nucleotides terminate chain extension, and the resulting DNA fragments of different sizes are separated by capillary gel electrophoresis. (3) Chromatograms are then analyzed to obtain the DNA sequences.*

by the use of nucleotide analogs and other nucleotides in separate incubations and concomitant electrophoretic analysis [22]. Currently, the use of fluorescently labeled ddNTPs together with capillary electrophoresis allows full automation of the Sanger approach. This modification allows up to 96 sequences to be retrieved per run, with an average read length of 800–1000 bp [21, 23, 24]. Although Sanger sequencing was the mainstay of the original Human Genome Project, it has some limitations: high cost, low throughput, and inadequacy for studying unculturable organisms in complex environments [25].
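The logic of chain termination can be sketched in a few lines of Python. This is an idealized model and not a description of any instrument software: it assumes that termination occurs once at every position, that the primer itself is omitted, and that fragments are resolved perfectly by size. The function names are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def terminated_fragments(template):
    """Return every chain-terminated fragment of the newly made strand.

    All fragments share the same 5' end and differ in the position at
    which a ddNTP stopped the extension.
    """
    new_strand = "".join(COMPLEMENT[base] for base in reversed(template))
    return [new_strand[:length] for length in range(1, len(new_strand) + 1)]


def read_by_size(fragments):
    """Order fragments from shortest to longest and read each 3' terminal base."""
    ordered = sorted(fragments, key=len)
    return "".join(fragment[-1] for fragment in ordered)


template = "ATGCAGGT"                      # template strand, written 5'->3'
fragments = terminated_fragments(template)
random.Random(0).shuffle(fragments)        # the reaction tube holds them unordered
decoded = read_by_size(fragments)
print(decoded)                             # ACCTGCAT, the strand complementary to the template
```

Sorting by length plays the role of electrophoresis: once the fragments are ordered by size, the identity of the terminal base of each fragment spells out the synthesized strand.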

#### *2.2.1 Next-generation sequencing (NGS)*

Owing to the limitations of the Sanger sequencing technique, next-generation sequencing emerged in 2005 [26]. Next-generation sequencing has made it possible to study and identify organisms directly from their habitats without prior preparation [27]. Compared with first-generation sequencing, NGS can generate several hundred thousand to millions of sequencing reads in parallel. In addition, sequences can be generated without some conventional steps such as vector-based cloning, which reduces the chance of DNA contamination from other organisms [28]. Several next-generation sequencing platforms have been introduced, including Roche 454, Illumina®, the Applied Biosystems SOLiD sequencer, and Ion Torrent. Three of these platforms (Roche 454, Illumina®, and AB SOLiD) use optical sensors to detect the light signals produced during the incorporation of bases into the sequence. The principles and characteristics of first-generation sequencing, second-generation sequencing (SGS), and third-generation sequencing (TGS) are summarized in **Table 1** [21]. In the subsequent sections, the features and limitations of each NGS technique are discussed.

#### *2.2.1.1 Roche 454 genome sequencer*

Roche/454 pyrosequencing was the first NGS technology to be launched, becoming commercially available in 2005. It uses real-time sequencing-by-synthesis


#### **Table 1.**

*The features and principles of first-generation sequencing, SGS, and TGS.*

(SBS) pyrosequencing technology, which depends on the detection of the pyrophosphate (PPi) molecule released when DNA polymerase incorporates a nucleotide (**Figure 2**) [29]. Briefly, 454 pyrosequencing proceeds as follows: (i) the library fragments are attached to beads that carry oligonucleotides complementary to the adapter sequences ligated at the fragment ends; (ii) the library fragments are amplified by emulsion PCR, resulting in beads that carry millions of copies of a DNA fragment on their surface; and (iii) the amplified beads are loaded into a picotiter plate (PTP) that consists of millions of wells. Each well can hold only one amplified DNA bead, packed together with pyrosequencing enzyme beads and PPiase beads. Finally, the light emission from the PTP is recorded by a CCD camera and translated into nucleotide sequences [29]. In comparison with other NGS platforms, 454 pyrosequencing has


#### **Figure 2.**

*Pyrosequencing technique. (1) Beads are coated with either streptavidin or oligonucleotides complementary to the adapter sequences attached to the ends of the fragments to be sequenced, which allows the fragments to bind to the beads. (2) The fragments to be sequenced are amplified through emulsion PCR. (3) Loaded beads are transferred into the sequencing plate, which contains millions of wells. (4) When DNA polymerase adds a nucleotide to the nascent chain attached to the bead, ATP sulfurylase converts the released pyrophosphate to ATP, which drives the emission of light that is detected by a CCD camera and translated into nucleotide sequences.*

the longest read length (up to 1000–1200 bp). On the other hand, it also has the highest cost per base and the lowest output [30].
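How light emission is translated into a nucleotide sequence can be illustrated with an idealized flowgram. The sketch below assumes that nucleotides are flowed over the plate one at a time in a fixed cyclic order and that the light signal is exactly proportional to the number of bases incorporated in that flow; the flow order `TACG`, the noise-free signal, and the function names are assumptions made for illustration.

```python
def flowgram(strand, flow_order="TACG", max_flows=400):
    """Simulate idealized light signals for one bead.

    strand: the sequence being synthesized, written 5'->3'.
    Returns a list of (nucleotide, signal) pairs, where signal is the
    number of bases incorporated during that flow (0 means no light).
    """
    signals = []
    position = 0
    flow = 0
    while position < len(strand) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        # A run of identical bases is incorporated within a single flow.
        while position < len(strand) and strand[position] == nucleotide:
            incorporated += 1
            position += 1
        signals.append((nucleotide, incorporated))
        flow += 1
    return signals


def call_bases(signals):
    """Translate signal intensities back into a nucleotide sequence."""
    return "".join(nucleotide * count for nucleotide, count in signals)


signals = flowgram("TTACGG")
print(signals)              # [('T', 2), ('A', 1), ('C', 1), ('G', 2)]
print(call_bases(signals))  # TTACGG
```

In this model a run of identical bases produces a single, stronger signal instead of several separate ones, so base calling requires the intensity to be measured and not merely detected.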

#### *2.2.1.2 Illumina sequencing (Solexa genome analyzer)*

Illumina sequencing, formerly known as Solexa, was introduced commercially in 2007. Illumina technology uses bridge PCR amplification coupled with SBS in a flow cell (**Figure 3**). In brief, DNA fragments carrying barcoded adaptors are attached to the flow cell, and the sequencing reaction is performed in the flow cell by adding labeled nucleotides. When a nucleotide is incorporated, a fluorescent signal is generated and recorded by optical sensors. The fluorescent label is then removed, and the next labeled nucleotide is incorporated. A DNA fragment can be sequenced from one end, called single-end (SE) sequencing, or from both ends, called paired-end (PE) sequencing. PE is now the most common mode because it generates two reads per DNA fragment, which is useful for determining the distance between the two ends of the fragment [31]. Owing to its low cost per base and high yield, Illumina has become the most widely used NGS platform. Its output is the highest among all NGS platforms, making it suitable for multiplexing hundreds of samples at the same time [32].
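The difference between SE and PE reads can be made concrete with a small sketch. It assumes error-free reads of a fixed length, with the second read of a pair taken from the opposite strand; the fragment, the read length, and the function names are hypothetical.

```python
# Translation table for complementing a DNA string.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the opposite strand, written 5'->3'."""
    return sequence.translate(_COMPLEMENT)[::-1]


def single_end(fragment, read_len):
    """Read the fragment from one end only."""
    return fragment[:read_len]


def paired_end(fragment, read_len):
    """Read the fragment from both ends; read 2 comes from the opposite strand."""
    return fragment[:read_len], reverse_complement(fragment)[:read_len]


def inner_distance(fragment_len, read_len):
    """Unsequenced gap between the two reads (negative if the reads overlap)."""
    return fragment_len - 2 * read_len


fragment = "AACCGGTTAC"
read1, read2 = paired_end(fragment, read_len=3)
print(read1, read2)                          # AAC GTA
print(inner_distance(len(fragment), 3))      # 4
```

Because both reads originate from one fragment of roughly known size, the expected distance between them constrains where the pair can be placed during alignment or assembly.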

#### *2.2.1.3 Applied Biosystems (AB) SOLiD sequencer*

AB SOLiD stands for sequencing by oligonucleotide ligation and detection. It was developed by Applied Biosystems (Life Technologies) and became commercially available in 2007. The AB SOLiD sequencing approach differs from the other

#### **Figure 3.**

*Illumina/Solexa sequencing approach. (1) The DNA templates with the attached adapter sequences bind to a glass surface coated with complementary oligonucleotides. (2, 3, 4) DNA molecules fold over into a bridge shape, and bridge PCR amplification is applied. (5) Bridge amplification produces millions of copies (cluster formation). (6) Clusters are sequenced by the cyclic reversible termination method. Finally, the resulting reads (tens of millions) are analyzed, and the DNA sequence is recorded.*

#### **Figure 4.**

*Applied Biosystems (AB) SOLiD sequencing approach. (1) A DNA library is prepared from the sample and ligated to specific adaptors; beads are coated with sequences complementary to one of the adapter sequences. (2) The adapter sequences bind to their complementary sequences on the beads. (3) Hybridization results in the attachment of millions of DNA sequences to the bead. (4) Unloaded beads are removed, and loaded beads are selected. (5) An interrogation probe contains six universal bases and a two-base encoded probe. The universal bases are attached to the fluorescent label. (6) When an interrogation probe is ligated to the primer by DNA ligase, fluorescent light is generated and detected. This process is repeated several times until the targeted DNA is completely sequenced.*

two next-generation sequencing technologies, Illumina and 454 pyrosequencing. The AB SOLiD platform relies on sequencing-by-oligo-ligation (SBL) (**Figure 4**), whereas the others rely on sequencing-by-synthesis (SBS) [33]. In the SOLiD sequencer,


the DNA library is prepared from the sample, specific adaptors are ligated, and the library is then amplified by emPCR [34]. Instead of DNA polymerase, DNA ligase is used to join short labeled oligonucleotides known as interrogation probes. Each interrogation probe contains six universal bases and a two-base encoded probe, and the universal bases are attached to the fluorescent label. When an interrogation probe is ligated to the primer by DNA ligase, fluorescent light is generated and detected. The 5′ end, which is linked to the fluorescent label by a cleavable linkage, is then cleaved and removed so that the next interrogation probe can be ligated. This process is repeated several times until the targeted DNA is completely sequenced. The read length of SOLiD is short (about 85 bp), which makes read assembly less accurate, and sequencing takes more time; however, SOLiD has the highest accuracy among NGS platforms [35]. Applications of SOLiD include whole-genome sequencing, targeted sequencing, and transcriptome and epigenome analysis [35].
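The idea behind two-base encoding can be sketched as follows. The text does not give the code table, so the sketch assumes the commonly described four-color scheme in which each pair of adjacent bases maps to one of four colors and the first base of the read is known from the adaptor. The numeric color labels and the function names are illustrative assumptions.

```python
# Two-bit codes chosen so that the color of a base pair is the XOR of its codes.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(sequence):
    """Convert a base sequence into one color (0-3) per adjacent base pair."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(sequence, sequence[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from the known first base and the colors."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


colors = encode_colors("ATGGC")
print(colors)                          # [3, 1, 0, 3]
print(decode_colors("A", colors))      # ATGGC

# A single-base change in the DNA alters two adjacent colors.
snp_colors = encode_colors("ATCGC")
print(snp_colors)                      # [3, 2, 3, 3]
```

Under this scheme every base is interrogated twice, so a true single-base variant changes two adjacent colors, whereas an isolated color change is more likely a measurement error. This redundancy is one way to understand the high accuracy attributed to the platform.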

#### *2.2.1.4 Ion Torrent sequencing*

Ion Torrent was launched in 2010 by Life Technologies. Some authors have classified the Ion Torrent platform as a technique between next-generation and third-generation sequencing. This can be attributed to the fact that the approach does not depend on optical sensors. Instead, it relies on chemical sensors that detect the change in hydrogen-ion concentration that occurs when a nucleotide is incorporated into the sequence [21]. Ion Torrent sequencing quality is high and stable because a chemical sensor is used instead of fluorescence and a camera. In addition, the Ion Torrent approach is characterized by its high speed and low cost compared with pyrosequencing and Illumina [35].

#### *2.2.2 Third-generation sequencing*

The major limitations of NGS are the short read length and the PCR bias, which are introduced by clonal amplification and fluorescence-based signal detection [21]. Third-generation sequencing, or single-molecule sequencing (SMS), technologies overcome these limitations by dispensing with PCR before sequencing and by capturing the signal in real time while monitoring the enzymatic reaction [21, 36]. The following sections discuss some TGS platforms.
