
Researchers have a continuing interest in improving this technique, since its high accuracy makes it suitable for clinical investigation: for example, in patients with genetic or somatic mutations. LEA-seq can assist in the search for knowledge about the intestinal microbiota, as it may reveal their composition, opening up prospects for the diagnosis, treatment and prevention of gastrointestinal tract diseases.

**3.7. CRISPR (clustered regularly interspaced short palindromic repeats)**

Ishino et al. (1987) were the first to describe CRISPR [55]. These systems, defined as short repetitions of grouped bases, have so far been identified in 40% of bacterial genomes [56]. The determination of the CRISPR locus and the characterization of the adjacent genes responsible for CRISPR function, known as *cas* genes, only occurred in 2002 [57]. The CRISPR/Cas system uses small non-coding RNAs in association with Cas proteins. Cas9 is a nuclease that cleaves DNA in the selected region, so the CRISPR/Cas9 system can be used to edit genomes.

CRISPR/Cas activity involves three main mechanisms: (1) acquisition, the step in which a DNA fragment is inserted into the CRISPR locus in the genome of interest; (2) transcription, in which the CRISPR locus is transcribed and processed; and (3) interference, in which invading nucleic acids are cleaved. All these mechanisms contribute to bacterial persistence in the environment [58, 59]. Furthermore, CRISPR provides mechanisms to limit the spread of antibiotic resistance and virulence factors. However, Gophna et al. (2015) demonstrated that, even though there are different measures for evaluating horizontal gene transfer, it is not possible to identify a correlation between the CRISPR/Cas system and the evolution of the species; changes occur only at the population level [60].

RNA-seq has helped in the annotation of transcribed regions, mainly non-coding ones, and has also enabled the identification of CRISPR elements in prokaryotes [61]. The CRISPR system can also be used as a tool in studies centered on gene regulation, since this system is able to activate or repress genes.

Zoephel and Randau (2013) discuss how the structure of CRISPR can affect the maturation of RNA and thus influence the functionality of the CRISPR/Cas system [62]. The RNA-seq approach was used to evaluate differential gene expression in *S. aureus*, a pathogen of major importance. It was able to identify the CRISPR regions in these strains and helped in investigating their possible role, since these regions show an adaptive response to infection [63]. Thus, we see the importance of the RNA-seq approach in expanding knowledge about CRISPR function in prokaryotes.

**4. RNA Sequencing Platforms**

The RNA-seq approach can be applied to different next-generation sequencing platforms, and the results obtained are proportional to the machine capability. In Table 3, a comparison is made with some of the platforms currently most employed [64].



216 Next Generation Sequencing - Advances, Applications and Challenges





**5. Bioinformatics Analysis**

Experimental investigations in prokaryotes have been facilitated, extended and complemented by computational approaches [65]. Large amounts of data are generated by RNA-seq experiments and need to be stored and analyzed using computational techniques and tools [66]. This volume of data has become a bottleneck for bioinformatics analysis and for biologists, since today's transcriptome analysis consists of both experiments and data evaluation [65]. Extracting biological information from RNA-seq datasets requires bioinformatics knowledge and tools, making the choice of software an important issue for successful RNA-seq analysis [65, 67].

According to Chierico et al. (2015) [68] and Pinto et al. (2011) [67], RNA-seq can be understood as a five-step process: (1) isolation of the total RNA of the organism; (2) mRNA enrichment; (3) synthesis of cDNA; (4) NGS sequencing, which produces the raw data for (5) bioinformatics analysis [67]. A flowchart of this process can be seen in Figure 4.

**Figure 4.** RNA-seq five-step process.

This section focuses on bioinformatics analysis and the computational tools available. Based on a literature review [29, 65, 67–69], bioinformatics analysis can be understood as the extraction and classification of biological information gleaned from the raw sequencing data (Figure 5).

**Figure 5.** Bioinformatics analysis workflow.

**5.1. Bioinformatics workflow**

The quality check step aims to increase the accuracy of the results by removing sequences that may contain errors [70], trimming sequences introduced in the library preparation step, such as adapters and poly(A) tails [71], and removing reads with low Phred quality. Poor-quality input data can lead to less precise results [72]; consequently, the quality check can drastically affect the next steps.

Some RNA-seq pipelines, such as READemption [71], implement a quality check that performs quality trimming, removes adapters and poly(A) tails and discards reads shorter than a given cut-off (the default cut-off is 12 nucleotides (nt)). Quality assessment [72] evaluates the quality based on quality-graph analysis and estimated coverage. According to Backofen et al. (2014) [65], FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is a tool commonly used to check read quality and to determine the quality profile of the reads. Software suites can also be used for this purpose; the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) provides tools to remove sequences attached in previous steps and to perform other preprocessing strategies on raw data.
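The trimming logic that such pipelines implement can be sketched in a few lines. The snippet below is an illustrative sketch only, not the actual READemption or FASTX-Toolkit code; the function names, the Phred+33 encoding assumption and the quality cut-off of 20 are our own choices, while the 12-nt minimum length follows the default cited above.

```python
# Illustrative sketch of read quality trimming (not actual pipeline code).
# Assumptions: Phred+33-encoded FASTQ qualities; cut-offs chosen for
# illustration, with the 12-nt minimum taken from the default cited above.

MIN_LENGTH = 12   # discard reads shorter than this after trimming
MIN_QUALITY = 20  # trim 3' bases with a Phred score below this

def phred_score(ch):
    """Decode a Phred+33 quality character to an integer score."""
    return ord(ch) - 33

def trim_read(seq, qual):
    """Quality-trim the 3' end, strip a poly(A) tail and apply the
    length cut-off; return (seq, qual) or None if the read is discarded."""
    # 1. trim low-quality bases from the 3' end
    end = len(seq)
    while end > 0 and phred_score(qual[end - 1]) < MIN_QUALITY:
        end -= 1
    seq, qual = seq[:end], qual[:end]
    # 2. strip a poly(A) tail introduced during library preparation
    stripped = seq.rstrip("A")
    qual = qual[:len(stripped)]
    seq = stripped
    # 3. enforce the minimum-length cut-off
    if len(seq) < MIN_LENGTH:
        return None
    return seq, qual
```

Applied to a read whose last four bases have Phred score 0, the function trims those bases first and only then tests the length cut-off, mirroring the order in which real trimmers operate.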

After the quality check, if a reference genome is available, a mapping step is performed; otherwise, *de novo* assembly. Mapping consists of producing the transcriptome map by aligning reads to a reference genome [67]. This aims to detect the correct position of each read and to distinguish between sequencing errors and genetic variations [73]. Abundant mapping software has been released, differing in algorithms, memory management, speed and computational cost [65], which makes the choice of a mapping tool a challenge. McClure et al. (2013) [69] compared the SOAP2, BWA, Bowtie and Bowtie2 aligners using data from 75 RNA-seq experiments; a comparison of mapping algorithms applied to IonTorrent data can be seen in [73]. After mapping, the quality of the result should be evaluated: the ReadXplorer software offers quality classification of read mappings in order to provide information about the quality and quantity of each single read mapping [74]. This approach is recommended when a high-quality genome is available as a reference; if one is unavailable, transcripts should be assembled *de novo* [29].
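Conceptually, mapping reduces to locating each read in the reference and classifying the outcome. Real aligners such as BWA and Bowtie use compressed genome indexes and tolerate mismatches; the toy sketch below, with names of our own invention, shows only the exact-match idea and the unique/multi-mapped/unmapped distinction that downstream filters rely on.

```python
def map_read(reference, read):
    """Return all 0-based positions where `read` occurs exactly in
    `reference` (a toy stand-in for index-based alignment)."""
    hits, start = [], reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits

def classify(hits):
    """Classify a read's mapping result the way quality filters do."""
    if not hits:
        return "unmapped"  # sequencing error, contamination or missing locus
    return "unique" if len(hits) == 1 else "multi-mapped"
```

For example, the read `ACGT` maps twice to the reference `AAACGTACGTTT`, so it would be flagged as multi-mapped and typically handled specially (discarded or down-weighted) in quantification.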

*De novo* assembly can be used when investigating poorly studied organisms [14], complex microbial communities or uncultivable organisms [29]. Both DNA and RNA must be assembled, but transcriptome assembly is significantly different from genome assembly [75]; thus, it is important to use RNA assemblers. Tjaden (2015) [29] affirms that assemblers should be specifically designed for prokaryotes, owing to the different challenges of eukaryotic and prokaryotic transcriptomes. Bacterial genomes are often denser than eukaryotic genomes, given the proximity of their genes; neighbouring bacterial transcripts can overlap, making it difficult to identify transcript boundaries appropriately. Moreover, non-coding eukaryotic RNA models are not appropriate for detecting bacterial small regulatory RNAs [29]. An assembly comparison of three different software packages (Trinity, SOAPdenovo2 and Rockhopper 2), using data from nine different bacteria, can be seen in [29].
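To make the assembly idea concrete, the sketch below implements a tiny greedy overlap assembler: it repeatedly merges the pair of reads with the largest suffix-prefix overlap. Production assemblers such as Trinity or Rockhopper 2 use graph-based methods and handle sequencing errors and uneven coverage; this is a didactic caricature with hypothetical names and an arbitrary minimum overlap.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    best = 0
    for k in range(min_len, min(len(a), len(b)) + 1):
        if a.endswith(b[:k]):
            best = k
    return best

def greedy_assemble(reads):
    """Merge the best-overlapping pair of reads until no overlaps remain;
    return the longest resulting contig."""
    reads = list(reads)
    while len(reads) > 1:
        best_o, best_i, best_j = 0, None, None
        for i in range(len(reads)):
            for j in range(len(reads)):
                if i != j:
                    o = overlap(reads[i], reads[j])
                    if o > best_o:
                        best_o, best_i, best_j = o, i, j
        if best_o == 0:
            break  # disconnected reads: transcripts stay fragmented
        merged = reads[best_i] + reads[best_j][best_o:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return max(reads, key=len)
```

Three overlapping reads are stitched into one contig; the minimum-overlap parameter plays the same anti-noise role that k-mer length plays in graph-based assemblers.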

Once reference mapping or *de novo* assembly is done, the data can be analyzed structurally and differentially. The main purpose of differential analysis is to determine the differences in expression among different growth conditions or treatments [76]. Several software packages have been released for this purpose, but there is no consensus about best practices, which makes it difficult to select a tool or method. Seyednasrollah et al. (2013) [76] compared eight differential expression software packages using two real, publicly available datasets. Software that analyzes differential expression can be based on the Poisson method (DEGseq and Myrna), the negative binomial method (edgeR and DESeq) or other methods [67, 76]. Pinto et al. (2011) [67] recommend using DESeq or edgeR when analyzing replicates.
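The Poisson-based idea can be illustrated with a classic conditional trick: for two libraries of equal size, under the null hypothesis of equal expression a gene's count in condition A, given the total count, follows a Binomial(n, 0.5) distribution. The sketch below is our own simplification, not the actual code of any of the packages named above; it computes a library-size-normalized log2 fold change and that exact two-sided p-value.

```python
import math

def log2_fold_change(count_a, count_b, libsize_a, libsize_b, pseudo=0.5):
    """Library-size-normalized log2 fold change, with a pseudocount
    so genes unobserved in one condition do not divide by zero."""
    rate_a = (count_a + pseudo) / libsize_a
    rate_b = (count_b + pseudo) / libsize_b
    return math.log2(rate_b / rate_a)

def poisson_pvalue(count_a, count_b):
    """Exact two-sided test of equal Poisson rates for equal library
    sizes: count_a | total ~ Binomial(total, 0.5) under the null."""
    n = count_a + count_b
    p_obs = math.comb(n, count_a)
    # sum the probabilities of all outcomes at least as extreme
    tail = sum(math.comb(n, k) for k in range(n + 1)
               if math.comb(n, k) <= p_obs)
    return tail / 2 ** n
```

A 0-versus-10 split yields a small p-value while a 5-versus-5 split yields 1.0, matching the intuition that only unbalanced counts are evidence of differential expression. Negative binomial tools such as DESeq and edgeR extend this by modelling biological overdispersion across replicates.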

Transcriptome annotation and classification can be based on structural analysis, evaluating transcripts with respect to the genomic region with which they have been associated and into which they have been classified: protein-coding, non-coding and intergenic regions [65]. Several computational methods have been developed to predict ncRNA transcripts. Herbig and Nieselt (2011) [77] highlight the SIPHT, sRNAFinder, sRNAscanner, NOCORNAc and sRNAPredict software. NOCORNAc distinguishes itself as it is useful for predicting and characterizing ncRNAs in bacteria [77].

Assessing transcripts with respect to genomic regions relies on transcript annotation. The computational approach is convenient owing to its speed and precision compared to manual annotation; however, human supervision of the results is considered important in order to avoid false positives or missing features [1]. With this technique, some main structures must be detected: 5' transcript ends, 3' transcript ends, TSSs and operons [1, 65].

**a.** Transcript boundary identification


Annotation of transcript boundaries is important for operon identification and regulatory analyses [1]. Identifying the 5' UTR is not always possible: a significant number of transcripts lacking a 5' UTR have been found in bacteria and are called leaderless transcripts. In this situation, the translation start site and the transcription start site lie in almost the same position [65]. Annotation of the 3' UTR is important in order to obtain the entire analytical value of the RNA-seq data. Creecy and Conway (2014) [1] affirm that the current best method for detecting 3' ends is to search for correlations between replicate data. They highlight that the software package TransTermHP can find intrinsic terminators successfully.
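The coverage-based reasoning behind boundary calls can be sketched simply: inside a transcript the per-base read coverage stays roughly constant, and a 3' end shows up as a sharp drop relative to the upstream signal. The function below is a deliberately crude caricature of that idea (the threshold and names are our own), not the replicate-correlation method of Creecy and Conway.

```python
def find_3prime_end(coverage, threshold=0.1):
    """Return the 0-based position of the last base before per-base
    coverage first drops below `threshold` x the upstream running mean."""
    total = 0
    for pos, cov in enumerate(coverage):
        if pos > 0 and cov < threshold * (total / pos):
            return pos - 1  # last position still inside the transcript
        total += cov
    return len(coverage) - 1  # no drop found: transcript runs to the end
```

On a coverage profile that falls from ~100x to ~2x, the call lands on the last highly covered base; real methods additionally require the drop to be reproducible across replicates before accepting it as a 3' end.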

**b.** TSS identification

TSS annotation can assist in the annotation of ncRNAs and polycistronic transcripts [65]. According to Creecy and Conway (2014) [1], it is essential for discovering unknown transcripts and for analyzing operon, 5' UTR and promoter architecture. Although there are no well-established strategies for TSS identification, owing to scarce knowledge about transcription start sites in bacteria, developments in both computational analyses and "wet-lab" experiments have made TSS annotation more feasible [65]. TSSAR is a dRNA-seq data-based tool for rapid annotation of TSSs that takes dRNA-seq library statistics into account [78]. According to Backofen et al. (2014) [65], its main advantage is its statistical analysis, presented as an easy-to-use web service. The TSSpredator tool provides automated TSS detection and classification from RNA-seq data, performing a genome-wide comparative prediction of TSSs [79]. A comparison among manual annotation, TSSpredator and TSSAR annotation can be seen in [78].
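The dRNA-seq signal that such tools exploit can be caricatured as follows: the treated library is enriched for primary 5' ends, so a genuine TSS appears as a read-start peak that is both high and enriched relative to the untreated control. The thresholds and names below are invented for illustration; TSSAR and TSSpredator fit proper statistical models rather than fixed cut-offs.

```python
def call_tss(treated_starts, untreated_starts, min_ratio=2.0, min_height=10):
    """Flag positions whose read-start count in the enriched dRNA-seq
    library is both high and enriched over the untreated control.
    Inputs are per-position counts of read 5' ends on one strand."""
    return [pos for pos, (t, u) in enumerate(zip(treated_starts,
                                                 untreated_starts))
            if t >= min_height and t >= min_ratio * max(u, 1)]
```

A position with 50 read starts in the enriched library against 5 in the control is called a TSS, while low peaks are ignored; varying the two thresholds trades sensitivity against false positives, which is exactly the tuning the published tools automate.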

**c.** Operon identification

An operon is a cluster of co-transcribed genes regulated by the same regulatory sequence and transcribed into a single mRNA. This structure has immense biological importance, improving functional gene annotation and providing important information for studies of drug targeting, functional analyses and antibiotic resistance [80]. To handle the complexity of operon occurrence, operons should be detected using their architecture (i.e., 5' ends and 3' ends) and should have sufficient read coverage to connect promoters and terminators. A strong indication that an operon is real is that at least 90% of its bases are covered by reads [1]. Chuang et al. (2012) [80] classify computational methods for predicting operons and evaluate 15 algorithms with respect to accuracy, specificity and sensitivity.
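The coverage criterion above translates directly into a grouping rule: walk along the genome and join the next gene to the current operon whenever the intergenic gap is (almost) fully covered by reads. This sketch assumes same-strand, coordinate-sorted genes and invents its own data layout; real predictors, such as those reviewed by Chuang et al., combine many more features (strand, intergenic distance, functional similarity).

```python
def call_operons(genes, coverage, min_covered=0.9):
    """Group coordinate-sorted, same-strand genes (given as (start, end)
    half-open intervals) into operons whenever at least `min_covered`
    of the intergenic bases have non-zero read coverage."""
    operons = [[0]]
    for i in range(1, len(genes)):
        gap = coverage[genes[i - 1][1]:genes[i][0]]
        frac = sum(1 for c in gap if c > 0) / len(gap) if gap else 1.0
        if frac >= min_covered:
            operons[-1].append(i)  # gap is transcribed: same operon
        else:
            operons.append([i])    # coverage gap: new transcription unit
    return operons
```

Two genes separated by a fully covered two-base gap are joined into one operon, while a long uncovered stretch starts a new transcription unit, mirroring the 90%-coverage heuristic cited above.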
