**2. Bioinformatics analysis of RNASeq data**

One of the first steps in designing an RNASeq experiment is choosing an appropriate sequencing platform. Several sequencing platforms, such as Illumina, Roche, PacBio, and Ion Torrent, which are based on different sequencing chemistries and technologies, are available [reviewed in 34, 35]. The current leading platform for RNASeq (and other NGS-based analyses) is the HiSeq series of sequencers from Illumina (https://www.illumina.com/systems.html) because it provides high throughput, deep sequencing, low sequence error, and reads long enough to be useful in multiple applications. Recently, the PacBio RS II (http://www.pacificbiosciences.com/) has been gaining popularity for better transcriptome construction because of its ability to generate long reads. Once the millions of reads are generated from an RNASeq experiment, the bioinformatics data analysis begins. In the following section, we briefly present the bioinformatics data analysis steps, tools, and methods.

| RNASeq methods | Description | Reference |
| --- | --- | --- |
| PAR-CLIP Seq (Photoactivatable-Ribonucleoside-Enhanced Cross-Linking and Immunoprecipitation Sequencing) | To identify and sequence the binding sites of cellular RNA-binding proteins (RBPs) and microRNA-containing ribonucleoprotein complexes (miRNPs) | [172] |
| NETSeq (Native Elongation Transcript Sequencing) | It sequences and captures nascent RNA transcripts after immunoprecipitation of RNA Pol II elongation complex | [173] |
| TRAPSeq (Targeted Purification of Polysomal mRNA Sequencing) | To detect and identify translating mRNAs | [174] |
| PARESeq (Parallel Analysis of RNA Ends Sequencing) and GMUCT (Genome-wide Mapping of Uncapped Transcripts) | To detect and identify miRNA cleavage sites and uncapped transcripts that undergo degradation | [175] |
| TIFSeq (Transcript Isoform Sequencing) or Paired-End Analysis of Transcription start site (PEAT) | RNA isoforms are identified after 5' and 3' paired-end sequencing | [176] |
| CELSeq (Cell Expression by Linear amplification and Sequencing), SMARTSeq (Switching Mechanism At the 5′ end of the RNA Template Sequencing), STRT (Single-cell Tagged Reverse Transcription) | Single-cell transcriptomics methods | [177] |

**Table 1.** Various RNASeq-based methods to study transcriptome

114 Next Generation Sequencing - Advances, Applications and Challenges

Analysis of RNASeq data is a multistep process that typically includes quality check, data preprocessing, transcriptome assembly (reference-guided and *de novo* transcriptome assembly), quantification, statistical analysis, and functional annotation (Figure 1). These steps are described in detail below.

**Figure 1.** Basic RNASeq data analysis workflow. First, raw sequenced data are checked for quality and, if required, low-quality reads and artifacts are removed. In the case of reference-based assembly, the reads are mapped to the reference genome to determine their location. All the mapped reads are then analyzed for expression profiling. Further, differentially expressed genes and isoforms can be annotated using Gene Ontology (GO) and pathway enrichment analyses. In the *de novo* assembly approach, after preprocessing of the raw reads, the transcriptome can be assembled using different *de novo* transcriptome assemblers. Once transcripts are constructed and abundance estimates are obtained, the complete Open Reading Frames (ORFs) of the transcripts are predicted. The predicted ORFs can be annotated directly, or analyzed for expression profiling and then annotated, using remote homology search methods, GO, and pathway enrichment analyses.

#### **2.1. Quality check and data preprocessing**

Next generation sequencers assign each called base a Phred quality score (Q), which is logarithmically related to the probability (P) that the base call is incorrect: Q = -10 log10(P). Low Phred scores (Q < 30, i.e., an error probability greater than 1 in 1,000) indicate read data of poor quality. Poor-quality read data can arise from problems in the library preparation or from sequencing itself. Additionally, PCR artifacts, sequence-specific bias, untrimmed adapter sequences, and other possible contaminants can lead to poor data quality. These factors can affect the downstream analysis and data interpretation, and can give inaccurate results. To assess the quality of raw sequenced data, several tools such as FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and PRINSEQ [36] are available. Once the data are checked for quality, they should be processed to remove reads with low-quality bases, adapter sequences, and other contaminating sequences. Tools such as Cutadapt [37], Trimmomatic [38], TrimGalore (http://www.bioinformatics.babraham.ac.uk/projects/trim\_galore/), and FASTX-Toolkit (http://hannonlab.cshl.edu/fastx\_toolkit/), which trim adapters or other contaminants based upon user-provided parameters, can be used for performing these operations. A brief description of some of these quality check and data preprocessing tools is provided below:
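The relationship between a Phred score and the base-call error probability can be made concrete with a small sketch. The function names below are illustrative only, and the quality-string decoding assumes the common Phred+33 ASCII encoding of Fastq files; older data sets may use a different offset.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to the base-call error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_qualities(quality_string, offset=33):
    """Decode a Fastq quality string into a list of Phred scores.

    Assumption: Phred+33 encoding (offset=33) unless stated otherwise.
    """
    return [ord(ch) - offset for ch in quality_string]


def mean_quality(quality_string, offset=33):
    """Arithmetic mean of the Phred scores of one read."""
    scores = decode_qualities(quality_string, offset)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
    print(decode_qualities("II5#"))
```

Under this definition, Q20 corresponds to one expected error in 100 base calls and Q30 to one in 1,000, which is why Q30 is often used as a quality threshold.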

**FastQC:** FastQC is a simple, easy-to-use tool that evaluates the quality of read data generated by next generation sequencers. The input files for FastQC can be in Fastq, SAM, or BAM format, in either compressed or uncompressed form. FastQC reports basic statistics for the read data such as overrepresented sequences, k-mer content, base quality and content, adapter content, read duplication level, etc. FastQC is available as a stand-alone Java-based program with a graphical user interface and can be run on both Linux (using the command line) and Windows systems.

**PRINSEQ:** PRINSEQ reports base quality, GC content, duplicates, adapters, presence of ambiguous bases represented as "N," poly A tails, etc. Unlike FastQC, PRINSEQ also has the option of trimming and filtering reads. PRINSEQ is available both as a stand-alone program and as a web application (http://prinseq.sourceforge.net/). It accepts uncompressed files in Fasta, Qual, and Fastq formats.

**Trimmomatic:** Trimmomatic is a Java-based program for the preprocessing of NGS read data (http://www.usadellab.org/cms/?page=trimmomatic). It can trim contaminant sequences and adapters, and filter reads based upon quality. It accepts compressed files in Fastq format and generates output in Fastq format. Because it supports multithreading, it can process data faster than tools that perform the same function on a single thread. Unlike some of the other tools, Trimmomatic can analyze both single-end and paired-end read data.

**Cutadapt:** Cutadapt is a Python-based tool for read preprocessing and can be run as a command line application (https://cutadapt.readthedocs.org/en/stable). It accepts compressed files in Fasta, Qual, and Fastq formats, and supports both paired-end and single-end files. It trims low-quality bases and multiple adapter sequences from the 3' end, the 5' end, or both ends. In addition, Cutadapt can remove a fixed number of bases from either end of the sequences and supports demultiplexing, i.e., reads can be written to different output files depending upon the adapter sequence found in the reads. The demultiplexing feature is particularly useful since pooling multiple samples in a single run is an increasingly common practice as a result of increased sequencer throughput.

**TrimGalore:** TrimGalore is a wrapper tool written around FastQC and Cutadapt for quality check and adapter trimming of regular as well as MspI-digested RRBS-type (Reduced Representation Bisulfite-Seq) libraries. It accepts compressed Fastq files and supports paired-end and single-end data.

**FASTX-Toolkit:** FASTX-Toolkit is a collection of tools that accept read data in Fasta and Fastq file formats and trim the data based on base quality and adapter sequence contamination. Additionally, the toolkit has tools that can perform file format conversion, split sequences based upon barcodes, and generate the reverse complement of sequences.
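The kind of operations these preprocessing tools perform can be illustrated with a deliberately simplified sketch. It is not the algorithm used by Cutadapt, Trimmomatic, or any other tool named above: it only handles exact adapter matches at the 3' end, trims the low-quality 3' tail base by base, and assumes Phred+33 quality encoding. All function names, thresholds, and the example adapter are hypothetical choices made for illustration.

```python
def trim_3prime_adapter(seq, qual, adapter, min_overlap=3):
    """Remove an exact-match 3' adapter and everything after it.

    Handles a full adapter inside the read, or a partial adapter
    (a prefix of the adapter) at the very end of the read.
    """
    idx = seq.find(adapter)
    if idx != -1:
        return seq[:idx], qual[:idx]
    # Try the longest partial adapter first; never test an empty prefix.
    longest = min(len(adapter) - 1, len(seq))
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k], qual[:-k]
    return seq, qual


def trim_low_quality_tail(seq, qual, threshold=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below threshold."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


def preprocess_read(seq, qual, adapter, min_length=20, threshold=20):
    """Adapter trimming, then quality trimming, then a length filter.

    Returns the trimmed (sequence, quality) pair, or None if the read
    becomes shorter than min_length and should be discarded.
    """
    seq, qual = trim_3prime_adapter(seq, qual, adapter)
    seq, qual = trim_low_quality_tail(seq, qual, threshold)
    if len(seq) < min_length:
        return None
    return seq, qual


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # example adapter sequence, for illustration only
    read = "ACGTACGTAGATCGGAAG"
    quality = "I" * len(read)
    print(preprocess_read(read, quality, adapter, min_length=5))
```

Production tools additionally tolerate mismatches and indels in the adapter, use more robust quality-trimming strategies, and keep paired-end mates synchronized, so real data should be processed with the dedicated tools rather than with a sketch like this one.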


Once the read data are filtered and trimmed to remove low-quality bases, adapter sequences, and contaminants, they are ready for transcriptome assembly and profiling analysis. There are two different approaches for constructing full-length transcripts: reference-based assembly (when a reference genome is available) and *de novo* assembly (when a reference genome is not available), which is a computationally intensive and complex process (Table 2). Reference-based or genome-guided assembly refers to mapping sequenced reads to the reference genome followed by assembling the transcripts. In contrast, in *de novo* transcriptome assembly, transcripts are constructed directly from the overlapping sequenced reads. For organisms without a reference genome, *de novo* assembly is the only approach available for transcriptome construction. However, for organisms with a known reference genome, both reference-based and *de novo* transcriptome assembly can be employed. In fact, in this case, *de novo* transcriptome assembly can help fill in the gaps (caused by variation in the reference genome sequence and poor-quality annotation) and hence complements the reference-based transcriptome assembly. More details on these two transcriptome assembly approaches are discussed in the following sections.
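The idea of constructing transcripts directly from overlapping reads can be illustrated with a toy greedy merge. This is only a sketch under strong assumptions (error-free reads, a single strand, one transcript, exact overlaps); it is not the method of any particular assembler, and many real *de novo* assemblers rely on de Bruijn graphs instead. The function names and the `min_overlap` parameter are hypothetical.

```python
def suffix_prefix_overlap(a, b, min_overlap=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = suffix_prefix_overlap(a, b, min_overlap)
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    toy_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
    print(greedy_assemble(toy_reads))
```

The all-against-all overlap search in this sketch scales quadratically with the number of reads, which hints at why *de novo* assembly of millions of reads is computationally intensive and requires more sophisticated data structures.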


**Table 2.** Difference between reference-based and *de novo* assembly approaches
