**3. Sample preparation and library construction**

After defining the experimental design, a typical RNA‐seq experiment workflow consists of (i) RNA preparation, (ii) cDNA library construction, (iii) sequencing and (iv) bioinformatic analysis. Each step will be briefly discussed below.

#### **3.1. RNA preparation**

Since RNA is more labile than DNA and RNases are ubiquitous and very stable enzymes, special precautions and more stringent working practices should be taken to obtain pure and high‐quality RNA. Best practices can be found at [51] or spread on diverse companies' web‐ sites such as Thermo Fisher Scientific, Qiagen and Ambion.

In an RNA‐seq experiment, the RNA preparation consists basically of isolation/extraction and enrichment. Many RNA sample preparation techniques and commercial kits are available. No unique method is optimal for every application, and combination of methods may vary depending on the sample type and the study goals. It is always recommended to carefully follow manufacturer's instructions.

#### *3.1.1. RNA isolation and extraction*

In order to isolate high‐quality RNA, the samples need to be processed immediately after harvest. If an immediate isolation is not possible, samples can be stabilized in an intermedi‐ ary solution to preserve RNA integrity and allow storage. Commonly used stabilizers are RNA*later* (Thermo Fisher Scientific and Qiagen) and RNAstable (Sigma‐Aldrich). RNA isola‐ tion and extraction methods can be manual (e.g., TRIzol–Thermo Fisher Scientific) or auto‐ mated (e.g., RNeasy—Qiagen), and different types of samples require different approaches, although all of them comprise: (i) sample solubilization in the presence of detergent and chao‐ tropic agents, (ii) sample homogenization for complete cell disruption and (iii) RNA recovery from the lysate with organic or solid‐phase extraction. It is also important to have a final RNA free of genomic DNA (gDNA) contaminants. Some protocols can carry over some gDNA into total RNA samples that can be removed by a DNAse treatment. gDNA contamination can lead to a counting bias in downstream analysis and can be detected by reads background over the whole genome (false positive signal). Further information about sample preparation tech‐ niques and some commercial kits available can be found in Ref. [52]. Different commercial kits demonstrated satisfactory RNA yield, but differences in the quality of extracted RNA were observed, which can interfere on the downstream analysis [53].

RNA quality can be assessed by gel electrophoresis (agarose or polyacrylamide) or through Agilent Bioanalyzer. RNA quantity can be assessed using spectrophotometer (e.g., Nanodrop), fluorometer (e.g., Qubit) or Agilent Bioanalyzer. No single RNA quantification and quality control method are ideal, and it is necessary to know the limits of each method. We recom‐ mend Bioanalyzer since it measures the RNA integrity and level of degradation by the RNA Integrity Number (RIN) score that allows sample quality comparison by a scale with a range from 1 (most degraded) to 10 (most intact) [54, 55]. There is no consensus about the RIN cut‐ off for sample inclusion or exclusion in a study, but RIN ≥ 6 are commonly acceptable. DGE analysis could be performed even with RIN scores around 4 [56], but non‐degraded RNA is preferred for a successful transcriptome analysis. It is also important to highlight that some organisms do not present typical rRNAs peaks and cannot be evaluated by RIN value. Most insect RNA shows a cleavage of 28S rRNA into two similar fragments (28Sα and 28Sβ) that comigrate with 18S rRNA depending on pretreatment and electrophoresis conditions. This comigration is due to the disruption of the hydrogen bonds responsible for maintaining the two 28S fragments together. This profile should not be misinterpreted as low integrity and degradation [57]. In these cases, check the overall Bioanalyzer trace. More information about each method and a comparison study can be found in Refs. [58, 59], respectively.

#### *3.1.2. RNA enrichment*

to library construction [47]. Ambion ERCC spike‐in control mixes (Thermo Fisher Scientific) are commercially available. Sequins, another set of spike‐in RNA standards, can also be used as inter‐ nal controls and are freely available for non‐profit research upon request [48]. Normalization methods should be carefully chosen to ensure that spike‐in will behave as expected. The R pack‐

After defining the experimental design, a typical RNA‐seq experiment workflow consists of (i) RNA preparation, (ii) cDNA library construction, (iii) sequencing and (iv) bioinformatic

Since RNA is more labile than DNA and RNases are ubiquitous and very stable enzymes, special precautions and more stringent working practices should be taken to obtain pure and high‐quality RNA. Best practices can be found at [51] or spread on diverse companies' web‐

In an RNA‐seq experiment, the RNA preparation consists basically of isolation/extraction and enrichment. Many RNA sample preparation techniques and commercial kits are available. No unique method is optimal for every application, and combination of methods may vary depending on the sample type and the study goals. It is always recommended to carefully

In order to isolate high‐quality RNA, the samples need to be processed immediately after harvest. If an immediate isolation is not possible, samples can be stabilized in an intermedi‐ ary solution to preserve RNA integrity and allow storage. Commonly used stabilizers are RNA*later* (Thermo Fisher Scientific and Qiagen) and RNAstable (Sigma‐Aldrich). RNA isola‐ tion and extraction methods can be manual (e.g., TRIzol–Thermo Fisher Scientific) or auto‐ mated (e.g., RNeasy—Qiagen), and different types of samples require different approaches, although all of them comprise: (i) sample solubilization in the presence of detergent and chao‐ tropic agents, (ii) sample homogenization for complete cell disruption and (iii) RNA recovery from the lysate with organic or solid‐phase extraction. It is also important to have a final RNA free of genomic DNA (gDNA) contaminants. Some protocols can carry over some gDNA into total RNA samples that can be removed by a DNAse treatment. gDNA contamination can lead to a counting bias in downstream analysis and can be detected by reads background over the whole genome (false positive signal). Further information about sample preparation tech‐ niques and some commercial kits available can be found in Ref. [52]. Different commercial kits demonstrated satisfactory RNA yield, but differences in the quality of extracted RNA were

age *erccdashboard* [49] and Anaquin [50] can be used for spike‐in analysis.

8 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

**3. Sample preparation and library construction**

sites such as Thermo Fisher Scientific, Qiagen and Ambion.

observed, which can interfere on the downstream analysis [53].

analysis. Each step will be briefly discussed below.

**3.1. RNA preparation**

follow manufacturer's instructions.

*3.1.1. RNA isolation and extraction*

The type of the desired RNA molecule drives the RNA enrichment approach. Selection of mature mRNAs by their poly(A) tails is the most common application and can be carried out with magnetic or cellulose beads coated with oligo(dT) molecules or through oligo(dT) prim‐ ing for reverse transcription (RT). Therefore, since RNAs from formalin‐fixed and paraffin‐ embedded (FFPE) are degraded and mRNA‐seq poorly captures degraded mRNAs, it is not an appropriate method to use with FFPE samples [42], unless adapted protocols are applied such as the recently described protocol based on *in vitro* T7 transcription for linear ampli‐ fication of mRNA [60]. In order to surpass this limitation, rRNA depletion protocols have been developed based on hybridization in highly conserved ribosomal regions, including the selective depletion of abundant RNA (SDRNA) with RNase H [61, 62], Ribominus (Thermo Fisher Scientific), Ribo‐Zero (Illumina), GeneRead (Qiagen) and RiboGone (Takara). Another approach is the duplex‐specific nuclease (DSN) normalization by depletion of abundant tran‐ scripts, such as rRNAs and tRNAs [63, 64]. Samples can be also enriched of small ncRNAs (e.g., miRNA, siRNA and piRNA) via size‐selection through electrophoresis or based on solid phase extraction with commercial kits such as mirVana (Thermo Fisher Scientific) and miR‐ Neasy (Qiagen). For comparison studies between these methods, see Refs. [42, 65]. rRNA depletion is recommended rather than oligo(dT) because it can capture a complete view of the transcriptome and can be used for low‐quality RNA samples [65].

#### **3.2. cDNA Library construction**

The library construction includes four steps: (i) RNA/cDNA fragmentation, (ii) cDNA synthe‐ sis, (iii) adapters ligation and (iv) quantification. Some specific points will be briefly discussed below, but additional information can be found in Refs. [41, 45].

#### *3.2.1. RNA/cDNA fragmentation*

The length of your RNA insert is a key factor for library construction and sequencing. Since most current platforms sequence only short reads, most protocols incorporate an RNA or cDNA fragmentation step that allows amplification and sequencing. For short RNAs (under 200 pb), no fragmentation is required. There are three main ways to fragment the nucleic acid samples: physical (e.g., sonication, nebulization), enzymatic (e.g., RNase III, DNase I or Fragmentase) and chemical (e.g., heat, metal ion) shearing. Little information is known about which is the best method for each application. A comparison study of nebulization, sonica‐ tion and enzymatic digestion showed that all three methods presented equal performance and that fragmentation is indicated [66]. In most cases, RNA is fragmented before conver‐ sion into cDNA. Furthermore, it is important to highlight that due to FFPE samples degrada‐ tion, cDNA fragmentation must be performed instead of RNA fragmentation when using oligo(dT) priming for first‐strand synthesis.

or more extra functional elements such as barcode/index to allow multiplexing and a sec‐ ond sequencing‐priming site to allow paired‐end sequencing. The addition of adapter via Y‐ adapter PCR is the most commonly used technique. Adapters can also be added via RT/PCR

RNA‐seq: Applications and Best Practices http://dx.doi.org/10.5772/intechopen.69250 11

To ensure the maximum yield (i.e., data output) and quality from your RNA‐seq experiment, it is important to have a precise quantification of your NGS libraries. Inaccurate quantifi‐ cation may lead to lower throughput, lower sequences qualities and poor samples balance within your multiplex. There are many ways to quantify your libraries, but the most accurate and effective method is quantitative real‐time PCR (qPCR). qPCR is more sensitive and only quantifies amplifiable DNA molecules (i.e., molecules that contain both adaptor sequence), providing a more precise estimation. Some commercial kits available are KAPA Library Quantification Kit (Kapa Biosystem), GeneRead Library Quant System (Qiagen), Ion Library TaqMan Quantitation Kit (Thermo Fisher Scientific), QPCR NGS Library Quant Kit (Agilent), PerfeCTa NGS Quantitation Kit (Quantabio) and NEBNext Library Quant Kit (New England BioLabs). Other methods are similar to the previously mentioned for RNA quantification: spectrophotometer (e.g., Nanodrop), fluorometer (e.g., Qubit) and Agilent Bioanalyzer. However, since these methods measure total nucleic acid concentrations, including non‐ amplifiable DNA, they can lead to inaccurate results. It is also recommended to verify the libraries fragment size distribution, which can be performed by electrophoresis, preferably Agilent Bioanalyzer. Bioanalyzer electropherogram needs to show a narrow distribution with a peak height of the average size fragmentation value. After quantification, the library must be sequenced with the platforms discussed in Section 2.3, and data must be analyzed through

RNA‐seq data analysis involves many different strategies that depend on the goals and biolog‐ ical questions established at the time of the study design. A typical data analysis includes qual‐ ity control, reads preprocessing, alignment to a reference or *de novo* assembly and downstream analysis such as transcripts annotation, DGE, gene fusion analysis and alternative splicing. In the following topics, we will emphasize common steps and applications of this technology. A detailed workflow for data analysis is presented in **Figure 2**. Bioinformatic tools discussed in this chapter are compiled at **Table 1**, and a more exhaustive list of available tools can be found in Ref. [71]. For those with limited access for computational resources or little experience with command‐line execution of these bioinformatic tools, free online (Galaxy [72]) and commercial (Illumina BaseSpace [73] and Geneious [74]) platforms can be very helpful and intuitive.

A complete pipeline for an RNA‐seq analysis demands some checkpoints in order to ensure the quality of the results and elimination of noise from the biological samples. After sequencing,

during the first‐ and second‐strand synthesis process or via ligation.

bioinformatic tools. RNA‐seq data analysis will be discussed below.

*3.2.4. Library quantification*

**4. Data analysis**

**4.1. Quality control and reads preprocessing**

#### *3.2.2. cDNA synthesis*

After an adequate RNA preparation, RNA must be converted to double complementary DNA (cDNA) via RT, generating a cDNA:RNA hybrid. This process is known as first‐strand cDNA synthesis and requires an oligonucleotide primer. Three options are available: oligo(dT) prim‐ ing, random priming or gene‐specific priming. The first two are the mainly used for RNA‐ seq. Oligo(dT) priming is one of the oldest methods for first‐strand synthesis and involves oligo(dT) primer to capture the poly(A) tail of mature mRNA. Because of their specificity for poly(A) tails, oligo(dT) priming is not compatible with fragmented RNA, such as FFPE samples, nor for RNAs that lack poly(A) tails, such as non‐mRNAs (e.g., microRNAs (miR‐ NAs)). If using this methodology, cDNA fragmentation must be performed instead of RNA fragmentation. Besides that, RTs are not highly processive polymerases and can prematurely terminate the strand biosynthesis, leading to 3′ end bias and under‐representation of the 5′ ends. Random priming involves oligonucleotides with random base sequences that prime at random positions along the RNA (i.e., no template specificity), and it is preferable to oligo(dT) priming. This approach allows recovery of non‐poly(A) RNAs and prevents 3′ end bias, result‐ ing in a more uniform transcript coverage. However, it was shown that random priming is not completely random leading to a nucleotide bias across the first reads positions [67, 68].

The first‐strand cDNA is used as a template to generate double‐stranded cDNA. Second‐ strand cDNA synthesis can be performed by (i) RNA nicking of the RNA template by RNase H and synthesis with *E. coli* DNA polymerase I and T4 DNA ligase [69], (ii) using an oligo that is complementary to an adapter located in the 5′ end of the RNA template or by (iii) Clontech's SMART (Switching Mechanism At 5′ end of RNA Transcript) technology [70]. RNase H method presented a better performance for low‐quality RNA when compared to four other methods (Ribo‐Zero, NuGEN, SMART and DSN‐lite) [65].

#### *3.2.3. Adapters sequences and ligation*

Adapters sequences must be ligated at the ends of every single molecule during library prepa‐ ration, and this process varies depending upon the sequencing platform. It can contain one or more extra functional elements such as barcode/index to allow multiplexing and a sec‐ ond sequencing‐priming site to allow paired‐end sequencing. The addition of adapter via Y‐ adapter PCR is the most commonly used technique. Adapters can also be added via RT/PCR during the first‐ and second‐strand synthesis process or via ligation.

#### *3.2.4. Library quantification*

*3.2.1. RNA/cDNA fragmentation*

oligo(dT) priming for first‐strand synthesis.

*3.2.2. cDNA synthesis*

The length of your RNA insert is a key factor for library construction and sequencing. Since most current platforms sequence only short reads, most protocols incorporate an RNA or cDNA fragmentation step that allows amplification and sequencing. For short RNAs (under 200 pb), no fragmentation is required. There are three main ways to fragment the nucleic acid samples: physical (e.g., sonication, nebulization), enzymatic (e.g., RNase III, DNase I or Fragmentase) and chemical (e.g., heat, metal ion) shearing. Little information is known about which is the best method for each application. A comparison study of nebulization, sonica‐ tion and enzymatic digestion showed that all three methods presented equal performance and that fragmentation is indicated [66]. In most cases, RNA is fragmented before conver‐ sion into cDNA. Furthermore, it is important to highlight that due to FFPE samples degrada‐ tion, cDNA fragmentation must be performed instead of RNA fragmentation when using

10 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

After an adequate RNA preparation, RNA must be converted to double complementary DNA (cDNA) via RT, generating a cDNA:RNA hybrid. This process is known as first‐strand cDNA synthesis and requires an oligonucleotide primer. Three options are available: oligo(dT) prim‐ ing, random priming or gene‐specific priming. The first two are the mainly used for RNA‐ seq. Oligo(dT) priming is one of the oldest methods for first‐strand synthesis and involves oligo(dT) primer to capture the poly(A) tail of mature mRNA. Because of their specificity for poly(A) tails, oligo(dT) priming is not compatible with fragmented RNA, such as FFPE samples, nor for RNAs that lack poly(A) tails, such as non‐mRNAs (e.g., microRNAs (miR‐ NAs)). If using this methodology, cDNA fragmentation must be performed instead of RNA fragmentation. Besides that, RTs are not highly processive polymerases and can prematurely terminate the strand biosynthesis, leading to 3′ end bias and under‐representation of the 5′ ends. Random priming involves oligonucleotides with random base sequences that prime at random positions along the RNA (i.e., no template specificity), and it is preferable to oligo(dT) priming. This approach allows recovery of non‐poly(A) RNAs and prevents 3′ end bias, result‐ ing in a more uniform transcript coverage. However, it was shown that random priming is not completely random leading to a nucleotide bias across the first reads positions [67, 68].

The first‐strand cDNA is used as a template to generate double‐stranded cDNA. Second‐ strand cDNA synthesis can be performed by (i) RNA nicking of the RNA template by RNase H and synthesis with *E. coli* DNA polymerase I and T4 DNA ligase [69], (ii) using an oligo that is complementary to an adapter located in the 5′ end of the RNA template or by (iii) Clontech's SMART (Switching Mechanism At 5′ end of RNA Transcript) technology [70]. RNase H method presented a better performance for low‐quality RNA when compared to

Adapters sequences must be ligated at the ends of every single molecule during library prepa‐ ration, and this process varies depending upon the sequencing platform. It can contain one

four other methods (Ribo‐Zero, NuGEN, SMART and DSN‐lite) [65].

*3.2.3. Adapters sequences and ligation*

To ensure the maximum yield (i.e., data output) and quality from your RNA‐seq experiment, it is important to have a precise quantification of your NGS libraries. Inaccurate quantifi‐ cation may lead to lower throughput, lower sequences qualities and poor samples balance within your multiplex. There are many ways to quantify your libraries, but the most accurate and effective method is quantitative real‐time PCR (qPCR). qPCR is more sensitive and only quantifies amplifiable DNA molecules (i.e., molecules that contain both adaptor sequence), providing a more precise estimation. Some commercial kits available are KAPA Library Quantification Kit (Kapa Biosystem), GeneRead Library Quant System (Qiagen), Ion Library TaqMan Quantitation Kit (Thermo Fisher Scientific), QPCR NGS Library Quant Kit (Agilent), PerfeCTa NGS Quantitation Kit (Quantabio) and NEBNext Library Quant Kit (New England BioLabs). Other methods are similar to the previously mentioned for RNA quantification: spectrophotometer (e.g., Nanodrop), fluorometer (e.g., Qubit) and Agilent Bioanalyzer. However, since these methods measure total nucleic acid concentrations, including non‐ amplifiable DNA, they can lead to inaccurate results. It is also recommended to verify the libraries fragment size distribution, which can be performed by electrophoresis, preferably Agilent Bioanalyzer. Bioanalyzer electropherogram needs to show a narrow distribution with a peak height of the average size fragmentation value. After quantification, the library must be sequenced with the platforms discussed in Section 2.3, and data must be analyzed through bioinformatic tools. RNA‐seq data analysis will be discussed below.

## **4. Data analysis**

RNA‐seq data analysis involves many different strategies that depend on the goals and biolog‐ ical questions established at the time of the study design. A typical data analysis includes qual‐ ity control, reads preprocessing, alignment to a reference or *de novo* assembly and downstream analysis such as transcripts annotation, DGE, gene fusion analysis and alternative splicing. In the following topics, we will emphasize common steps and applications of this technology. A detailed workflow for data analysis is presented in **Figure 2**. Bioinformatic tools discussed in this chapter are compiled at **Table 1**, and a more exhaustive list of available tools can be found in Ref. [71]. For those with limited access for computational resources or little experience with command‐line execution of these bioinformatic tools, free online (Galaxy [72]) and commercial (Illumina BaseSpace [73] and Geneious [74]) platforms can be very helpful and intuitive.

#### **4.1. Quality control and reads preprocessing**

A complete pipeline for an RNA‐seq analysis demands some checkpoints in order to ensure the quality of the results and elimination of noise from the biological samples. After sequencing, the analysis starts with files containing the raw reads. The FASTQ [75] is the standard format used to store the nucleotide sequences along with a per base quality score in Phred log scale. The qualities, typically with scores from 0 to 40, are represented by single letters encoded with pre‐defined ranges of characters from the American Standard Code for Information Interchange (ASCII) table. Currently, there are two patterns: Phred + 64, used in initial Illumina versions 1.3 + and 1.5 + and Phred + 33, the default encoding for Sanger and more recent sequencers. The FASTQ is widely accepted and used in most downstream software, although the unmapped BAM (uBAM) format has been recently encouraged as it is capable of storing important sequencing metadata not present in FASTQ, and for being binary, it demands less disk storage. Some sequencing platforms, like Ion Torrent, have already included uBAM as default output format in their pipelines. Both formats are interchangeable by using Picard [76], BamUtil [77] and BamTools [78].

**Category Tools** Experimental design Scotty [12] Raw reads quality control FASTQC [75]

Reads preprocessing Read clipping/trimming Picard [72], BamUtil [73], BamTools [74],

Unspliced mapping Hash Table index based BFAST [90], MAQ [92], Mosaik [93], Novoalign

Paired‐end reads overlapping detection

Spliced‐aware mapping Hash Table index based GSNAP [91], RNASEQR [103]

Alignment quality assessment Picard [72], BamUtil [73], BamTools [74],

Assembly Genome‐guided Cufflinks [111], Scripture [112], StringTie [113]

Assembly quality assessment Detonate [122], TransRate [123], BUSCO [124]

Raw read counts Mapped‐based featureCounts [129], HTSeq‐count [130], RSEM

Differential expression DESeq [132], DESeq2 [133], edgeR [134],

Alternative splicing Cufflinks [111], Scripture [112], StringTie [113] Differential alternative splicing CuffDiff [111], Ballgown [139], DEXSeq [161],

Fusion genes SOAPfuse [170], FusionCatcher [171], JAFFA

miRNA miRdeep2 [179], miReNA [180], miRanalyzer

Annotation BLAST [145, 146], DIAMOND [147],

Alignment visualization IGV [125], Tablet [126], UCSC [127]

Raw read counts quality assessment NOISeq [131]

Enrichment analysis GSEA [155]

**Table 1.** Tools for RNA‐seq data analysis.

PRINSEQ [76], Cutadapt [78], FASTX‐Toolkit

RNA‐seq: Applications and Best Practices http://dx.doi.org/10.5772/intechopen.69250 13

[79], Trimmomatic [80]

FLASh [84], PEAR [85]

FM‐index‐based TopHat2 [98], HISAT2 [99], SOAP‐splice [101],

[108], SAMstat [109]

*De novo* Rnnotator [115], Trans‐ABySS [116], Trinity [119], Oases [120]

[136]

Pseudoalignment Kallisto [140], Salmon [141]

[94], RMAP [95], SHRiMP [96]

STAR [102], RNASEQR [103]

Samtools [106], Qualimap2 [107], BAMstats

CuffDiff2 [137], BitSeq [138], Ballgown [139]

InterProScan [148], tRNAscan‐SE [149], RNAmmer [150], Blast2GO [151], Annocript [152], TRAPID [153], Trinotate [154]

rMATS [162], SpliceR [163], MISO [164],

DiffSplice [165]

[172]

[181]

Reads error correction SEECER [86], Rcorrector [87]

FM‐index based Bowtie2 [97], BWA [100]

The first step is to perform a quality control (QC) of the data, checking parameters like amount of reads per sample, general read and base qualities, mean reads length, G + C content, pres‐ ence of unclipped adapters or PCR primers and unexpected repetitive sequences. This general overview will indicate if library construction and sequencing were properly performed, or if errors like contaminants, poor ribosomal RNA depletion or low sequencing output will demand a new round of experiments. The most common software used to retrieve these basic statistics is FastQC [79] and PRINSEQ [80]. The first was mainly designed for Illumina, while the later for 454/Roche technology and may be also used for preprocessing. Both programs are available with intuitive graphical user interfaces (GUI), accept other sequencing technologies input files and generate graphical reports, which are very useful for guiding the choice of filtering thresholds.

**Figure 2.** RNA‐seq data analysis. (1) Raw single‐end and paired‐end reads obtained from NGS sequencing; (2) Adapters clipping and base quality trimming. Alternatively error correction can be performed; (3) Mapping without preprocessing using soft‐clipping; (4) Unspliced or spliced‐aware reads mapping; (5) Assess mapping quality and biases; (6) Mapping visualization; (7) Transcriptome genome‐guided assembly; (8) Per feature quantification using mapped reads; (9) Per feature quantification using quasi‐mapping approach; (10) Transcriptome de novo assembly; (11) Mapping reads to de novo assembled transcriptome; (12) Downstream data analysis.


**Table 1.** Tools for RNA‐seq data analysis.

the analysis starts with files containing the raw reads. The FASTQ [75] is the standard format used to store the nucleotide sequences along with a per base quality score in Phred log scale. The qualities, typically with scores from 0 to 40, are represented by single letters encoded with pre‐defined ranges of characters from the American Standard Code for Information Interchange (ASCII) table. Currently, there are two patterns: Phred + 64, used in initial Illumina versions 1.3 + and 1.5 + and Phred + 33, the default encoding for Sanger and more recent sequencers. The FASTQ is widely accepted and used in most downstream software, although the unmapped BAM (uBAM) format has been recently encouraged as it is capable of storing important sequencing metadata not present in FASTQ, and for being binary, it demands less disk storage. Some sequencing platforms, like Ion Torrent, have already included uBAM as default output format in their pipelines. Both formats are interchangeable by using Picard [76], BamUtil [77]

12 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

The first step is to perform a quality control (QC) of the data, checking parameters like amount of reads per sample, general read and base qualities, mean reads length, G + C content, pres‐ ence of unclipped adapters or PCR primers and unexpected repetitive sequences. This general overview will indicate if library construction and sequencing were properly performed, or if errors like contaminants, poor ribosomal RNA depletion or low sequencing output will demand a new round of experiments. The most common software used to retrieve these basic statistics is FastQC [79] and PRINSEQ [80]. The first was mainly designed for Illumina, while the later for 454/Roche technology and may be also used for preprocessing. Both programs are available with intuitive graphical user interfaces (GUI), accept other sequencing technologies input files and generate graphical reports, which are very useful for guiding the choice of filtering thresholds.

**Figure 2.** RNA‐seq data analysis. (1) Raw single‐end and paired‐end reads obtained from NGS sequencing; (2) Adapters clipping and base quality trimming. Alternatively error correction can be performed; (3) Mapping without preprocessing using soft‐clipping; (4) Unspliced or spliced‐aware reads mapping; (5) Assess mapping quality and biases; (6) Mapping visualization; (7) Transcriptome genome‐guided assembly; (8) Per feature quantification using mapped reads; (9) Per feature quantification using quasi‐mapping approach; (10) Transcriptome de novo assembly; (11) Mapping reads to de

novo assembled transcriptome; (12) Downstream data analysis.

and BamTools [78].

The following preprocessing step is crucial and can greatly influence the data analysis [81]. Besides PRINSEQ, other tools like Cutadapt [82], FASTX‐Toolkit [83] and Trimmomatic [84] are efficient in preprocessing reads, but FASTX‐Toolkit cannot be used with paired‐end reads. Generally, due to problems inherent in sequencing technologies, the bases in 3'end of reads have lower quality, and one may choose to filter off reads with low mean quality or trim only the low‐quality ends. Trimming in most cases may improve mappability, although shorter reads have a higher probability of erroneous mapping. Therefore, it is recommended to remove short reads in conjunction with non‐aggressive base‐quality trimming to avoid spurious mapping and incorrect inferences [85, 86]. Adapter removal and trimming low‐qual‐ ity ends improve RNA‐seq assembly, single nucleotide polymorphism (SNP) detection and gene‐expression analysis.

indels, reads spanning exon junctions and repetitive regions or pseudogenes in references. To guarantee reproducibility, it is highly recommended reporting alignment parameter details, such as mapper and reference versions and sources, allowed seed mismatches, minimal align‐

RNA‐seq: Applications and Best Practices http://dx.doi.org/10.5772/intechopen.69250 15

Mappers can be roughly divided by the algorithm chosen to create indexes and by the ability to recognize exon‐exon junctions. Indexes have the purpose of making the alignments signifi‐ cantly faster and are mainly divided into Hash Table or compressed prefix or suffix array‐like structures (FM‐index). Their principle is to quickly find small local alignments representing sub‐ strings of whole reads—designated as seeds—in the reference and then extend those alignments surpassing a defined quality threshold toward the read ends, assigning a Phred‐based mapping quality score for each read. Unfortunately, most mappers have developed their own mapping quality formulas, creating a non‐uniform mapping qualification. Some well‐known Hash Table‐ based algorithms are BFAST [94], GSNAP [95], MAQ [96], Mosaik [97], Novoalign [98], RMAP [99] and SHRiMP [100], while Bowtie2 [101], TopHat2 [102], HISAT2 [103], BWA [104], SOAP‐

Regarding the splicing events, they can be divided into unspliced and splice‐aware aligners. Most recent mappers are capable of using reference annotation files to deal with known exon‐ exon junctions and to predict new splice sites, which is essential when analyzing RNA‐seq data from most eukaryotes. GSNAP, SOAP‐splice, RNASEQR [107], STAR and TopHat2 are some recommended options for spliced alignments, but for intronless species, miRNA and transcriptomes, unspliced aligners can be used. Comparative evaluations showed that FM‐ index‐based mappers are preferable [108] and that, again, no tool is the best for every perfor‐

The standard alignment output is the Sequence Alignment/Map (SAM) format or its binary version BAM and they are essential inputs for many downstream applications. Picard [76] and Samtools [110] are frequently used to manipulate these files. It is advisable to assess the alignment quality from SAM/BAM files with tools like Qualimap2 [111], BAMstats [112] and

Short RNA‐seq reads represent only a small portion of most transcripts, and therefore, over‐ laps have to be detected in order to fully reconstruct the original molecules. Paralogous genes, alternative splicing, alternative transcription initiation and termination sites increase the complexity and impose computational challenges in Eukaryotic assembly analysis [114]. For Bacteria, Archaea and lower eukaryotes, the absence or smaller amount of introns makes the

RNA‐seq assemblers greatly differ from DNA‐seq algorithms because a wide range of tran‐ scripts coverage is expected, and several gene isoforms can be observed resulting in thousands of contigs stead of ideally one per chromosome. When a good quality reference genome is available, the usual procedure is to use the coordinates of aligned reads to separate them into clusters and perform a *de novo* alignment individually for each locus, from which individual

mance parameters like speed, alignment yield, exon discovery and accuracy [109].

SAMstat [113] for general characterization or for comparing mappers' performances.

*4.2.2. Genome‐guided assembly*

assembly more straightforward.

ment score and treatment given to multi mapping reads.

splice [105] and STAR [106] are examples of FM‐index based algorithms.

Modern mapping tools (see next section) are capable of labeling the unaligned read ends, a process known as soft‐clipping, without actually removing them (hard‐clipping). There is no consensus on which approach is the best, but it has been considered that keeping as much as information as possible would be better for downstream analysis. For example, the soft‐ clipped reads are important for detection of genomic structural variants [87].

When the goal is to perform RNA‐seq *de novo* assembly, supplementary tools can be used to join overlapping paired‐end reads, like FLASh [88] and PEAR [89]. Additionally, base error correction can be applied as an alternative to read trimming and filtering, increasing the amount of useful data and consequently the contig sizes. SEECER [90] and Rcorrector [91] were specifically designed for this task. Both strategies will likely improve assembly qualities.

In summary, preprocessing is beneficial, but there is no best tool for any experiment or general rule for filtering thresholds. All software has its own standard parameters, advantages and limitations, being recommended a case‐by‐case analysis and a thorough software comparison.

#### **4.2. Mapping, assembling and visualizing mapped reads**

Now that the raw reads have been preprocessed, alternative approaches can be chosen according to the availability of a reference sequence. If present, reads can be mapped to the genome and the gene that originated the transcript from which the reads were derived may be inferred and expression quantified. The genome may also be used to guide transcriptome assembly, resulting in several contigs representing the genes and its isoforms. On the other hand, if the studied species still lacks a reference sequence, reads can be *de novo* assembled, and transcripts can now be used as a mapping reference.

#### *4.2.1. Mapping to a reference*

Mapping reads to a reference can be also seen as a traditional pair‐wise sequence alignment, as observed in common Basic Local Alignment Search Tool (BLAST) [92], but with the main difference that a vast amount of reads are compared with a database composed of fewer and longer sequences instead of several thousand nucleotides/proteins. This is a field under con‐ stant development with plenty of tools available [93]. These tools have to deal with inherent mapping challenges, such as sequencing errors, natural sequence variability like SNPs and indels, reads spanning exon junctions and repetitive regions or pseudogenes in references. To guarantee reproducibility, it is highly recommended reporting alignment parameter details, such as mapper and reference versions and sources, allowed seed mismatches, minimal align‐ ment score and treatment given to multi mapping reads.

Mappers can be roughly divided by the algorithm chosen to create indexes and by the ability to recognize exon‐exon junctions. Indexes have the purpose of making the alignments signifi‐ cantly faster and are mainly divided into Hash Table or compressed prefix or suffix array‐like structures (FM‐index). Their principle is to quickly find small local alignments representing sub‐ strings of whole reads—designated as seeds—in the reference and then extend those alignments surpassing a defined quality threshold toward the read ends, assigning a Phred‐based mapping quality score for each read. Unfortunately, most mappers have developed their own mapping quality formulas, creating a non‐uniform mapping qualification. Some well‐known Hash Table‐ based algorithms are BFAST [94], GSNAP [95], MAQ [96], Mosaik [97], Novoalign [98], RMAP [99] and SHRiMP [100], while Bowtie2 [101], TopHat2 [102], HISAT2 [103], BWA [104], SOAP‐ splice [105] and STAR [106] are examples of FM‐index based algorithms.

Regarding the splicing events, they can be divided into unspliced and splice‐aware aligners. Most recent mappers are capable of using reference annotation files to deal with known exon‐ exon junctions and to predict new splice sites, which is essential when analyzing RNA‐seq data from most eukaryotes. GSNAP, SOAP‐splice, RNASEQR [107], STAR and TopHat2 are some recommended options for spliced alignments, but for intronless species, miRNA and transcriptomes, unspliced aligners can be used. Comparative evaluations showed that FM‐ index‐based mappers are preferable [108] and that, again, no tool is the best for every perfor‐ mance parameters like speed, alignment yield, exon discovery and accuracy [109].

The standard alignment output is the Sequence Alignment/Map (SAM) format or its binary version BAM and they are essential inputs for many downstream applications. Picard [76] and Samtools [110] are frequently used to manipulate these files. It is advisable to assess the alignment quality from SAM/BAM files with tools like Qualimap2 [111], BAMstats [112] and SAMstat [113] for general characterization or for comparing mappers' performances.

#### *4.2.2. Genome‐guided assembly*

The following preprocessing step is crucial and can greatly influence the data analysis [81]. Besides PRINSEQ, other tools like Cutadapt [82], FASTX‐Toolkit [83] and Trimmomatic [84] are efficient in preprocessing reads, but FASTX‐Toolkit cannot be used with paired‐end reads. Generally, due to problems inherent in sequencing technologies, the bases in 3'end of reads have lower quality, and one may choose to filter off reads with low mean quality or trim only the low‐quality ends. Trimming in most cases may improve mappability, although shorter reads have a higher probability of erroneous mapping. Therefore, it is recommended to remove short reads in conjunction with non‐aggressive base‐quality trimming to avoid spurious mapping and incorrect inferences [85, 86]. Adapter removal and trimming low‐qual‐ ity ends improve RNA‐seq assembly, single nucleotide polymorphism (SNP) detection and

14 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

Modern mapping tools (see next section) are capable of labeling the unaligned read ends, a process known as soft‐clipping, without actually removing them (hard‐clipping). There is no consensus on which approach is the best, but it has been considered that keeping as much as information as possible would be better for downstream analysis. For example, the soft‐

When the goal is to perform RNA‐seq *de novo* assembly, supplementary tools can be used to join overlapping paired‐end reads, like FLASh [88] and PEAR [89]. Additionally, base error correction can be applied as an alternative to read trimming and filtering, increasing the amount of useful data and consequently the contig sizes. SEECER [90] and Rcorrector [91] were specifically designed for this task. Both strategies will likely improve assembly qualities. In summary, preprocessing is beneficial, but there is no best tool for any experiment or general rule for filtering thresholds. All software has its own standard parameters, advantages and limitations, being recommended a case‐by‐case analysis and a thorough software comparison.

Now that the raw reads have been preprocessed, alternative approaches can be chosen according to the availability of a reference sequence. If present, reads can be mapped to the genome and the gene that originated the transcript from which the reads were derived may be inferred and expression quantified. The genome may also be used to guide transcriptome assembly, resulting in several contigs representing the genes and its isoforms. On the other hand, if the studied species still lacks a reference sequence, reads can be *de novo* assembled,

Mapping reads to a reference can be also seen as a traditional pair‐wise sequence alignment, as observed in common Basic Local Alignment Search Tool (BLAST) [92], but with the main difference that a vast amount of reads are compared with a database composed of fewer and longer sequences instead of several thousand nucleotides/proteins. This is a field under con‐ stant development with plenty of tools available [93]. These tools have to deal with inherent mapping challenges, such as sequencing errors, natural sequence variability like SNPs and

clipped reads are important for detection of genomic structural variants [87].

**4.2. Mapping, assembling and visualizing mapped reads**

and transcripts can now be used as a mapping reference.

*4.2.1. Mapping to a reference*

gene‐expression analysis.

Short RNA‐seq reads represent only a small portion of most transcripts, and therefore, over‐ laps have to be detected in order to fully reconstruct the original molecules. Paralogous genes, alternative splicing, alternative transcription initiation and termination sites increase the complexity and impose computational challenges in Eukaryotic assembly analysis [114]. For Bacteria, Archaea and lower eukaryotes, the absence or smaller amount of introns makes the assembly more straightforward.

RNA‐seq assemblers greatly differ from DNA‐seq algorithms because a wide range of tran‐ scripts coverage is expected, and several gene isoforms can be observed resulting in thousands of contigs stead of ideally one per chromosome. When a good quality reference genome is available, the usual procedure is to use the coordinates of aligned reads to separate them into clusters and perform a *de novo* alignment individually for each locus, from which individual isoforms can be inferred. Cufflinks [115], Scripture [116] and StringTie [117] are recommended tools, and their algorithm strategies have been reviewed [118], with StringTie [117] presenting better transcript reconstruction performance. Paired‐end, strand‐specific libraries and longer reads are highly encouraged for better assemblies and to allow distinction in overlapping tran‐ scripts from opposite strands for gene‐dense species and antisense transcription. Genome‐ guided assembled transcriptomes can be used to improve gene structures annotation through detection of transcription boundaries and splice‐sites.

Reads Per Kilobase per Million (RPKM mapped reads) or Fragments Per Kilobase per Million (FPKM mapped reads) metrics. Their principle is to count the amount of raw reads mapped to each genomic feature and normalize considering the gene length and library depth. Although still widely applied, these normalization metrics should be avoided as RPKM has shown to be inconsistent and Transcripts Per Million (TPM) is preferable [133]. Raw reads counting can be obtained with feature counts [134] and HTSeq‐count [135], which are capable of detecting multi‐mapping reads, exon junctions and overlapping reference features. NOISeq [136] can be used to assess the count quality parameters, such as saturation and specificity, in a set of

DESeq [137], DESeq2 [138] and edgeR [139] packages are recommended for between‐sample comparisons to detect differences in the relative abundances of genes [140]. Quantification at transcript level can be analyzed with Cufflinks [115] and RSEM [141] and compared with DESeq2, CuffDiff2 [142], BitSeq [143] or Ballgown [144]. Variations in expression between dif‐ ferent conditions are usually measured in log2 fold‐change units. DESeq2 can also perform

Generally, a control set of housekeeping genes should present non‐differential expression and

Analysis (PCA) plots [18]. For a set of 12 or less replicates, at gene level, edgeR or DESeq2 is recommended to detect differential expression and DESeq when more than 12 replicates are available [21]. Thresholds in log2 fold‐change should be applied to increase the true positive and decrease the false positive rates, but this parameter is highly dependent on the amount of

Recently, quasi‐mapping (or pseudoalignment) approaches have been proposed for RNA‐ seq quantification, like Kallisto [145] and Salmon [146]. Their main difference is that reads are assigned to reference sequences without base‐to‐base alignment, making analyses usu‐ ally considerably faster. They have shown comparable performance over complete mapping‐ based methods, can incorporate information from multi‐mapping reads, and provide counts and abundances already as normalized TPM values, which can be used as input for differen‐

Although RNA‐seq provides a precise and accurate estimation of RNA abundance, these findings are still widely required to be further validated through quantitative PCR, also known as qPCR or real‐time PCR as it is still considered the gold standard for gene expres‐ sion quantification. However, it is still questionable whether qPCR validation is still necessary for RNA‐seq studies. High correlation between RNA‐seq and qPCR results has been observed in previous studies [7, 147, 148]. Due to this high consistency, qPCR may be more useful when performed on different biological replicate samples from those already sequenced, confirm‐

In computational biology, annotation is the process of identifying the location and sequence of genomic elements and/or assigning biological function to them. Despite the annotation process

tial expression analysis. These are promising although under development tools.

ing the DGE findings and validating the biological conclusions.

≥ 0.9) observable in Principal Component

RNA‐seq: Applications and Best Practices http://dx.doi.org/10.5772/intechopen.69250 17

comprehensive plots.

*4.3.2. Annotation*

pair‐wise and time series analysis.

a high between replicates correlation (Spearman R<sup>2</sup>

biological replicates, varying from 0.1 to 0.5 [21].

#### *4.2.3. De novo assembly*

In the absence of a reference sequence or if only a fragmented draft genome is available, overlaps have to be detected from the complete read set in a *de novo* assembly approach. The independence from a good quality reference and mapping procedures can be also seen as an advantage. The counterpart is that sequencing depth must be obtained in a higher cover‐ age, estimated around 30× [119], while genome‐guided approach requires about 10× [120, 121] to find full‐length transcripts. The higher throughput increases the processing require‐ ments, so data digital normalization is recommended in order to remove redundancy without impacting the assembly outcome [122]. Although the *de novo* approach is usually more error prone and computationally intensive, it allows the discovery of novel splicing events, unpre‐ dicted genes and exons, chromosomal rearrangements and trans‐splicing. Trinity [123], Oases [124], Rnnotator [119] and Trans‐ABySS [120] are advised for this task. Whenever possible, a combined genome‐guided/*de novo* strategy is recommended, as enhanced performance is observed [125]. A comprehensive overview of transcriptome assembly can be found in Ref. [121]. Evaluation of the assembly quality and transcriptome completeness can be assessed with Detonate [126], TransRate [127] and BUSCO [128].

#### *4.2.4. Visualization*

Alignment output SAM files are hard to be interpreted with common text editors, and there‐ fore, a number of graphical browsers have been developed to inspect NGS sequencing data at any specific loci at nucleotide level. IGV [129], Tablet [130], Browser Genome [131] and UCSC [132] are extremely useful when validating novel transcripts and gene junctions, checking the coverage support for genomic variants and spot read piles, which may represent repetitive regions.

#### **4.3. Downstream analyses**

After conducting these general steps, the experiments can be directed to specific applications in order to address the scientific questions, designated as downstream analysis.

#### *4.3.1. Quantification and differential expression*

The primary goal of most RNA‐seq projects is to quantify and compare the gene expres‐ sion under different conditions and infer biological function to differential expression at gene or transcript level. Intra‐sample abundance comparisons were commonly performed with Reads Per Kilobase per Million (RPKM mapped reads) or Fragments Per Kilobase per Million (FPKM mapped reads) metrics. Their principle is to count the amount of raw reads mapped to each genomic feature and normalize considering the gene length and library depth. Although still widely applied, these normalization metrics should be avoided as RPKM has shown to be inconsistent and Transcripts Per Million (TPM) is preferable [133]. Raw reads counting can be obtained with feature counts [134] and HTSeq‐count [135], which are capable of detecting multi‐mapping reads, exon junctions and overlapping reference features. NOISeq [136] can be used to assess the count quality parameters, such as saturation and specificity, in a set of comprehensive plots.

DESeq [137], DESeq2 [138] and edgeR [139] packages are recommended for between‐sample comparisons to detect differences in the relative abundances of genes [140]. Quantification at transcript level can be analyzed with Cufflinks [115] and RSEM [141] and compared with DESeq2, CuffDiff2 [142], BitSeq [143] or Ballgown [144]. Variations in expression between dif‐ ferent conditions are usually measured in log2 fold‐change units. DESeq2 can also perform pair‐wise and time series analysis.

Generally, a control set of housekeeping genes should present non‐differential expression and a high between replicates correlation (Spearman R<sup>2</sup> ≥ 0.9) observable in Principal Component Analysis (PCA) plots [18]. For a set of 12 or less replicates, at gene level, edgeR or DESeq2 is recommended to detect differential expression and DESeq when more than 12 replicates are available [21]. Thresholds in log2 fold‐change should be applied to increase the true positive and decrease the false positive rates, but this parameter is highly dependent on the amount of biological replicates, varying from 0.1 to 0.5 [21].

Recently, quasi‐mapping (or pseudoalignment) approaches have been proposed for RNA‐ seq quantification, like Kallisto [145] and Salmon [146]. Their main difference is that reads are assigned to reference sequences without base‐to‐base alignment, making analyses usu‐ ally considerably faster. They have shown comparable performance over complete mapping‐ based methods, can incorporate information from multi‐mapping reads, and provide counts and abundances already as normalized TPM values, which can be used as input for differen‐ tial expression analysis. These are promising although under development tools.

Although RNA‐seq provides a precise and accurate estimation of RNA abundance, these findings are still widely required to be further validated through quantitative PCR, also known as qPCR or real‐time PCR as it is still considered the gold standard for gene expres‐ sion quantification. However, it is still questionable whether qPCR validation is still necessary for RNA‐seq studies. High correlation between RNA‐seq and qPCR results has been observed in previous studies [7, 147, 148]. Due to this high consistency, qPCR may be more useful when performed on different biological replicate samples from those already sequenced, confirm‐ ing the DGE findings and validating the biological conclusions.

#### *4.3.2. Annotation*

isoforms can be inferred. Cufflinks [115], Scripture [116] and StringTie [117] are recommended tools, and their algorithm strategies have been reviewed [118], with StringTie [117] presenting better transcript reconstruction performance. Paired‐end, strand‐specific libraries and longer reads are highly encouraged for better assemblies and to allow distinction in overlapping tran‐ scripts from opposite strands for gene‐dense species and antisense transcription. Genome‐ guided assembled transcriptomes can be used to improve gene structures annotation through

In the absence of a reference sequence or if only a fragmented draft genome is available, overlaps have to be detected from the complete read set in a *de novo* assembly approach. The independence from a good quality reference and mapping procedures can be also seen as an advantage. The counterpart is that sequencing depth must be obtained in a higher cover‐ age, estimated around 30× [119], while genome‐guided approach requires about 10× [120, 121] to find full‐length transcripts. The higher throughput increases the processing require‐ ments, so data digital normalization is recommended in order to remove redundancy without impacting the assembly outcome [122]. Although the *de novo* approach is usually more error prone and computationally intensive, it allows the discovery of novel splicing events, unpre‐ dicted genes and exons, chromosomal rearrangements and trans‐splicing. Trinity [123], Oases [124], Rnnotator [119] and Trans‐ABySS [120] are advised for this task. Whenever possible, a combined genome‐guided/*de novo* strategy is recommended, as enhanced performance is observed [125]. A comprehensive overview of transcriptome assembly can be found in Ref. [121]. Evaluation of the assembly quality and transcriptome completeness can be assessed

Alignment output SAM files are hard to be interpreted with common text editors, and there‐ fore, a number of graphical browsers have been developed to inspect NGS sequencing data at any specific loci at nucleotide level. IGV [129], Tablet [130], Browser Genome [131] and UCSC [132] are extremely useful when validating novel transcripts and gene junctions, checking the coverage support for genomic variants and spot read piles, which may represent repetitive

After conducting these general steps, the experiments can be directed to specific applications

The primary goal of most RNA‐seq projects is to quantify and compare the gene expres‐ sion under different conditions and infer biological function to differential expression at gene or transcript level. Intra‐sample abundance comparisons were commonly performed with

in order to address the scientific questions, designated as downstream analysis.

detection of transcription boundaries and splice‐sites.

16 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

with Detonate [126], TransRate [127] and BUSCO [128].

*4.2.3. De novo assembly*

*4.2.4. Visualization*

**4.3. Downstream analyses**

*4.3.1. Quantification and differential expression*

regions.

In computational biology, annotation is the process of identifying the location and sequence of genomic elements and/or assigning biological function to them. Despite the annotation process being mostly carried over genomic sequences, such as newly sequenced genomes, RNA‐seq data can provide valuable information to improve existing annotations [149] or create novel transcript annotations for an unsequenced organism [123].

have been published since then, the original algorithm is still the most widely used. In order to perform an enrichment analysis from RNA‐Seq data, the GSEAPreranked software is rec‐

RNA‐seq: Applications and Best Practices http://dx.doi.org/10.5772/intechopen.69250 19

A gene set is a set of genes related to the feature to be tested for enrichment. A variety of fea‐ tures can be tested from general features such as pathways and chromosome location, to more specific features such as cancer signatures or miRNAs targets. Gene sets can be obtained from the Molecular Signatures Database (MSigDB) that comprehends thousands of pre‐defined

A ranked list of the genes needs to be provided to test if the chosen gene set is significantly enriched at either end of the ranking. The list can be ranked according to any quantitative

Alternative splicing (AS) is a post‐transcriptional mechanism present in the majority of eukaryotes that greatly increases the diversity of proteins that can be encoded by a deter‐ mined genome. This process occurs when particular regions of a gene are included or excluded, through splicing, from the final processed mRNA sequence. AS can occur in sev‐ eral ways, such as exon skipping, intron retention, alternative 5′ donor and 3′ receptor sites [161, 162], analysis of new AS events or patterns is relevant since many traits, especially genetic diseases such as cancers, are related with disorders in splicing patterns that generates

AS analysis by deep sequencing requires splice‐aware programs capable of aligning tran‐ scripts reads to a reference genome while performing the difficult task of placing spliced reads across introns by determining the exon‐intron boundaries. A systematic evaluation of splice‐aware alignment programs for RNA‐seq data performed by the RNA‐seq Genome Annotation Assessment Project (RGASP) Consortium [109] tested 26 RNA‐seq alignment pro‐ tocols and concluded that, in general, GSNAP [164], MapSplice [165] and STAR [106] com‐ pared favorably to other methods. Still, two of this software (GSNAP and STAR) presented many false exons junctions in the output if they were not filtered based on the number of

Following the alignment step, software like cufflinks [115], scripture [116] and StringTie [117] can be used to perform transcript reconstruction, which can reveal new splicing isoforms evidenced by the alignments. This step usually yields an updated GTF annotation file as output that can be

If data from different conditions are available, differential AS analysis can be performed. With the alignment results (SAM file) and a GTF annotation file at hand, differential exon usage analysis can be performed with DEXSeq [166] and differential analysis of AS events, such as skipped exon, alternative 5′ and 3′ splice site, mutually exclusive exons, and retained intron events can be performed with rMATS [167]. There are plenty of other software specialized in performing differential AS analysis each one with their advantages and disadvantages, such

as CuffDiff [115], Ballgown [144], SpliceR [168], MISO [169] and DiffSplice [170].

ommended and it requires two types of data: a gene set list and a ranked list.

feature such as gene expression or fold‐change results from DGE analysis.

gene sets, or it can be created by the user.

*4.3.4. Alternative splicing*

aberrant variants [162, 163].

supporting alignments.

used in subsequent steps.

The major drawback of using genome sequences for annotation is that only features with pat‐ terns or conservation with annotated elements, such as open reading frames (ORFs), tRNAs and rRNAs can be inferred from it. On the other hand, RNA‐seq data provide a new layer of information that allows precise identification of pattern less features such as untranslated regions (UTRs), non‐coding RNAs and post‐transcriptional events. Even though some features can be somewhat inferred through DNA sequences, for example, Transcription Start Site (TSS), TATA box/CpG islands and splicing sites, transcriptomic data still provide a more reliable annotation.

Transcriptome assembly, *de novo* or reference‐guided, often reveals new potential transcripts whose functions are unknown. Before any further step can be made, it is crucial to gather information on these transcripts function in order to extract any meaningful answer.

The most common approach to annotate a transcript is to look for similar known transcripts or protein sequences in large databases. This is usually done using versatile tools like BLAST/ BLASTX [150, 151] or DIAMOND [152] when looking for similar nucleotide or protein sequences. It is often better to perform searches at protein level since it is easier to find homology, as they tend to be more conserved than nucleotide sequences, especially if the study subject has no close species sequenced.

InterProScan [153] can be used to search for conserved protein signatures. This is especially useful when it is difficult to find full sequence homologs given that the study organism might be too divergent from species sequences available in the database. Protein families often present signature domains that are well conserved even among divergent species, so these signatures can give insights into the putative function of the protein. The process for anno‐ tating non‐coding transcripts differs from protein coding transcripts. They usually present poor sequence conservation since their function relies on factors, such as secondary structure, rather than amino acid sequences. Therefore, their annotation process requires specialized software to detect those intrinsic characteristics of a given class of non‐coding transcripts, for example, tRNAscan‐SE [154] for tRNAs and RNAmmer [155] for rRNAs.

Given the importance of annotation, there are plenty of tools and pipelines developed to streamline this process. Some annotation tools like Blast2GO [156] are generic and very user‐ friendly, although it requires a paid license to use it. Others like Annocript [157], TRAPID [158] and Trinotate [159] are pipelines developed specifically for annotating transcriptomes. It is important to note that although automatic pipelines often ease and speed up the analysis, it comes at a cost of lesser control of the annotation process.

#### *4.3.3. Enrichment analysis*

Functional enrichment analysis is a computational method capable of determining whether a pre‐defined set of genes shows significant differences between samples. The GSEA software from Broad Institute runs the original GSEA algorithm [160]. Although alternative algorithms have been published since then, the original algorithm is still the most widely used. In order to perform an enrichment analysis from RNA‐Seq data, the GSEAPreranked software is rec‐ ommended and it requires two types of data: a gene set list and a ranked list.

A gene set is a set of genes related to the feature to be tested for enrichment. A variety of fea‐ tures can be tested from general features such as pathways and chromosome location, to more specific features such as cancer signatures or miRNAs targets. Gene sets can be obtained from the Molecular Signatures Database (MSigDB) that comprehends thousands of pre‐defined gene sets, or it can be created by the user.

A ranked list of the genes needs to be provided to test if the chosen gene set is significantly enriched at either end of the ranking. The list can be ranked according to any quantitative feature such as gene expression or fold‐change results from DGE analysis.

#### *4.3.4. Alternative splicing*

being mostly carried over genomic sequences, such as newly sequenced genomes, RNA‐seq data can provide valuable information to improve existing annotations [149] or create novel

The major drawback of using genome sequences for annotation is that only features with pat‐ terns or conservation with annotated elements, such as open reading frames (ORFs), tRNAs and rRNAs can be inferred from it. On the other hand, RNA‐seq data provide a new layer of information that allows precise identification of pattern less features such as untranslated regions (UTRs), non‐coding RNAs and post‐transcriptional events. Even though some features can be somewhat inferred through DNA sequences, for example, Transcription Start Site (TSS), TATA box/CpG islands and splicing sites, transcriptomic data still provide a more reliable

Transcriptome assembly, *de novo* or reference‐guided, often reveals new potential transcripts whose functions are unknown. Before any further step can be made, it is crucial to gather

The most common approach to annotate a transcript is to look for similar known transcripts or protein sequences in large databases. This is usually done using versatile tools like BLAST/ BLASTX [150, 151] or DIAMOND [152] when looking for similar nucleotide or protein sequences. It is often better to perform searches at protein level since it is easier to find homology, as they tend to be more conserved than nucleotide sequences, especially if the study subject has no close

InterProScan [153] can be used to search for conserved protein signatures. This is especially useful when it is difficult to find full sequence homologs given that the study organism might be too divergent from species sequences available in the database. Protein families often present signature domains that are well conserved even among divergent species, so these signatures can give insights into the putative function of the protein. The process for anno‐ tating non‐coding transcripts differs from protein coding transcripts. They usually present poor sequence conservation since their function relies on factors, such as secondary structure, rather than amino acid sequences. Therefore, their annotation process requires specialized software to detect those intrinsic characteristics of a given class of non‐coding transcripts, for

Given the importance of annotation, there are plenty of tools and pipelines developed to streamline this process. Some annotation tools like Blast2GO [156] are generic and very user‐ friendly, although it requires a paid license to use it. Others like Annocript [157], TRAPID [158] and Trinotate [159] are pipelines developed specifically for annotating transcriptomes. It is important to note that although automatic pipelines often ease and speed up the analysis,

Functional enrichment analysis is a computational method capable of determining whether a pre‐defined set of genes shows significant differences between samples. The GSEA software from Broad Institute runs the original GSEA algorithm [160]. Although alternative algorithms

example, tRNAscan‐SE [154] for tRNAs and RNAmmer [155] for rRNAs.

it comes at a cost of lesser control of the annotation process.

information on these transcripts function in order to extract any meaningful answer.

transcript annotations for an unsequenced organism [123].

18 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

annotation.

species sequenced.

*4.3.3. Enrichment analysis*

Alternative splicing (AS) is a post‐transcriptional mechanism present in the majority of eukaryotes that greatly increases the diversity of proteins that can be encoded by a deter‐ mined genome. This process occurs when particular regions of a gene are included or excluded, through splicing, from the final processed mRNA sequence. AS can occur in sev‐ eral ways, such as exon skipping, intron retention, alternative 5′ donor and 3′ receptor sites [161, 162], analysis of new AS events or patterns is relevant since many traits, especially genetic diseases such as cancers, are related with disorders in splicing patterns that generates aberrant variants [162, 163].

AS analysis by deep sequencing requires splice‐aware programs capable of aligning tran‐ scripts reads to a reference genome while performing the difficult task of placing spliced reads across introns by determining the exon‐intron boundaries. A systematic evaluation of splice‐aware alignment programs for RNA‐seq data performed by the RNA‐seq Genome Annotation Assessment Project (RGASP) Consortium [109] tested 26 RNA‐seq alignment pro‐ tocols and concluded that, in general, GSNAP [164], MapSplice [165] and STAR [106] com‐ pared favorably to other methods. Still, two of this software (GSNAP and STAR) presented many false exons junctions in the output if they were not filtered based on the number of supporting alignments.

Following the alignment step, software like cufflinks [115], scripture [116] and StringTie [117] can be used to perform transcript reconstruction, which can reveal new splicing isoforms evidenced by the alignments. This step usually yields an updated GTF annotation file as output that can be used in subsequent steps.

If data from different conditions are available, differential AS analysis can be performed. With the alignment results (SAM file) and a GTF annotation file at hand, differential exon usage analysis can be performed with DEXSeq [166] and differential analysis of AS events, such as skipped exon, alternative 5′ and 3′ splice site, mutually exclusive exons, and retained intron events can be performed with rMATS [167]. There are plenty of other software specialized in performing differential AS analysis each one with their advantages and disadvantages, such as CuffDiff [115], Ballgown [144], SpliceR [168], MISO [169] and DiffSplice [170].

#### *4.3.5. Fusion genes*

Fusion genes or chimeras are aberrant alterations commonly found in tumor cells [171] that can be useful biomarkers or therapeutic targets [172]. They may originate from chromosomal rearrangements, insertions, deletions and inversions or even by trans‐splicing events. The increasing throughput and reads length from NGS technologies have facilitated their detec‐ tion and supported the development of several bioinformatic tools [173]. For fusion detection, most and more accurate methods rely on good quality read alignments supporting discordant mappings (read segments aligning to different genes) and both single‐ or paired‐end sequenc‐ ing, although paired data increase the probability of fusion detection [174]. A recent evalua‐ tion defined SOAPfuse [175], FusionCatcher [176] and JAFFA [177] the best tools among 18 options for real and simulated data, and their combination has shown increased performance, albeit high false‐positive rates are still a reality in this field, with space for improvements [178].

is very common to find multi‐mapping reads since we are dealing with very small sequences. Similarly to conventional RNA‐seq, multi‐mapping reads are usually not taken into account for the abundance estimation, since it is impossible to know from where the read was origi‐ nated. As long as these aligners are properly set, they should yield similar results [181].

RNA‐seq: Applications and Best Practices http://dx.doi.org/10.5772/intechopen.69250 21

Please note that for the aforementioned pipeline, miRNA annotations or sequences are usu‐ ally required for raw counting estimation. If annotations are not available for the study subject or looking for novel miRNAs candidates, algorithms such as miRdeep2 [184], miReNA [185] and miRanalyzer [186] can be used to annotate novel canonical and non‐canonical miRNA.

After raw miRNAs abundances are estimated, a normalization step is required in order to remove bias of non‐biological origin (e.g., sequencing depth, sample handling, library prepa‐ ration). A good normalization technique should reduce those biases without generating noise, so that the remaining differences between samples are truly of biological origin. Previous comparative studies on normalization procedures for miRNA data resulted in conflicting results. A study from Garmire and Subramaniam [187] supported the use of quantile and Lowess normalization, while Tam et al. [181] and Dillies et al. [140] advocated for the use of Trimmed Mean of M (TMM) and Upper quartile (UQ) normalization. Nevertheless, the results from any of these methods and also DESeq2 normalization [138] method should be highly similar, while other normalization methods such as CPM, total count scaling and lin‐ ear regression should be avoided since they tend to present higher variance and bias [181]. Several R/Bioconductor packages can be used to normalize the data and also run differential expression, such as edgeR [139] (TMM and UQ), DESeq2 [138] (DESeq normalization) and

After all these processing steps, the resulting miRNA estimation is ready to conduct down‐ stream analysis. This can be done with useful databases. Being the primary miRNA sequence repository, miRBase [189] contains several features that may help to investigate the roles for miRNAs of interest, such as annotations for a wide range of species, references links for stud‐

Quantitative trait loci (QTLs) are genomic regions that contain sequence variants that can affect any given trait. Since genome‐wide association studies (GWAS) started [190], thousands of variants have been associated with complex traits and diseases. The process of assigning variants to a gene is relatively straightforward when variants are located in coding regions that can have a direct effect on a gene product; however, most variants are found in non‐cod‐ ing regions making difficult to identify the causal genes [191]. By integrating transcriptomic data, it is possible to identify causal genes for non‐coding variants that affect its expression. When the trait in question is gene expression, they are referred as expression quantitative trait loci (eQTLs) that, similarly to other QTLs, are sequence variants capable of affecting the expression level of one or more genes that will ultimately result in different phenotypes. eQTLs can be classified according to the location of the QTL itself and its targeted gene, and

limma [188] (quantile and cyclic loess).

ies and deep sequencing evidence.

according to the mechanism that affects the expression [192].

*4.3.7. eQTL*

#### *4.3.6. miRNA*

MicroRNAs (miRNAs) are a subset of small non‐coding RNAs, usually 21–23 nt long that play a post‐transcriptional regulatory role in several pathogenic and developmental pro‐ cesses [179]. These molecules are part of an RNA‐induced silencing complex (RISC) contain‐ ing Dicer, Argonaute and many associated proteins that can cause enhanced decay/cleavage of mRNA target, elongation and ribosomal binding inhibition, thus acting at transcriptional and translational levels [180].

A common miRNA pipeline follows the same steps as the conventional RNA‐seq: (i) raw data must be preprocessed as previously described where adapters and low quality bases are trimmed with a minimum length filter (e.g., 18–21 nt for miRNAs), (ii) sequences are mapped to a reference (genome, RefSeq, miRBase) and raw counts are estimated, (iii) the raw count of mapped reads is normalized and (iv) downstream analysis is conducted to investigate biologically relevant questions. Due to its small nature, miRNA sequencing analysis has some caveats that require attention especially in steps (ii) and (iii).

The read mapping step is crucial for accurate miRNA abundance estimation, and there‐ fore, the alignment algorithm must be carefully selected and adjusted to deal with its small size. Although a wide range of software are available to perform this task, some aligners are designed and optimized for specific tasks (e.g., SNP calling, splicing detection, gapped alignment) that might not be appropriate for the task at hand [181]. Compared with conven‐ tional RNA‐seq, indels and splicing events are usually not relevant to miRNA alignment, and therefore, splice‐aware aligners are not required for this task. To these extent general purpose aligners such as BWA‐MEM, bowtie [182] and STAR [106] can be used. Most aligners default settings are set for conventional longer RNA‐seq reads, and since miRNAs are very short, aligners' parameters should be tweaked. The default seed size for these aligners is longer than miRNA sizes and therefore should be set to a value that is at least shorter than the smallest read size. Given that sequencing errors might occur and the fact that many miRNAs often does not present an exact match with their target, it is recommended to allow at least one mismatch in the seeding and alignment process as well [183]. Also during the mapping step, it is very common to find multi‐mapping reads since we are dealing with very small sequences. Similarly to conventional RNA‐seq, multi‐mapping reads are usually not taken into account for the abundance estimation, since it is impossible to know from where the read was origi‐ nated. As long as these aligners are properly set, they should yield similar results [181].

Please note that for the aforementioned pipeline, miRNA annotations or sequences are usu‐ ally required for raw counting estimation. If annotations are not available for the study subject or looking for novel miRNAs candidates, algorithms such as miRdeep2 [184], miReNA [185] and miRanalyzer [186] can be used to annotate novel canonical and non‐canonical miRNA.

After raw miRNAs abundances are estimated, a normalization step is required in order to remove bias of non‐biological origin (e.g., sequencing depth, sample handling, library prepa‐ ration). A good normalization technique should reduce those biases without generating noise, so that the remaining differences between samples are truly of biological origin. Previous comparative studies on normalization procedures for miRNA data resulted in conflicting results. A study from Garmire and Subramaniam [187] supported the use of quantile and Lowess normalization, while Tam et al. [181] and Dillies et al. [140] advocated for the use of Trimmed Mean of M (TMM) and Upper quartile (UQ) normalization. Nevertheless, the results from any of these methods and also DESeq2 normalization [138] method should be highly similar, while other normalization methods such as CPM, total count scaling and lin‐ ear regression should be avoided since they tend to present higher variance and bias [181]. Several R/Bioconductor packages can be used to normalize the data and also run differential expression, such as edgeR [139] (TMM and UQ), DESeq2 [138] (DESeq normalization) and limma [188] (quantile and cyclic loess).

After all these processing steps, the resulting miRNA estimation is ready to conduct down‐ stream analysis. This can be done with useful databases. Being the primary miRNA sequence repository, miRBase [189] contains several features that may help to investigate the roles for miRNAs of interest, such as annotations for a wide range of species, references links for stud‐ ies and deep sequencing evidence.

#### *4.3.7. eQTL*

*4.3.5. Fusion genes*

*4.3.6. miRNA*

and translational levels [180].

caveats that require attention especially in steps (ii) and (iii).

Fusion genes or chimeras are aberrant alterations commonly found in tumor cells [171] that can be useful biomarkers or therapeutic targets [172]. They may originate from chromosomal rearrangements, insertions, deletions and inversions or even by trans‐splicing events. The increasing throughput and reads length from NGS technologies have facilitated their detec‐ tion and supported the development of several bioinformatic tools [173]. For fusion detection, most and more accurate methods rely on good quality read alignments supporting discordant mappings (read segments aligning to different genes) and both single‐ or paired‐end sequenc‐ ing, although paired data increase the probability of fusion detection [174]. A recent evalua‐ tion defined SOAPfuse [175], FusionCatcher [176] and JAFFA [177] the best tools among 18 options for real and simulated data, and their combination has shown increased performance, albeit high false‐positive rates are still a reality in this field, with space for improvements [178].

20 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

MicroRNAs (miRNAs) are a subset of small non‐coding RNAs, usually 21–23 nt long that play a post‐transcriptional regulatory role in several pathogenic and developmental pro‐ cesses [179]. These molecules are part of an RNA‐induced silencing complex (RISC) contain‐ ing Dicer, Argonaute and many associated proteins that can cause enhanced decay/cleavage of mRNA target, elongation and ribosomal binding inhibition, thus acting at transcriptional

A common miRNA pipeline follows the same steps as the conventional RNA‐seq: (i) raw data must be preprocessed as previously described where adapters and low quality bases are trimmed with a minimum length filter (e.g., 18–21 nt for miRNAs), (ii) sequences are mapped to a reference (genome, RefSeq, miRBase) and raw counts are estimated, (iii) the raw count of mapped reads is normalized and (iv) downstream analysis is conducted to investigate biologically relevant questions. Due to its small nature, miRNA sequencing analysis has some

The read mapping step is crucial for accurate miRNA abundance estimation, and there‐ fore, the alignment algorithm must be carefully selected and adjusted to deal with its small size. Although a wide range of software are available to perform this task, some aligners are designed and optimized for specific tasks (e.g., SNP calling, splicing detection, gapped alignment) that might not be appropriate for the task at hand [181]. Compared with conven‐ tional RNA‐seq, indels and splicing events are usually not relevant to miRNA alignment, and therefore, splice‐aware aligners are not required for this task. To these extent general purpose aligners such as BWA‐MEM, bowtie [182] and STAR [106] can be used. Most aligners default settings are set for conventional longer RNA‐seq reads, and since miRNAs are very short, aligners' parameters should be tweaked. The default seed size for these aligners is longer than miRNA sizes and therefore should be set to a value that is at least shorter than the smallest read size. Given that sequencing errors might occur and the fact that many miRNAs often does not present an exact match with their target, it is recommended to allow at least one mismatch in the seeding and alignment process as well [183]. Also during the mapping step, it

Quantitative trait loci (QTLs) are genomic regions that contain sequence variants that can affect any given trait. Since genome‐wide association studies (GWAS) started [190], thousands of variants have been associated with complex traits and diseases. The process of assigning variants to a gene is relatively straightforward when variants are located in coding regions that can have a direct effect on a gene product; however, most variants are found in non‐cod‐ ing regions making difficult to identify the causal genes [191]. By integrating transcriptomic data, it is possible to identify causal genes for non‐coding variants that affect its expression. When the trait in question is gene expression, they are referred as expression quantitative trait loci (eQTLs) that, similarly to other QTLs, are sequence variants capable of affecting the expression level of one or more genes that will ultimately result in different phenotypes. eQTLs can be classified according to the location of the QTL itself and its targeted gene, and according to the mechanism that affects the expression [192].

Regarding the eQTL‐Gene position, when they are located close to the genes, they influence they are called local eQTLs. Local eQTLs can affect a gene in two ways: in *cis* (cis‐eQTL) when the variant affects only the gene that is located on the same chromosome and not affecting the copy of the homologous chromosome, thus causing an allelic imbalance; and in *trans* (trans‐eQTL) when the eQTLs do not affect the target expression directly, but instead affect an intermediate factor that will ultimately affect its target expression. Since the inter‐ mediate factor acts equally for both alleles, it does not cause allelic imbalance. On the other hand, eQTLs located further away from their target genes are referred as distant eQTLs, usually act in *trans* and are harder to find [192]. Several eQTL‐mapping studies published in the past few years showed that many variants often affect gene expression levels of nearby and distant genes [193–197] highlighting the importance of integrating transcriptomic and genomic data.

**Author details**

**References**

Michele Araújo Pereira<sup>1</sup>

1 Hermes Pardini Group, Vespasiano, Brazil

Science. 21 Jun 1991;**252**(5013):1651‐1656

Science (80‐). 20 Oct 1995;**270**(5235):484‐487

Biotechnology. Jun 2000;**18**(6):630‐634

Nature Reviews Genetics. Jan 2009;**10**(1):57‐63

ing technology. Trends in Genetics. Sep 2014;**30**(9):418‐426

**320**(5881):1344‐1349

Jan 2010;**11**(1):31‐46

\*, Eddie Luidy Imada2

2 Federal University of Minas Gerais, Belo Horizonte, Brazil

\*Address all correspondence to: michele.pereira@hermespardini.com.br

[1] Ramsay G. DNA chips: State‐of‐the art. Nature Biotechnology. Jan 1998;**16**(1):40‐44

[2] Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, et al. Complementary DNA sequencing: expressed sequence tags and human genome project.

[3] Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression.

[4] Kodzius R, Kojima M, Nishiyori H, Nakamura M, Fukuda S, Tagami M, et al. CAGE:

[5] Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, Johnson D. et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nature

[6] Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain‐terminating inhibitors. Proceedings of the National Academy of Sciences. 1 Dec 1977;**74**(12):5463‐5467

[7] Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M. et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science (80‐). 6 Jun 2008;

[8] Wang Z, Gerstein M, Snyder M. RNA‐Seq: A revolutionary tool for transcriptomics.

[9] Goodwin S, McPherson JD, McCombie WR. Coming of age: Ten years of next‐genera‐ tion sequencing technologies. Nature Reviews Genetics. 17 May 2016 May;**17**(6):333‐351

[10] van Dijk EL, Auger H, Jaszczyszyn Y, Thermes C. Ten years of next‐generation sequenc‐

[11] Metzker ML. Sequencing technologies—The next generation. Nature Reviews Genetics.

Cap analysis of gene expression. Nature Methods. 3 Mar 2006;**3**(3):211‐222

and Rafael Lucas Muniz Guedes1

RNA‐seq: Applications and Best Practices http://dx.doi.org/10.5772/intechopen.69250 23

Despite the mapping process for eQTL analysis being conceptually simple, since this anal‐ ysis is dealing with allelic specific expression, some caution is required during its counting estimation. For the aligning process, general purpose aligners or variant aware aligners such as GSNAP [164] can be used. After the alignment, some steps are recommended for retrieving allelic‐specific counts, such as removing duplicate reads that may arise from PCR artifacts. However, it is important that the choice for discarding a duplicate read is not done by mapping score as this might bias toward the reference allele [198]. Also, mapping bias should be controlled by filtering sites with likely bias [199]. Some tools like ASEReadCounter from GATK for allelic‐specific expression implement these filters by default [200].

The GTEx portal is a valuable resource to study human gene expression and regulation related to genetic variation. It hosts data from several eQTL studies and much information on laboratory and analysis methods for eQTL [201].
