Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

RNA‐seq: Applications and Best Practices http://dx.doi.org/10.5772/intechopen.69250

Sanger sequencing [6], but with advances in next‐generation sequencing (NGS) technology, transcriptomic studies have evolved considerably and RNA‐seq [7, 8] has become the state of the art for transcriptome analysis.

RNA‐seq consists of the direct sequencing of transcripts by NGS. Several NGS platforms [9–11] are commercially available. In general, an RNA set of interest is converted to a library of complementary DNA (cDNA) fragments and sequenced in a high‐throughput manner. Compared to ESTs, RNA‐seq provides better resolution and representativeness, whereas compared to microarrays, its independence from reference sequences facilitates the discovery of novel genes and isoforms [8].

RNA‐seq experiments pose challenges from experimental design to data analysis. Since a complete understanding of each step is critical to making the right decisions, this chapter covers the essential principles required for a successful RNA‐seq experiment, focusing on best recommended practices based on specialized and recent literature. Basic techniques and well‐known algorithms are presented and discussed, guiding both beginners and experienced users in the implementation of reliable experiments.

**2. Experimental design**

A successful RNA‐seq experiment depends on a good experimental design, yet proper planning is not always done. Many experimental options are available, and understanding each step is essential to make the right decisions and avoid inconclusive results. These choices depend on extrinsic (e.g., cost, time, sample availability) and intrinsic (e.g., experimental design complexity, transcriptional variability among tissues, samples and organisms) factors. The amount of available resources is usually the main extrinsic limiting factor driving researchers' decisions. The first step is to identify the main goal of the RNA‐seq experiment in order to choose the best approach. Qualitative (e.g., annotation) and quantitative (e.g., differential gene expression, DGE) data analyses differ in their requirements for starting RNA amount, number and type of replicates, library type and preparation, sequencing platform, throughput, coverage and depth, and read length. Scotty [12], RNASeqPower [13] and RnaSeqSampleSize [14] are statistical tools designed to aid experimental design by adjusting many of these variables to the main objective while taking financial limitations into account. A detailed workflow from experimental design to library sequencing is presented in **Figure 1**.

**Figure 1.** A typical RNA‐seq workflow. (1) Experimental design: definition of qualitative and quantitative goals. Differential gene expression among different conditions is exemplified; (2) Sample selection, RNA extraction and elimination of contaminants such as genomic DNA; (3) Assessment of RNA integrity; (4‐6) RNA enrichment. (4) mRNA enrichment using magnetic or cellulose beads coated with oligo(dT) molecules or oligo(dT) priming; (5) mRNA enrichment through rRNA depletion with conserved probes or Selective Depletion of abundant RNA (SDRNA); (6) Small RNA size‐selection through electrophoresis or based on solid phase extraction; (7‐9) cDNA single/double strand synthesis. (7) cDNA synthesis followed by fragmentation; (8) mRNA fragmentation followed by cDNA synthesis; (9) cDNA synthesis for small RNA without fragmentation; (10) Adapter ligation; (11) Library quantification and (12) Library sequencing with NGS technology.

#### **2.1. Starting sample amount**

The necessary starting amount of RNA varies between kits and platforms, and the amount of available RNA is one of the limiting factors for an RNA‐seq experiment. Most library construction kits require micrograms of RNA, sometimes limited to high‐quality samples. Takara Bio USA Inc offers kits for low‐quantity and/or low‐quality RNA samples: SMARTer Ultra Low mRNA‐seq kits (as little as 1 cell or 10 pg of total RNA), SMARTer Stranded kits (100 pg, regardless of RNA quality) and SMARTer Universal kits (200 pg, regardless of RNA quality). These kits are compatible with both Illumina and Ion Torrent platforms. NuGEN also offers kits with input RNA levels of 10 pg (Ovation Ultralow Library System V2 and Ovation SoLo RNA‐Seq System), available only for Illumina. For a comparison of four commercially available RNA amplification kits using low‐input RNA samples, see Ref. [15].


#### **2.2. Replicates**

The variability of an RNA‐seq experiment depends on the organism, the biological question under investigation and the available laboratory techniques, and it can be decomposed into technical and biological variance. Technical replication consists of the repeated analysis of the same sample to estimate the variance associated with the technology, that is, equipment and protocols [16]. If only the analysis of experimental error is desired, technical replication is sufficient. Otherwise, biological replicates are necessary [17]. Three biological replicates are the minimum suggested for any inferential analysis [18], although the number required for a reliable RNA‐seq experiment depends on the desired statistical power. For example, in DGE analysis, adding biological replicates is recommended over increasing the sequencing depth [19, 20], and 6 to 12 biological replicates have been suggested [21]. Biological replication is often preferable because it strengthens the inferential analysis and increases statistical power. Statistical knowledge helps in understanding the different analysis methods required for different levels of replication [16, 17, 22].
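The trade‐off between replication and depth can be illustrated with a rough sample size calculation. The sketch below is not taken from the chapter: it assumes a normal approximation on log counts, in which the variance per sample is a Poisson term (1/depth) plus a biological term (the squared coefficient of variation), similar in spirit to what power tools such as RNASeqPower report. The function name, parameters and input values are hypothetical.

```python
from math import ceil, log
from statistics import NormalDist


def replicates_per_group(depth, cv, effect, alpha=0.05, power=0.8):
    """Approximate biological replicates per group for a two-group DGE test.

    depth  -- average read count for the gene of interest (illustrative input)
    cv     -- biological coefficient of variation within a group
    effect -- fold change to detect (e.g. 2.0 for a twofold change)
    """
    z = NormalDist()                    # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    # Assumed variance of a log count: Poisson term plus biological term.
    variance = 1 / depth + cv ** 2
    n = 2 * (z_alpha + z_beta) ** 2 * variance / log(effect) ** 2
    return ceil(n)


if __name__ == "__main__":
    for depth in (10, 100, 1000):
        n = replicates_per_group(depth, cv=0.4, effect=2.0)
        print(f"depth {depth:>4}: {n} replicates per group")
```

Under these assumed inputs, raising the per‐gene depth from 100 to 1000 reads does not lower the estimate, because the biological term dominates the variance once the Poisson term is small. This is one way to see why extra replicates are often favored over extra depth.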

#### **2.3. Sequencing platforms**

Several sequencing platforms are available, with diverse data formats, throughputs and qualities [9–11]. Two commonly used approaches are sequencing by synthesis (e.g., Illumina, Helicos and PacBio) and ion semiconductor sequencing (Ion Torrent). Platforms can also be classified as clonal amplification‐based (e.g., Illumina and Ion Torrent) or single‐molecule‐based (e.g., Helicos, PacBio, Nanopore). For RNA‐seq experiments, the most popular platform is Illumina due to its high throughput and low error rate. PacBio has gained attention because its reads can be long enough to recapitulate a full‐length cDNA transcript [23–26]. RNA‐seq approaches can also be combined to take advantage of each method's benefits. Further information and comparison studies are available in Refs. [11, 27–29].

#### **2.4. Sequencing depth**

The required sequencing depth for RNA‐seq experiments varies widely. Transcripts are expressed at different levels within the cell, and their coverage differs considerably in any RNA‐seq experiment. Deeper sequencing is required to detect low‐abundance transcripts and rare splicing events, but their relevance can only be assessed with good biological replication [30]. However, deeper sequencing may increase the detection of off‐target RNA species and the number of false positives in differential expression calls [31]. An analysis of the correlation between sequencing depth and accuracy demonstrated that as few as one million reads can provide similar information on transcript abundance as more than 30 million reads for highly expressed genes. This result was consistent across six widely used model organisms (*Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Mus musculus* and *Saccharomyces cerevisiae*) that represent a wide range of genome sizes [32]. For the majority of human tissue genes, the amount required was about 15–50 million reads [33]. It is noteworthy that there is a saturation point beyond which deeper sequencing yields only a small gain of information. More about the impact of sequencing depth on gene detection, gene expression quantification and structural variant discovery can be found in Ref. [33].
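The saturation effect can be illustrated with a small simulation‐free calculation. The sketch below is only a toy model, not the analysis from Refs. [32, 33]: it assumes that the read count of each gene is Poisson distributed with a mean proportional to its share of the transcriptome, and it uses a hypothetical, strongly skewed expression profile. A gene counts as detected when it receives at least one read.

```python
from math import exp


def toy_expression_profile(n_genes=10000):
    """Hypothetical profile: abundance of the gene at rank r is proportional to 1/r^2."""
    weights = [1 / rank ** 2 for rank in range(1, n_genes + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_genes_detected(fractions, total_reads):
    """Expected number of genes with at least one read.

    Assumes Poisson sampling, so P(at least one read) = 1 - exp(-reads * fraction).
    """
    return sum(1 - exp(-total_reads * f) for f in fractions)


if __name__ == "__main__":
    profile = toy_expression_profile()
    for millions in (1, 5, 15, 30, 50):
        detected = expected_genes_detected(profile, millions * 1_000_000)
        print(f"{millions:>3} M reads: ~{detected:.0f} of {len(profile)} genes")
```

Because each detection probability is a concave function of depth, every additional million reads contributes fewer newly detected genes than the previous one. Real saturation curves depend on the organism, tissue and library, so the numbers printed here carry no biological meaning.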

#### **2.5. Read length**


Short‐read sequencing is cheaper than long‐read sequencing. RNA‐seq experiments usually make use of short reads; however, longer reads can be helpful and more informative. Reads are usually shorter than full‐length transcripts, and a single read may map to multiple positions in the genome, hindering expression analysis and transcriptome assembly. Longer read length reduces mapping bias and ambiguity in assigning reads to genomic elements [34] and improves splicing detection [35, 36] and complex transcriptome analysis [37, 38]. However, some studies question the advantages of long reads, arguing that for humans there are no substantial improvements in transcriptome assembly quality with reads over 150 base pairs [39] or in differential expression analysis with reads over 50 base pairs [35].
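The link between read length and mapping ambiguity can be shown on a toy sequence. The sketch below is illustrative only: it builds a hypothetical random "genome" that contains one 30‐base repeat in two copies, then counts how many error‐free reads of a given length occur exactly once. Real aligners, sequencing errors and spliced reads are deliberately ignored.

```python
import random
from collections import Counter


def toy_genome(seed=0, flank=200, repeat_len=30):
    """Random sequence carrying the same repeat element at two locations."""
    rng = random.Random(seed)

    def rand_seq(n):
        return "".join(rng.choice("ACGT") for _ in range(n))

    repeat = rand_seq(repeat_len)
    return rand_seq(flank) + repeat + rand_seq(flank) + repeat + rand_seq(flank)


def unique_fraction(genome, read_length):
    """Fraction of all possible error-free reads that occur only once."""
    reads = [genome[i:i + read_length]
             for i in range(len(genome) - read_length + 1)]
    counts = Counter(reads)
    return sum(1 for read in reads if counts[read] == 1) / len(reads)


if __name__ == "__main__":
    genome = toy_genome()
    for length in (10, 20, 40, 60):
        print(f"read length {length}: "
              f"{unique_fraction(genome, length):.3f} uniquely placed")
```

Reads shorter than the repeat can fall entirely inside it and therefore match both copies, whereas reads longer than the repeat extend into the unique flanking sequence. The same reasoning applies to paralogous genes and shared exons, although the read length at which ambiguity disappears depends on the genome.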

#### **2.6. Library type**

Standard RNA‐seq library protocols do not retain the strand orientation of the original transcript, making it difficult to discriminate the expression of overlapping genes. Therefore, it is often desirable to construct strand‐specific libraries [40–42]. Several strand‐specific protocols are available, and they follow two main strategies. The first marks the second strand by chemical modification, preventing it from being amplified by PCR so that only the first strand is amplified. The deoxy‐UTP (dUTP) approach [43] is a well‐known example and one of the leading protocols. The second ligates adapters in a known orientation to the RNA molecule, as in the Illumina RNA ligation method [44]. A comparison of seven library construction protocols revealed strong differences and substantial variation in experimental complexity [40]. Stranded RNA‐seq provides more accurate downstream expression analysis, and it is the recommended approach for RNA‐seq studies [40, 42]. Moreover, the dUTP and the Illumina RNA ligation methods were identified as the best overall protocols [40, 45].
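The benefit of strand information for overlapping genes can be sketched with a minimal read counter. The example below is a simplification with assumptions that go beyond the text: it handles single‐end reads only, reduces each read to one position, and assumes that dUTP libraries yield reads antisense to the original transcript while RNA ligation libraries yield reads in the sense orientation. The gene coordinates, protocol labels and function names are hypothetical.

```python
def transcript_strand(read_strand, protocol):
    """Infer the strand of the original transcript from a single-end read."""
    if protocol == "dutp":
        # Assumed: the sequenced read is antisense to the RNA molecule.
        return "-" if read_strand == "+" else "+"
    if protocol == "rna_ligation":
        # Assumed: the sequenced read has the same orientation as the RNA.
        return read_strand
    if protocol == "unstranded":
        return None  # orientation of the transcript is unknown
    raise ValueError(f"unknown protocol: {protocol}")


def count_reads(reads, genes, protocol):
    """Count reads per gene and reads compatible with more than one gene.

    reads -- list of (position, strand) tuples
    genes -- dict mapping gene name to (start, end, strand)
    """
    counts = {name: 0 for name in genes}
    ambiguous = 0
    for position, read_strand in reads:
        origin = transcript_strand(read_strand, protocol)
        hits = [name for name, (start, end, strand) in genes.items()
                if start <= position <= end
                and (origin is None or origin == strand)]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif len(hits) > 1:
            ambiguous += 1
    return counts, ambiguous


if __name__ == "__main__":
    # Two hypothetical genes on opposite strands overlapping at 400-500.
    genes = {"geneA": (100, 500, "+"), "geneB": (400, 800, "-")}
    reads = [(150, "-"), (450, "-"), (450, "+"), (700, "+")]
    for protocol in ("unstranded", "dutp"):
        print(protocol, count_reads(reads, genes, protocol))
```

In this toy case, the two reads inside the overlap cannot be assigned without strand information, while the stranded interpretation assigns one to each gene. Production tools make the same decision through a library type setting, which must match the protocol actually used.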

#### **2.7. Spike‐in**

The External RNA Control Consortium (ERCC) [46] has developed a set of 92 polyadenylated synthetic spike‐in controls for normalization and noise reduction in gene expression measurements. ERCC spike‐ins mimic eukaryotic mRNAs and can be added ('spiked') in equal amounts to each sample prior to library construction [47]. Ambion ERCC spike‐in control mixes (Thermo Fisher Scientific) are commercially available. Sequins, another set of spike‐in RNA standards, can also be used as internal controls and are freely available for non‐profit research upon request [48]. Normalization methods should be chosen carefully to ensure that the spike‐ins behave as expected. The R package *erccdashboard* [49] and Anaquin [50] can be used for spike‐in analysis.
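Two basic uses of spike‐in counts can be sketched in a few lines. The example below does not reproduce *erccdashboard* or Anaquin: it assumes that the same amount of spike‐in mix was added to every sample, derives a per‐sample scaling factor from the total spike‐in reads, and checks the dose response by fitting the slope of log2 counts against log2 known concentrations. All sample names and numbers are made up for illustration.

```python
from math import log2


def spike_in_size_factors(spike_counts):
    """Per-sample scaling factors from total spike-in reads.

    spike_counts maps a sample name to its list of spike-in read counts.
    Assumes equal spike-in input per sample, so differences in totals
    reflect technical effects only.
    """
    totals = {sample: sum(counts) for sample, counts in spike_counts.items()}
    mean_total = sum(totals.values()) / len(totals)
    return {sample: total / mean_total for sample, total in totals.items()}


def dose_response_slope(known_concentrations, observed_counts):
    """Least-squares slope of log2(count) against log2(concentration).

    Spike-ins with zero reads are skipped; a slope near 1 suggests that
    counts scale proportionally with input amount.
    """
    pairs = [(log2(conc), log2(count))
             for conc, count in zip(known_concentrations, observed_counts)
             if count > 0]
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


if __name__ == "__main__":
    counts = {"sample_1": [10, 20, 30], "sample_2": [20, 40, 60]}
    print(spike_in_size_factors(counts))
    print(dose_response_slope([1, 2, 4, 8], [10, 20, 40, 80]))
```

Dividing endogenous gene counts by these factors is only one simple option, and it is valid only when the equal‐input assumption holds. A slope that departs clearly from 1 would be a reason to inspect the library before relying on spike‐in based normalization.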

RNA quality can be assessed by gel electrophoresis (agarose or polyacrylamide) or with the Agilent Bioanalyzer. RNA quantity can be assessed using a spectrophotometer (e.g., Nanodrop), a fluorometer (e.g., Qubit) or the Agilent Bioanalyzer. No single RNA quantification and quality control method is ideal, and it is necessary to know the limits of each. We recommend the Bioanalyzer since it measures RNA integrity and degradation through the RNA Integrity Number (RIN), a score that allows sample quality comparison on a scale from 1 (most degraded) to 10 (most intact) [54, 55]. There is no consensus on the RIN cut‐off for sample inclusion or exclusion in a study, but RIN ≥ 6 is commonly considered acceptable. DGE analysis can be performed even with RIN scores around 4 [56], but non‐degraded RNA is preferred for a successful transcriptome analysis. It is also important to highlight that some organisms do not present typical rRNA peaks and cannot be evaluated by RIN value. Most insect RNA shows a cleavage of 28S rRNA into two similar fragments (28Sα and 28Sβ) that comigrate with 18S rRNA depending on pretreatment and electrophoresis conditions. This comigration is due to the disruption of the hydrogen bonds that hold the two 28S fragments together. This profile should not be misinterpreted as low integrity or degradation [57]. In these cases, check the overall Bioanalyzer trace. More information about each method and a comparison study can be found in Refs. [58, 59], respectively.

*3.1.2. RNA enrichment*

The type of RNA molecule of interest drives the RNA enrichment approach. Selection of mature mRNAs by their poly(A) tails is the most common application and can be carried out with magnetic or cellulose beads coated with oligo(dT) molecules or through oligo(dT) priming for reverse transcription (RT). However, since RNA from formalin‐fixed, paraffin‐embedded (FFPE) samples is degraded and poly(A)‐based mRNA‐seq poorly captures degraded mRNAs, it is not an appropriate method for FFPE samples [42], unless adapted protocols are applied, such as the recently described protocol based on *in vitro* T7 transcription for linear amplification of mRNA [60]. To overcome this limitation, rRNA depletion protocols have been developed based on hybridization to highly conserved ribosomal regions, including the selective depletion of abundant RNA (SDRNA) with RNase H [61, 62], Ribominus (Thermo Fisher Scientific), Ribo‐Zero (Illumina), GeneRead (Qiagen) and RiboGone (Takara). Another approach is duplex‐specific nuclease (DSN) normalization, which depletes abundant transcripts such as rRNAs and tRNAs [63, 64]. Samples can also be enriched for small ncRNAs (e.g., miRNA, siRNA and piRNA) via size selection through electrophoresis or solid phase extraction with commercial kits such as mirVana (Thermo Fisher Scientific) and miRNeasy (Qiagen). For comparison studies between these methods, see Refs. [42, 65]. rRNA depletion is recommended over oligo(dT) selection because it can capture a complete view of the transcriptome and can be used for low‐quality RNA samples [65].

#### **3.2. cDNA Library construction**

Library construction includes four steps: (i) RNA/cDNA fragmentation, (ii) cDNA synthesis, (iii) adapter ligation and (iv) quantification. Some specific points are briefly discussed below, and additional information can be found in Refs. [41, 45].
