**2. Transcriptomics**

#### **2.1. Gene expression**

Gene expression at the transcript level is a temporally dynamic process governed by "on" and "off" switching mechanisms constituted by the coordinated action of epigenetic factors and transcriptional regulators. Since gene products are part of the organism's metabolic pathways, failure of the mechanisms controlling protein synthesis can cause metabolic pathways to behave abnormally and thereby lead to disease [5]. Determining or quantifying the amount of transcript under a biological condition provides a clear picture of the involvement of that gene in that particular condition. Quantitative methods are necessary to understand normal cell development and disease mechanisms, and to determine when, where, and how much a gene's expression diverges between biological conditions [1]. Identification of the key genetic factors, markers, or sets of genes responsible for a certain biological process can make a sizable change to existing treatment approaches [6].

#### **2.2. Applicability of transcriptome data**

Although the functions of each gene are not completely defined, information about the involvement of genes in functional pathways is available from biological databases, which provide clues about how each gene behaves in different metabolic pathways. Estimating the genes expressed under a particular biological condition allows comparison with existing annotations. Only a small percentage of the genome is expressed in each cell, and a portion of the RNA synthesized in the cell is specific to that cell type [4]; identifying genes that are differentially expressed in similar tissue but in different contexts therefore has therapeutic significance. Moreover, transcriptome sequencing allows the identification of transcript-level variations such as cassette exons, mutually exclusive exons, intron retention, indels, alternative splice junctions, alternative promoters (**Figure 1**), and isoform-specific expression profiles [7].

#### **2.3. Requirements**

These contexts may refer to a diseased state or to the influence of stimuli such as intrinsic ligands or immunogens [1]. With the total transcripts often referred to as the transcriptome, the stage-specific or cell type-specific transcriptomes of cells are valuable for evaluating the genetic and epigenetic features characteristic of them. From high- to low-input RNA, RNA sequencing methods have improved considerably to capture the inter- and intra-population heterogeneity of cells. Not restricted to messenger RNA (mRNA), these technologies are also increasingly exploited to analyze other transcription products such as microRNAs and lncRNAs, extending down to the roughly 10–30 pg of total RNA in a human cell or tissue sample [2]. RNA transcripts fall into two categories: protein-coding mRNAs, which are translated into protein, and non-coding RNAs, which are involved in regulating gene expression and in maintaining cell structure. mRNA makes up only about 6% of the total RNA content of a cell or tissue; a number of methods and kits are available for RNA extraction from the cell [2, 3].

Human genomes are more than 99.5% identical to each other at the sequence level when analyzed in toto. However, they are also paradoxically personalized and amenable to somatic variation. Hence, cells can be heterogeneous at the genome level even within an individual, and genomic sequence variations need to be accounted for whenever they are analyzed at the transcriptome level. Toward this, the sequence obtained by RNA sequencing also reflects the coding sequence in the genome, RNA editing kept aside. Further, there is a plethora of other sequence determinants that can be analyzed by sequence-based identification of transcripts, including isoforms, gene fusions, and transcripts from putative pseudogenes. Unarguably, human cancer cells and tissues of diverse origins and stages across populations are the most explored differential genomes and transcriptomes to date, accounting for much of the data derived by RNA sequencing [4]. The Cancer Genome Atlas (TCGA) is probably the most extensive resource providing access to cancer data, especially from next-generation sequencing (NGS) platforms. TCGA provides a number of options to perform analysis on cancer-related experimental data and stands as a major data repository for cancer data.

146 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health


The number of biological/technical replicates, adequate sequencing depth, and, essentially, the sequencing quality are the major factors that should be accounted for in a sequencing-based study. Parameters such as the availability of a reference genome for the organism being analyzed, information about the sequencer's quality-score encoding, and whether multiplexing has been performed are also critical for the analysis. One should have a clear understanding of the biological sample, the experimental conditions, and the biological questions in pursuit before starting a bioinformatics analysis of any transcriptome data [9].

**Figure 1.** Alternative splicing. Exons are shown as boxes and introns as lines; promoters are represented by arrows and polyadenylation sites by AAA.

Computational specifications have to be considered to perform a genome assembly or alignment in a reasonable time without interruption. At least an 8-core processor with 16 GB of RAM and sufficiently fast storage is required to perform a genome alignment within a reasonable time [7]. Genome assembly or alignment is the most resource-consuming process; further downstream analyses, such as variant calling or differential expression analysis, can be performed on a desktop with a reasonable configuration.

Computational biologists prefer UNIX-based systems/servers for NextGen sequence analysis, as large data sets are handled more comfortably through the UNIX command line than on a Windows OS [10].

#### **2.4. Software requirements**

A number of established and easily accessible one-stop sequence analysis tools [7, 11] are available online. However, it is important to understand the different steps involved in the analysis pipeline, which are rather similar across them. The pipeline comprises various pieces of software, each of which produces a number of output files. These include the main output file used for further analysis and supporting information such as mapping statistics, which indicate the fraction of input data successfully utilized by the algorithm (a good-quality experiment gives a higher fraction) [7]. One should be aware of the files generated during each analysis step that are fed into the next algorithm in the pipeline.

#### **2.5. Precautions**

A number of algorithms have been developed in recent years, and most are available as open source. It is important to understand that a transcriptome analysis can be completed entirely with open-source software and tools. Before starting the bioinformatics analysis of transcriptome data, one should decide which algorithms to use (**Figure 2**), including their release/version information, for each successive step in the pipeline. By following review articles that compare multiple algorithms, and research publications that have used specific algorithms, an appropriate algorithm can be selected for each step [12]. The next step is to select the annotation files to be used for the analysis.

#### **2.6. File formats**


| Resource | Representation |
| --- | --- |
| UCSC (latest version GRCh38/hg38) | `>chr22` |
| Ensembl | `>22` |
| NCBI reference genome GRCh38.p7 | `>gi\|568801992\|ref\|NT_167212.2\| chromosome 22 genomic scaffold, GRCh38.p7 primary assembly HSCHR22_CTG1_1` |

Transcriptome Sequencing for Precise and Accurate Measurement of Transcripts and...

http://dx.doi.org/10.5772/intechopen.70026

149


Even though the information is the same, its representation varies between annotation files from different biological data resources; the table above shows how human chromosome 22 is represented in various resources. Hence, one should confirm that the annotation files, such as the genome file (.fasta) and the gene transfer format (.gtf) file, are compatible with each other.
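As a quick sanity check, this compatibility can be verified programmatically. The sketch below compares the chromosome names in a FASTA genome against those used in a GTF annotation; the file contents and names are fabricated for illustration.

```python
# Sketch: verify that chromosome names in a reference .fasta match those used
# in a .gtf annotation (e.g., UCSC "chr22" vs. Ensembl "22").

def fasta_chromosomes(lines):
    """Collect sequence names from FASTA header lines (">name ...")."""
    return {line[1:].split()[0] for line in lines if line.startswith(">")}

def gtf_chromosomes(lines):
    """Collect sequence names from the first column of GTF records."""
    return {line.split("\t")[0] for line in lines
            if line.strip() and not line.startswith("#")}

def check_compatibility(fasta_lines, gtf_lines):
    """Return GTF chromosome names missing from the FASTA."""
    return gtf_chromosomes(gtf_lines) - fasta_chromosomes(fasta_lines)

# Example: a UCSC-style genome paired with an Ensembl-style annotation
fasta = [">chr22 Homo sapiens chromosome 22", "GGTACCTTAA"]
gtf = ['22\thavana\tgene\t10736171\t10736283\t.\t-\t.\tgene_id "example"']
missing = check_compatibility(fasta, gtf)
print(missing)  # {'22'} — the annotation uses a name absent from the genome
```

In practice the same check would stream over the real files rather than in-memory lists; an empty result means the pair is safe to use together.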


**Figure 2.** Transcriptomics workflow.


In each step of the analysis pipeline, multiple file formats are generated or used, and it is necessary to know the information contained in each type of file. Here, we discuss file types classified into three categories. The first category is the raw files, which contain the information produced by the sequencer: the raw sequences with a quality score for each base call [13]. These files can have .sff, .csfasta + .qual, .fastq, and other extensions, with .fastq the most common. The second category is the alignment files, which record how each read or fragment was aligned to the reference genome [14]; these can be in .sam, .bam, and .bed formats. The third category is the annotated data files readily available from standard biological databases, such as reference genome sequences (.fasta format) and annotated gene information (.gtf, .gff formats). Apart from these standard formats, there are algorithm-specific files that contain additional information about the specific run of each algorithm in the pipeline.


**Figure 3.** Quality control measures. (a) Per base sequence quality whisker plot: distribution of quality of bases all over the whole file, (b) distribution of percentage of sequences with different quality, and (c) distribution of bases in a .fastq file.

because phred quality assessments are probabilistically stable [13, 29].

## **3. Transcriptome data analysis**

The high-throughput methods described previously (RNA-Seq) sequence complementary DNA (cDNA) directly and as a result give insights into gene expression profiling [12, 15–17], quantification of alternative splicing [8, 9, 18, 19], variant calling [20–23], novel transcripts [14, 24, 25], and several other features. These quantitative measurements are made on the final data produced by each sequencing platform. However, sequencing involves several steps (reverse transcription, amplification, fragmentation, purification, adaptor ligation, and sequencing), and an error in any of them is possible and could produce faulty outputs. In the worst case, this makes the data unsuitable for further analysis, and the experiment may have to be repeated. Nonetheless, these errors can be monitored, and necessary actions can be taken to rectify them prior to analysis. Such preliminary steps are often referred to as quality control analysis of sequencing data.

#### **3.1. Quality control**

This section of the chapter discusses the causes and statistical assessment of errors such as poor sequence read quality, read duplication, GC bias, nucleotide composition bias, adapter contamination, flow cell contamination, enrichment, and false-positive errors [26, 27], and how these can be tackled using available tools. The data used for the analysis in this chapter are mainly in the ".fastq" format, the most common output format on many platforms. Many quality control tools are available, either bundled with the sequencing machine itself or as standalone software (commercial and open source). One popular open-source tool is FastQC [28].
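As an illustration, FastQC can be invoked from a script. The FASTQ file names and output directory below are placeholders; only the `-o` (output directory) and `-t` (threads) options are assumed, and FastQC itself must be installed and on the PATH.

```python
# Sketch: assembling a FastQC command line from Python.
import subprocess

def fastqc_command(fastq_files, out_dir, threads=2):
    """Assemble a FastQC command; '-o' sets the report output directory."""
    return ["fastqc", "-o", out_dir, "-t", str(threads)] + list(fastq_files)

cmd = fastqc_command(["sample_R1.fastq", "sample_R2.fastq"], "qc_reports")
print(" ".join(cmd))
# To actually run it (requires FastQC on PATH):
# subprocess.run(cmd, check=True)
```

FastQC writes one HTML report per input file into the output directory, which can then be inspected for the checks discussed below.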

The output from a sequencing machine includes the sequence of each fragment as well as a score for each base call; here we consider the ".fastq" format, widely used on many platforms, to explain the features. A single read is represented by four consecutive lines in .fastq format. The first and third lines contain sequence identifiers and other optional information, such as machine version and flow cell information, related to the specific run of the sample on the machine. The second line contains the sequence bases, and the fourth the quality value for each base, represented as ASCII characters.
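A minimal illustration of this four-line layout, using a fabricated record and the common Phred+33 encoding:

```python
# Sketch: one FASTQ record, parsed by hand. The record is fabricated.

record = [
    "@SEQ_ID run=1 flowcell=A",   # line 1: identifier and optional run info
    "GATTTGGGGTTCAAAGCAGT",       # line 2: called bases
    "+",                          # line 3: separator (may repeat the ID)
    "IIIIIIIIIIIIIIIIIIII",       # line 4: per-base quality as ASCII
]

header, bases, sep, quality = record
assert len(bases) == len(quality)   # one quality character per base

# Phred+33 encoding: quality score = ASCII code minus 33
scores = [ord(ch) - 33 for ch in quality]
print(scores[0])  # 'I' is ASCII 73, so the first base has Q = 40
```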

This ASCII quality value, or phred quality score, gives an accurate measure of base-calling quality during sequencing. The phred quality score is mathematically defined as

$$Q = -10 \times \log_{10}(P) \quad \text{or} \quad P = 10^{-Q/10} \tag{1}$$

where *Q* is the phred quality score and *P* is the probability of a faulty base call.

In essence, a phred score of 30 means the probability of a base being wrong is 1 in 1000. There is no single standard threshold for acceptable quality, but a phred score above 20–25 (**Figure 3a** and **b**) is generally considered acceptable for further analysis because phred quality assessments are probabilistically stable [13, 29].
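Eq. (1) can be checked numerically; the helper function names below are ours:

```python
# Numerical check of Eq. (1): Q = -10*log10(P) and P = 10^(-Q/10).
import math

def phred_to_error(q):
    """Probability that a base call with phred score q is wrong."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred score for a given error probability."""
    return -10 * math.log10(p)

print(phred_to_error(30))     # 0.001 — i.e., 1 wrong base in 1000
print(error_to_phred(0.001))  # 30.0
print(phred_to_error(20))     # 0.01 — the lower end of the acceptable range
```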


Each sequencer uses its own set of ASCII values to score base calls, with a maximum score of about 41, corresponding to roughly a 1 in 10,000 probability (99.99% accuracy) that a base is called incorrectly (**Table 1**). If the quality of any read region falls to a much lower level, it is better to trim those regions off. Many standard trimming tools are available as open source; a few popular ones are FASTX-Toolkit [30], cutadapt [31], and Trim Galore [32]. They can be used not only for quality trimming but also for other purposes, such as adapter trimming and demultiplexing.
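The idea behind tail trimming can be sketched as follows. This is a simplified illustration, not the actual algorithm of the tools named above (which typically use a running-sum approach); the read is fabricated.

```python
# Sketch of 3'-end quality trimming: drop trailing bases below a threshold.

def trim_tail(bases, quals, min_q=20, offset=33):
    """Trim the read tail after the last base with quality >= min_q."""
    scores = [ord(c) - offset for c in quals]
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return bases[:end], quals[:end]

# Quality drops below Q20 at the tail: '5' encodes Q20, '#' encodes Q2
bases, quals = trim_tail("ACGTACGTAC", "IIIIIII5##")
print(bases)  # ACGTACGT — the two low-quality tail bases are removed
```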



#### **3.2. Evaluation of read quality**

Several statistical analysis pipelines are available as open source to check the quality of NGS data. This section explains the basic background of quality checks: (1) base quality, (2) sequence content and distribution, and (3) duplicated sequences.

#### *3.2.1. Base quality*

As explained previously, base-calling bias must be strictly avoided, because any error in base calling means a base is not correctly identified. This analysis is based on the quality-encoding values given to the reads in the file and depends entirely on the phred quality score along the read length. As an exception, read quality falls toward the end of the reads; this is quite normal for long runs, as the supplied bases become depleted and random base calling leads to these false-positive errors.

Base quality analyses are done to detect and rectify read errors that may have occurred during the run or library preparation. The data from the ".fastq" file can be plotted in different ways: the phred quality score of each base, the proportion of reads called incorrectly, the N-content distribution in the reads, and the sequence length distribution. Obviously, trimmed reads will have an uneven length distribution.
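The per-base quality profile behind a plot like Figure 3a can be computed as follows, using fabricated Phred+33 quality strings:

```python
# Sketch: mean phred score at each read position across a set of reads.

def per_base_mean_quality(qual_strings, offset=33):
    """Mean phred score at each position across reads of equal length."""
    n = len(qual_strings)
    length = len(qual_strings[0])
    return [sum(ord(q[i]) - offset for q in qual_strings) / n
            for i in range(length)]

# 'I' encodes Q40, '?' encodes Q30, '5' encodes Q20
reads = ["IIIIII??", "IIIII???", "IIIIII?5"]
profile = per_base_mean_quality(reads)
print(profile)
# Quality typically falls toward the 3' end, as described in the text.
```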

#### *3.2.2. Sequence content and distribution*

Evaluating the GC content of the sequenced reads is as important as the other modules because it can point to underlying biological causes. GC pairs are favored over AT pairs because of the greater stability of the bonds between them (three hydrogen bonds versus two), and the annealing step of PCR is based on the melting temperature of GC bonding. DNA methylation happens at cytosines, and exons are comparatively higher in GC content than introns.

In an NGS run, the bases are supplied in equal ratio, and each base is expected to account for about 25% of the output (**Figure 3c**). Any deviation from this composition is considered a bias, typically caused by overrepresented sequences such as adapter dimers or rRNA in the sample. However, a small bias over the first few bases from the 5′ end is expected; it is essentially produced by random hexamer priming during amplification.
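A composition check of the kind shown in Figure 3c can be sketched as follows; the reads are fabricated and deliberately balanced:

```python
# Sketch: overall base composition of a read set; an unbiased run should
# give roughly 25% per nucleotide.
from collections import Counter

def composition(reads):
    """Fraction of A/C/G/T over all bases in the reads."""
    counts = Counter(b for read in reads for b in read)
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

reads = ["ACGT" * 5, "TGCA" * 5, "GATC" * 5, "CTAG" * 5]
print(composition(reads))  # each base at exactly 0.25 in this balanced toy set
```

A real QC tool computes this per position rather than overall, which is what exposes the 5′ hexamer-priming bias mentioned above.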

Before starting any analysis, adapters are trimmed from the reads, because adapters remaining in the sample will appear as overrepresented sequences. This is a final check to make sure that any overrepresented sequence or enrichment identified is not spurious.

#### *3.2.3. Duplicated sequences*


| Phred quality score | Probability of incorrect base call | Base call accuracy |
| --- | --- | --- |
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10,000 | 99.99% |
| 50 | 1 in 100,000 | 99.999% |
| 60 | 1 in 1,000,000 | 99.9999% |


**Table 1.** Phred quality score.


As discussed for GC content, there are a few other ways to check for overrepresented sequences. These methods help confirm that the sample is not contaminated and that any apparent enrichment in the reads is genuine. The enrichment analysis is done on different scales, with the read length as the scale here. Counting k-mers of different lengths shows how often an enriched or overrepresented sequence occurs in the reads, providing a second check for contamination in an enrichment study.
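A naive k-mer count is enough to illustrate the idea; the reads and the "adapter" fragment below are fabricated:

```python
# Sketch: k-mer counting to flag overrepresented sequences (e.g., leftover
# adapters or rRNA) in a read set.
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

adapter = "AGATCGGA"
reads = ["ACGT" + adapter, "TTGC" + adapter, "GGAT" + adapter, "ACGTTGCAGGCT"]
top, n = kmer_counts(reads, 8).most_common(1)[0]
print(top, n)  # the adapter 8-mer dominates the k-mer spectrum
```

Production QC tools compare each k-mer's observed frequency against its expected frequency rather than using raw counts, but the principle is the same.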

#### **3.3. Genome alignment**

This is the second major step in transcriptomic data analysis. If a reference genome is available for the organism, the analysis is referred to as resequencing; otherwise, it is de novo sequencing analysis. The resequencing pipeline is comparatively easier than de novo analysis: if a reference genome is available, all we need to do is map the fragments to the genome and find the genes expressed in the experiment. Although the sequencer generates a huge amount of data, each read is short compared to the actual size of the genome, so an advanced, computationally efficient algorithm is required to perform this time-consuming and repetitive process [5].

Genome alignment is the most important step in transcriptome analysis, as all the downstream analyses and the accuracy of the results depend on the efficiency of the alignment algorithm. Because the data are obtained from the transcriptome, reads spanning exon-exon junctions cannot be mapped contiguously to the reference genome. An efficient splice-aware alignment algorithm is therefore required to complete the task [12], and most of these algorithms use a technique called hashing or indexing, applied to the raw data, the genome data, or both.

Read alignment algorithms take a number of parameters: the input reads and the index are mandatory, and many optional parameters can additionally be set, based on the available computational resources, for efficient mapping of reads. For example, we can set the number of multiple alignments allowed for a single read and the maximum insertion or deletion length permitted. A precise understanding of the experimental conditions helps to set appropriate parameters for a specific experiment. Moreover, default values are provided to help avoid confusion [7].
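The hashing/indexing idea mentioned above can be illustrated with a toy seed-and-lookup example. Real splice-aware aligners are far more sophisticated (they handle mismatches, spliced reads, and compressed indexes), so this is only a conceptual sketch and does not reproduce any particular aligner:

```python
from collections import defaultdict

def build_kmer_index(reference: str, k: int) -> dict:
    """Hash every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_positions(read: str, index: dict, k: int) -> set:
    """Candidate alignment start positions implied by exact k-mer seed matches."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset          # shift back by the seed's offset in the read
            if start >= 0:
                candidates.add(start)
    return candidates
```

In practice, each candidate position would then be verified and scored by a full (possibly gapped or spliced) alignment.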

#### **3.4. Gene quantification**

Gene quantification is performed after alignment to a genome. The first step is to determine the number of fragments or reads that map to each genomic location. Gene-level or transcript-level quantification can be performed according to the user's choice. A number of software tools (coverageBED [33], htseq-count [34], and featureCounts [35]) are available for gene quantification. Quantification is performed against a reference annotation (GTF/GFF) file with coordinates for each gene, transcript, or exon. For example, htseq-count uses "--idattr=<id attribute>" to indicate the GFF attribute to be used as the feature ID from the ninth column, where unique IDs or accession numbers are available. Gene quantification has to be combined with normalization to avoid misleading measurements. Hence, gene-level or sample-level normalization of the data in terms of the total number of reads mapped, read length, and coverage should be performed.
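In principle, the counting step intersects read alignments with annotated feature coordinates. The sketch below is a deliberately simplified, hypothetical counter; it only loosely mirrors htseq-count's discarding of ambiguous reads and ignores strandedness, read pairing, and the different overlap modes real tools offer:

```python
def count_reads_per_gene(genes: dict, reads: list) -> dict:
    """
    genes: {gene_id: (start, end)} with 0-based, end-exclusive coordinates
           on one chromosome.
    reads: list of (start, end) alignment intervals on the same chromosome.
    A read is counted for a gene only if it overlaps exactly one gene.
    """
    counts = {g: 0 for g in genes}
    for r_start, r_end in reads:
        hits = [g for g, (g_start, g_end) in genes.items()
                if r_start < g_end and g_start < r_end]   # interval overlap test
        if len(hits) == 1:                                # ambiguous reads are discarded
            counts[hits[0]] += 1
    return counts
```

Real implementations use interval trees or sorted sweeps instead of this quadratic scan, but the counting logic is the same in spirit.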

The reads per kilobase of exon model per million mapped reads (RPKM) measure normalizes for the sequencing depth, which varies significantly between samples, as well as for gene length. The fragments per kilobase of exon model per million mapped reads (FPKM) measure normalizes in the same way as RPKM but for paired-end data, counting fragments instead of reads. The transcripts per million (TPM) measure first normalizes by gene length and then by sequencing depth, which is arguably a better way of normalization because the TPM values of a sample always sum to one million [36].
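Under the definitions above, RPKM and TPM can be computed as follows (a sketch for single-end read counts; FPKM would replace read counts with fragment counts):

```python
def rpkm(counts: dict, lengths_bp: dict) -> dict:
    """RPKM: reads / (gene length in kilobases * total mapped reads in millions)."""
    total = sum(counts.values())
    return {g: counts[g] / ((lengths_bp[g] / 1e3) * (total / 1e6)) for g in counts}

def tpm(counts: dict, lengths_bp: dict) -> dict:
    """TPM: first normalize each gene by its length, then scale so values sum to 1e6."""
    rates = {g: counts[g] / lengths_bp[g] for g in counts}
    denom = sum(rates.values())
    return {g: rates[g] / denom * 1e6 for g in counts}
```

Because TPM values sum to one million by construction, the relative abundance of a gene is directly comparable across samples, which is not guaranteed for RPKM/FPKM.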


**Figure 4.** Transcript enrichment. Cufflinks identifies three transcripts from reads mapped to the same genomic region.

Transcriptome Sequencing for Precise and Accurate Measurement of Transcripts and...

http://dx.doi.org/10.5772/intechopen.70026

155


#### **3.5. Splice variation analysis**

Transcriptome analysis can identify transcript-level sequence features such as cassette exons, mutually exclusive exons, intron retention, indels, alternative splice junctions, and hence the different possible isoforms, all based on genome mapping (**Figure 1**). In human, ~41,000 unique transcripts have been identified from a total of ~20,000 genes (NCBI RefSeq) [37].

Identifying transcripts from the short reads aligned across a gene, and identifying the splice junctions among them, is a challenge in variation analysis. A number of algorithms such as Cufflinks [38], SLIDE [39], and StringTie [40] are available to analyze the alignment together with user-provided existing annotations. Cufflinks [38] efficiently exploits the advantage of paired-end sequencing data to annotate the splice variations (**Figure 4**).
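Splice junctions are visible in the alignments themselves: in SAM/BAM records, the CIGAR `N` operation marks a stretch of the reference (an intron) skipped by a spliced read. A minimal junction extractor, as a sketch:

```python
import re

def junctions_from_cigar(pos: int, cigar: str) -> list:
    """
    Extract intron (splice junction) intervals from a SAM-style CIGAR string.
    'N' denotes a skipped reference region (an intron); M, D, =, and X also
    consume the reference, while I, S, H, and P do not. pos is the 0-based
    leftmost reference coordinate of the alignment.
    """
    junctions = []
    ref = pos
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref, ref + length))   # intron start/end on the reference
        if op in "MDN=X":                           # operations that consume the reference
            ref += length
    return junctions
```

Tools such as Cufflinks and StringTie aggregate exactly this kind of junction evidence across many reads before assembling isoforms.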

#### **3.6. Differential expression analysis**

Once the genome alignment is completed, the downstream analysis can follow two routes: the variation analysis and the differential expression analysis.


Differential expression analysis refers to the gene-level expression difference between two or more samples. This can be performed using R packages such as edgeR [9] and DESeq [10], which load gene quantification information from multiple samples and report the expression-level difference for each transcript/gene. These packages can also generate figures such as heatmaps, histograms, and dispersion plots, which can be used for representing results as well as for publication purposes. The comparison is performed after normalization of the data across samples, accounting for the length of the fragments, the sequencing depth, and the total number of reads mapped. RPKM, FPKM, and TPM are commonly used normalized values. Genes with at least a 2-fold change are usually considered differentially expressed, although a fold change of 1.5 is also used in certain instances.
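The fold-change cutoff described above can be sketched as follows. Note that this captures only the fold-change criterion; edgeR and DESeq additionally model count dispersion and test statistical significance, which this sketch does not attempt:

```python
import math

def log2_fold_change(mean_a: float, mean_b: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of normalized mean expression; a pseudocount avoids log(0)."""
    return math.log2((mean_b + pseudocount) / (mean_a + pseudocount))

def is_differentially_expressed(mean_a: float, mean_b: float,
                                fold_cutoff: float = 2.0) -> bool:
    """Flag a gene whose expression changes at least fold_cutoff in either direction."""
    lfc = log2_fold_change(mean_a, mean_b)
    return abs(lfc) >= math.log2(fold_cutoff)
```

A 2-fold cutoff corresponds to |log2 fold change| ≥ 1; the 1.5-fold cutoff mentioned above corresponds to |log2 fold change| ≥ ~0.585.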

Various graphical methods are available to visually represent the variations identified among the experiments or samples used. An overview of a gene expression study can be represented by a volcano plot, an MA plot, a heatmap, etc. A heatmap with hierarchical clustering clearly represents the trend of gene expression between samples.

Visualization is integral to NGS data analysis, from the evaluation of sequencing quality to the representation of the biologically significant results. Initially, the raw data have to undergo quality checking to assess the overall sequencing quality and to decide on quality measures (FastQC (**Figure 3a**) [28], NGSQC [41]). The next level of visualization applies to the alignment to the genome, where the number of reads aligned to particular genes, exons, introns, and splice junctions can be inspected with genome browsers such as the UCSC browser [42], Integrative Genomics Viewer (IGV) [43], and Genome Maps [44]. Genome browsers load the genome (.fasta), annotations (.gff, .gtf), and variations (as BED files) into their interface to obtain a clear visualization of the collective data for a specified region, along with the available annotation, identified evidence or mapped reads, and observed variations. They also host inbuilt tools to represent the data as plots and figures that can be used for publication [43].
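The quantities behind an MA plot are simple to derive: for each gene, M is the log2 expression ratio between two samples and A is their average log2 expression. A hypothetical sketch (the function name and pseudocount are illustrative):

```python
import math

def ma_values(mean_a: float, mean_b: float, pseudocount: float = 1.0) -> tuple:
    """
    M (log ratio) and A (average log intensity) for one gene, as plotted
    in an MA plot; the pseudocount keeps zero counts finite.
    """
    log_a = math.log2(mean_a + pseudocount)
    log_b = math.log2(mean_b + pseudocount)
    return log_b - log_a, (log_a + log_b) / 2.0   # (M, A)
```

Plotting M against A for every gene (e.g., with any plotting library) yields the MA plot mentioned above; unchanged genes cluster around M = 0.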
