**2. RNA-seq data analysis**

RNA-seq has become a common tool for studying transcriptome complexity and a convenient method for analyzing differential gene expression. A typical RNA-seq data analysis workflow starts by preprocessing the raw reads to remove contamination and run quality control checks. The next step is to align the reads to a reference genome or, if no reference is available, to build a de novo assembly. After alignment, the quantification step counts the aligned reads to produce a count matrix that serves as input for differential expression (DE) analysis. Normalization and DE analysis usually go together, because most DE methods have built-in normalization and accept only a raw count matrix. Because this study focuses on the clustering step, we normalize the raw counts separately and cluster them without performing differential gene expression analysis. The following sections describe each step of the pipeline in more detail (**Figure 2**).

#### **2.1 Preprocessing**

Preprocessing raw reads consists of checking read quality, trimming adapters, removing short reads, and filtering low-quality bases. Tools like FastQC generate a report summarizing the overall quality of the sequence data [2]. Based on this report, we can decide how to configure quality trimming. Trimmomatic is one of many tools used to clean up the raw data. It can remove adapters from the reads, trim low-quality bases from the ends of reads, and filter out short reads, which can align to multiple locations on the reference genome. Once trimming is done, it is good practice to recheck read quality by rerunning FastQC.
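To make the trimming and filtering ideas concrete, the following minimal Python sketch trims low-quality bases from both ends of a read and then discards reads that become too short. It is an illustration only, not the algorithm implemented by Trimmomatic or any other tool; the function names (`trim_ends`, `preprocess`), the in-memory read representation, and the default thresholds are assumptions made for this example, and quality strings are assumed to use Phred+33 encoding.

```python
from typing import Iterable, List, Tuple

# Hypothetical in-memory read: (name, sequence, Phred+33 quality string).
Read = Tuple[str, str, str]


def phred33(qual: str) -> List[int]:
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in qual]


def trim_ends(seq: str, qual: str, min_quality: int) -> Tuple[str, str]:
    """Remove bases below min_quality from the start and end of a read.

    Low-quality bases in the interior of the read are left untouched.
    """
    scores = phred33(qual)
    start, end = 0, len(scores)
    while start < end and scores[start] < min_quality:
        start += 1
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    return seq[start:end], qual[start:end]


def preprocess(reads: Iterable[Read],
               min_quality: int = 20,
               min_length: int = 36) -> List[Read]:
    """Trim read ends, then keep only reads of at least min_length bases."""
    kept = []
    for name, seq, qual in reads:
        seq, qual = trim_ends(seq, qual, min_quality)
        if len(seq) >= min_length:
            kept.append((name, seq, qual))
    return kept
```

In practice this work is delegated to a dedicated tool; the sketch only shows why the two thresholds (a minimum base quality and a minimum read length) have to be chosen from the quality report.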

*Current State-of-the-Art of Clustering Methods for Gene Expression Data with RNA-Seq DOI: http://dx.doi.org/10.5772/intechopen.94069*

**Figure 2.** *RNA-seq data analysis workflow.*

#### **2.2 Alignment**

Now that we have explored the quality of our raw reads, we can move on to read alignment. Read alignment is one of the first steps required for many different types of analysis. It maps the huge number of short RNA sequences (reads) generated by NGS instruments to a reference genome in order to identify the genomic locus from which each read originated. In RNA-seq, alignment is a major step in calculating transcript or gene expression levels. Several splice-aware aligners have been developed for RNA-seq experiments, such as STAR, HISAT2 and TopHat. These aligners are designed to address the specific challenges of mapping RNA-seq data, using strategies that account for spliced alignments [3–5].

#### **2.3 Quantification**

Quantification of gene expression counts the number of reads that map to each gene, using tools such as HTSeq-count, featureCounts or kallisto [6–8]. This step is crucial for differential gene expression analysis, which identifies genes (or transcripts), if any, that show a statistically significant difference in abundance across the experimental groups or conditions.

#### **2.4 Normalization**

The read counts generated in the quantification step need to be normalized to make accurate comparisons of gene expression between samples or when doing an exploratory data analysis. Several normalization methods are used for this purpose, such as CPM (counts per million), TPM (transcripts per million), RPKM/FPKM
