**2.6. Annotation and pathway analysis**

**f.** Median of ratio: the normalization factor for each sample is computed as the median of ratios of expressions of features over their geometric means across all samples. Finally, each feature expression is divided by this factor to get the normalized expression [96]. Median of ratio has been implemented in R packages DESeq [96], DESeq2 [97], and in

Several publications [98, 99] comparing normalization methods suggest that median of ratio is the best method for normalization in differential expression study for mRNASeq experiment.

In addition to normalization methods, several packages have been developed to control batch effects, for example, svaseq [100]. svaseq can work on both, count-based data (e.g., htseq-count

Differential expression analysis helps identify genes that are important in the experimental conditions being tested and hence is the most routine analysis performed using the RNASeq data. In RNASeq data, a linear relationship has been observed between the number of reads that map to a transcript and the abundance of the transcript [12]. The goal of differential expression analysis is to compare these read counts for a feature between distinct sample groups and perform a statistical test to determine whether the difference is significant. For this purpose, a distribution is required to be fitted to the count data using generalized linear model (GLM). Based upon the assumption that reads are independently sampled from a population with a given, fixed fractions of genes, it can be said that the read counts will follow a multi‐ nomial distribution. This multinomial distribution can be approximated by the Poisson distribution and therefore Poisson distribution has been used to test differential expression in several studies [101-103]. But it has been found that this distribution predicts smaller variations than what is seen in the data. To overcome this issue, negative binomial (NB) distribution and beta negative binomial distribution were proposed. NB has been used in several differential expression tools such as edgeR [86], DESeq [96], DESeq2 (an enhanced version of DESeq) [97], and BaySeq [104]. Though these tools use a common distribution, the method of variance (dispersion) estimation differs, which affects the final outcome of the analysis. Cuffdiff2 uses

Recent advances in this area of research suggest that a combination of Poisson distribution and NB distribution may yield better results. Chen et al. [105] derived a novel algorithm XBSeq from DESeq, where they used Poisson distribution to fit read counts that map to nonexonic regions (considered as sequencing noise) and used NB distribution to fit read counts that map

Recently, limma [94], a well-known R package for performing differential expression analysis of microarray data, has been empowered with RNASeq data analysis ability. It does not use the above-mentioned distribution, rather converts count data (or normalized count data) to log-counts per million using voom transformation, then fits a linear model to this data and

performs differential expression analysis using an empirical Bayes method.

generated data) as well as FPKMs (Cufflinks generated data).

126 Next Generation Sequencing - Advances, Applications and Challenges

beta negative binomial distribution to fit fragment counts [92].

to exonic regions (considered as true signals).

Cuffdiff2 [92].

**2.5. Differential expression analysis**
