**2.5. Differential expression analysis**

Differential expression analysis helps identify genes that are important in the experimental conditions being tested and hence is the most routine analysis performed using the RNASeq data. In RNASeq data, a linear relationship has been observed between the number of reads that map to a transcript and the abundance of the transcript [12]. The goal of differential expression analysis is to compare these read counts for a feature between distinct sample groups and perform a statistical test to determine whether the difference is significant. For this purpose, a distribution is required to be fitted to the count data using generalized linear model (GLM). Based upon the assumption that reads are independently sampled from a population with a given, fixed fractions of genes, it can be said that the read counts will follow a multi‐ nomial distribution. This multinomial distribution can be approximated by the Poisson distribution and therefore Poisson distribution has been used to test differential expression in several studies [101-103]. But it has been found that this distribution predicts smaller variations than what is seen in the data. To overcome this issue, negative binomial (NB) distribution and beta negative binomial distribution were proposed. NB has been used in several differential expression tools such as edgeR [86], DESeq [96], DESeq2 (an enhanced version of DESeq) [97], and BaySeq [104]. Though these tools use a common distribution, the method of variance (dispersion) estimation differs, which affects the final outcome of the analysis. Cuffdiff2 uses beta negative binomial distribution to fit fragment counts [92].

Recent advances in this area of research suggest that a combination of Poisson distribution and NB distribution may yield better results. Chen et al. [105] derived a novel algorithm XBSeq from DESeq, where they used Poisson distribution to fit read counts that map to nonexonic regions (considered as sequencing noise) and used NB distribution to fit read counts that map to exonic regions (considered as true signals).

Recently, limma [94], a well-known R package for performing differential expression analysis of microarray data, has been empowered with RNASeq data analysis ability. It does not use the above-mentioned distribution, rather converts count data (or normalized count data) to log-counts per million using voom transformation, then fits a linear model to this data and performs differential expression analysis using an empirical Bayes method.

There is no clear evidence as such about the best tool for differential expression analysis; however, multiple studies comparing available methods have been performed. Soneson and Delorenzi [106] evaluated and compared eleven methods for differential expression analysis on simulated and real RNASeq data, whereas Seyednasrollah et al. [107] compared eight widely used tools on real data sets. Both the studies concluded that no single method is optimal under all circumstances. Soneson and Delorenzi [106] observed that limma performed well under many different conditions and Seyednasrollah et al. [107] found limma and DESeq as the preferred choice. Additionally, these studies have suggested that the method of choice should depend on the experimental conditions that include the number of samples per condition.
