**Why should one normalize the expression data?**

RNASeq experiments have multiple sources of systematic variations introduced through intersample differences such as difference in library size (sequencing depth) or unwanted variations due to batch effects such as sampling time or different sequencing technology [12] or through intra-sample differences such as difference in read length [88] or GC content between genes [89, 90]. These variations, if ignored, can dramatically reduce the accuracy of statistical inference and hence should be removed or controlled during statistical analysis. Therefore, read count and FPKM of a feature, as calculated for example by htseq-count and Cufflinks, respectively, may not be appropriate to compare across features and samples without nor‐ malization.

Normalization is a process that aims to ensure that expression estimates are comparable. There are a number of normalization methods, such as:


**f.** Median of ratio: the normalization factor for each sample is computed as the median of ratios of expressions of features over their geometric means across all samples. Finally, each feature expression is divided by this factor to get the normalized expression [96]. Median of ratio has been implemented in R packages DESeq [96], DESeq2 [97], and in Cuffdiff2 [92].

Several publications [98, 99] comparing normalization methods suggest that median of ratio is the best method for normalization in differential expression study for mRNASeq experiment.

In addition to normalization methods, several packages have been developed to control batch effects, for example, svaseq [100]. svaseq can work on both, count-based data (e.g., htseq-count generated data) as well as FPKMs (Cufflinks generated data).
