132 Next Generation Sequencing - Advances, Applications and Challenges

**3.4. Read quantification for the estimation of gene expression**

Once the sequenced reads are aligned, gene expression is measured. The most common approach to read quantification is to count the reads that overlap the exons of a gene; if the exon boundaries are not well annotated, this may lead to false-positive hits. Another major challenge in read quantification is reads that map to multiple locations.

**3.5. Count normalization**

Several methods, such as quantile-based normalization, GC-content-based normalization, and a Poisson model with variable rates for different positions, are available to normalize the count data and correct its biases for improved detection of differentially expressed genes [91, 127, 128]. The growing number of normalization methods calls for a state-of-the-art technique for comparing them. In the absence of such a technique, there is no consensus on the best normalization method. For example, Zyprych-Walczak et al. [99] found that the TMM method worked poorly for them, while Dillies et al. [98] found the TMM and median-of-ratios methods to be the best among the methods compared. Transcript length is another source of bias and leads to the detection of more differential expression in longer transcripts than in shorter transcripts [88].

**3.6. Differential expression analysis**

Several tools and methods have been developed for differential expression analysis, which compares gene expression across conditions (see section 2). Nonparametric methods do not detect differential expression well in the absence of sample replicates, and hence parametric methods are preferred for differential expression analysis [129]. A study comparing various differential expression methods suggests that no single method performs well under all conditions. Compared with other tools, Cuffdiff performed poorly, producing a large number of false positives [130]. The detection of differentially expressed genes is statistically more reliable and more meaningful if multiple replicates are used in the analysis.

As with normalization, picking the best tool for differential analysis is tricky, because there is no consensus on a tool suited to all experimental setups. Soneson and Delorenzi [106] found that limma performed well under many conditions but required at least three replicates. Furthermore, they found that limma performed worse when dispersion differed between the two conditions. They also observed that with large sample sizes DESeq was overly conservative, while edgeR produced a large number of false positives.

**3.7.** *De novo* **assembly**

The performance and accuracy of *de novo* transcriptome assembly depend largely on the complexity of the genome (e.g., genome size, number of paralogs, ploidy level), differential read coverage of the sequenced data, and sequencing error. Transcriptome assembly is complex and differs from genome assembly, in which read coverage is uniform. In contrast, in RNASeq, the abundance of reads varies with gene expression, in which case isoforms [...]

**3.8. Deep sequencing versus cost**

Another challenge associated with RNASeq technology is read coverage and the cost associated with it. To detect lowly expressed genes or rare variants in the coding region, high read coverage is required. According to Nagalakshmi et al. [10], for a simple organism such as yeast, which does not undergo alternative splicing, 30 million reads are sufficient to observe the genome-wide transcriptome profile. For larger and more complex genomes such as the human genome, however, higher-depth RNASeq data are required to capture the complete transcriptome. Moreover, in a given organism the number of transcripts expressed differs between conditions, and hence the same coverage may not be sufficient to capture all the transcripts expressed under different conditions. Before designing an experiment, one should therefore be aware of both the sequencing depth required and the number of samples to be sequenced. If the aim of the experiment is to detect rare variants or lowly expressed genes, one should go for high coverage of the transcriptome, whereas if the aim is to measure gene expression differences between samples (or conditions), one should consider generating replicate data for statistical power [133].
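The link between sequencing depth and the detection of lowly expressed genes can be illustrated with a rough calculation. The sketch below is not taken from the cited studies: it assumes that reads are sampled independently, so that the number of reads from a transcript follows a Poisson distribution with mean equal to the total read count times the fraction of reads originating from that transcript. The names `detection_probability`, `reads_needed`, and `transcript_fraction` are hypothetical, and the model ignores mappability, transcript length, and library biases.

```python
import math


def detection_probability(total_reads, transcript_fraction, min_reads=1):
    """Probability of observing at least `min_reads` reads from a transcript.

    Assumption (not from the source text): read counts are Poisson with
    mean total_reads * transcript_fraction.
    """
    lam = total_reads * transcript_fraction
    # P(X < min_reads) = sum over k = 0 .. min_reads - 1 of e^-lam * lam^k / k!
    below = sum(
        math.exp(-lam) * lam ** k / math.factorial(k) for k in range(min_reads)
    )
    return 1.0 - below


def reads_needed(transcript_fraction, target=0.95, min_reads=1):
    """Smallest total read count whose detection probability reaches `target`."""
    lo, hi = 0, 1
    # Double the upper bound until it is deep enough.
    while detection_probability(hi, transcript_fraction, min_reads) < target:
        hi *= 2
    # Bisect: the probability increases monotonically with depth.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if detection_probability(mid, transcript_fraction, min_reads) >= target:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Hypothetical transcript contributing one read per million sequenced.
    fraction = 1e-6
    for depth in (1_000_000, 10_000_000, 30_000_000):
        p = detection_probability(depth, fraction, min_reads=10)
        print(f"{depth:>11,d} reads: P(at least 10 reads) = {p:.3f}")
    print("reads for 95% chance of one read:", reads_needed(fraction))
```

Under these assumptions, the depth required grows in inverse proportion to the abundance of the transcript, which is why rare transcripts drive up sequencing cost. A calculation like this informs only the depth side of the trade-off; it says nothing about the number of replicates needed for statistical power.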

Other bioinformatics challenges, such as data retrieval and storage, the unavailability of optimized statistical methods, and the need for high-end compute infrastructure, add to the complexity of transcriptome analysis.
