**2.4. Normalization**

Quality and accuracy of the assembled transcriptome are assessed in several different ways [84, 85]:

**1.** Assembly statistics: Most algorithms generate assembly statistics that include the number of contigs/transfragments generated, the total length of contigs/transfragments and singletons, the size of the assembly (in number of nucleotides), the percentage of reads assembled into transfragments, percent GC content, etc. Assembly statistics provide an overview of the organism's transcriptome.

**2.** Transfragment/contig statistics: These statistics include the lengths of the largest and shortest transfragments, the average and median transfragment length, and the N50 of the assembled transcriptome. N50 is calculated by sorting the contigs in descending order of length and summing their lengths cumulatively; the length of the contig at which the running total first reaches or exceeds 50% of the total assembly size is the N50 value. A large N50 is indicative of a more contiguous assembly.

**3.** Mis-assembly and variations: Some of the major causes of transcriptome mis-assembly are the presence of ambiguous bases, repeat regions, insertions, deletions, SNPs, and chromosomal rearrangements. The percentage of mis-assembled contigs can be calculated by mapping the contigs back to the reference genome. The tool QUAST generates a consolidated report of mis-assembly statistics [84].

**4.** Number of transfragments matching the closest reference genome: Once transcripts are assembled, they can be compared against a closely related species/genome. The assembly is considered to be of high quality if the number of reference transcripts matched by transfragments is high. However, genes that are not expressed, or are expressed at low levels, might not be captured.

**5.** Hybrid or fused transcripts: Hybrid transcripts result from the joining of two or more different transcripts and hence match different locations of the genome. Reasons for hybrid transcript generation include sequencing error, improper trimming of adapters/contaminants from the raw reads, similarity between transcripts, the assembly algorithm's parameters, etc. A low number of hybrid transcripts reflects a better assembly.

**2.3. Quantification**

**Choice of expression unit: CPM, RPKM, FPKM, TPM, or read count**

Once the read data are aligned to the reference genome, gene expression can be quantified by counting reads at the exon, transcript, or gene level. Here are a few possible expression units:

**a.** Read count: Read counts are the number of reads overlapping a genomic feature such as a gene or transcript.

**b.** CPM (Counts Per Million mapped reads): CPMs are read counts divided by the total number of fragments sequenced and multiplied by one million. This unit is used in the differential expression analysis R package edgeR [86].

**c.** RPKM (Reads Per Kilobase of transcript per Million): RPKM for a feature is computed by dividing its read count by its length (in bases) and by the total number of reads sequenced, then multiplying by one billion [12]. It is applicable only to single-end data.
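The CPM and RPKM definitions given for the expression units can be written out as a short sketch. The function names, the gene names, and the example counts, lengths, and library size are all hypothetical and chosen only for illustration; this is not the edgeR implementation, which works from its own library-size estimates.

```python
def cpm(read_count, total_reads):
    """Counts per million: read count scaled by library size, times 1e6."""
    return read_count * 1e6 / total_reads


def rpkm(read_count, feature_length_bp, total_reads):
    """Reads per kilobase per million.

    Dividing by length in bases and by total reads, then multiplying by 1e9,
    is equivalent to dividing by length in kilobases and by millions of reads.
    """
    return read_count * 1e9 / (feature_length_bp * total_reads)


# Hypothetical example data (assumptions, not taken from the text).
counts = {"geneA": 500, "geneB": 1000}
lengths_bp = {"geneA": 1000, "geneB": 4000}
total_reads = 2_000_000

cpm_values = {g: cpm(c, total_reads) for g, c in counts.items()}
rpkm_values = {g: rpkm(c, lengths_bp[g], total_reads) for g, c in counts.items()}

for gene in counts:
    print(gene, cpm_values[gene], rpkm_values[gene])
```

In this toy example geneB has twice the read count of geneA, so its CPM is twice as large, but because it is four times longer its RPKM is half that of geneA. This is the length effect that RPKM is meant to remove.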

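Returning to the transfragment/contig statistics in the assembly-quality list, the N50 calculation can be made concrete with a minimal sketch. It assumes the 50% threshold is taken relative to the total length of the assembled contigs; the function name `n50` and the example lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and their lengths are summed
    cumulatively; the length of the contig at which the running total first
    reaches at least half of the total assembled length is the N50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length


# Hypothetical contig lengths in nucleotides (total 1500, half 750).
example_lengths = [100, 200, 300, 400, 500]
print(n50(example_lengths))
```

With these lengths the cumulative sum over the sorted contigs is 500, then 900, so the threshold of 750 is crossed at the 400-nucleotide contig.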