*2.2.1.3. Choice of annotation source*

The reason for this highly efficient system is believed to be the indexing scheme it uses. Unlike its counterparts, HISAT uses two types of indexes instead of a single index: (i) a whole-genome FM index to anchor each alignment, and (ii) numerous local FM indexes for very rapid extension of these alignments. HISAT is 50 times faster than TopHat2, 12 times faster than GSNAP, and slightly faster than STAR [56]. In addition, HISAT requires an amount of RAM comparable to that of TopHat2 but at most 20% of the RAM that GSNAP or STAR needs. Similar to TopHat2, HISAT also uses Bowtie2 in the back-end. Furthermore, it is the only aligner that can work directly on an SRA file, which eliminates the SRA to FASTQ file conversion requirement.

Considering the options available, selecting the right aligner is a nontrivial task, and several publications compare read aligners. Fonseca et al. [59] published a feature-level comparison of 60 mappers and highlighted the difficulties in determining the best aligner (in terms of accuracy and speed). Other comparative studies include one by Lindner and Friedel [60] on non-spliced aligners and another by Engstrom et al. [61] on spliced aligners.

Answers to the following questions may help to choose a suitable aligner:

**1.** Does the genome sequence belong to a prokaryote (where genes lack introns) or a eukaryote (where genes have introns)?

If the genome is bacterial (an example of a prokaryote), then computationally intensive splice aligners such as TopHat2 or STAR are not required. In this case, non-splice aligners such as Bowtie1, Bowtie2, or BWA are more appropriate because the reads map contiguously to the reference genome. In contrast, for eukaryotic genomes such as human or mouse, where reads can span an exon boundary and therefore part of a read will not map contiguously to the reference genome, it is better to use a splice aligner that can identify splice sites.

**2.** Are the sequence data available in base space or color space format?

If the data are generated on a SOLiD sequencing platform, they will be in color space format, and almost all recently developed tools do not support color space data. In this case, the only available options are aligners such as BWA (versions older than 0.6.x), Bowtie1, and TopHat2.

**3.** Does the aim of the RNASeq experiment include calling variants in transcripts?

In experiments where the aim is to find variants in transcripts, mapping quality plays a crucial role, and hence it is advisable to use only aligners that provide accurate mapping quality. The BWA and STAR aligners are suitable for this purpose; Bowtie1, however, is not, because it does not assign an appropriate quality score to the mapped reads.

Additionally, one should consider the comparative precision and recall statistics and the CPU and RAM requirements of the aligners. Besides the aligner used, the data type itself plays a critical role in the quality of the mapped data. For example, paired-end information improves mapping accuracy and, therefore, paired-end data are favored over single-end data for RNASeq experiments.

The aligned read data generated from aligners mentioned in the previous section are stored in Sequence Alignment/Map (SAM) file format, which is a gold standard to store alignment

120 Next Generation Sequencing - Advances, Applications and Challenges

Depending upon the biological question of interest, one may wish to perform an expression study either on known transcripts only, as per a given annotation catalog, or on a reconstructed transcriptome built using a known reference annotation. The latter enables the quantification of novel genes/isoforms in addition to the known ones. In the first case, the mapped reads and the annotation catalog can be used to assign read counts to each feature (gene/transcript) using a tool like htseq-count [63], followed by statistical analysis to identify the differentially expressed genes/isoforms. In the second case, transcriptome reconstruction is required prior to differential expression analysis. It requires assembly of reads into transcription units using either a reference-based or a *de novo* assembly approach. Given a reference genome and an annotation catalog, tools such as Cufflinks [64, 65] take the reads mapped to the genome and use the spliced reads directly to reconstruct the transcriptome. Cufflinks generates a GTF file that contains the assembled isoforms along with isoform-level relative abundance in Fragments Per Kilobase of exon model per Million mapped fragments (FPKM) units [65].
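To make the FPKM unit concrete, the following is a minimal sketch of the plain count-based definition: fragments assigned to a transcript, scaled per kilobase of transcript length and per million mapped fragments. It is not Cufflinks' abundance estimation, which relies on a statistical model rather than raw counts. The function name `fpkm`, the transcript names, and all numbers are hypothetical and chosen only for illustration.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Count-based FPKM for one transcript (illustrative definition only).

    FPKM = fragments / (length in kb) / (total mapped fragments in millions)
         = fragments * 1e9 / (length in bp * total mapped fragments)
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical toy data: transcript name -> (assigned fragments, length in bp).
toy_counts = {
    "tx_A": (500, 2000),
    "tx_B": (100, 500),
}
total_fragments = 1_000_000  # assumed library size (mapped fragments)

for name, (count, length_bp) in toy_counts.items():
    print(name, fpkm(count, length_bp, total_fragments))
# tx_A 250.0
# tx_B 200.0
```

In this toy example tx_A receives five times as many fragments as tx_B, yet its FPKM is only slightly higher because it is four times longer. This length normalization is what makes values comparable across transcripts within a sample.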

#### *2.2.2. De novo transcriptome assembly*

Building a transcriptome using *de novo* methods is a powerful way to create the transcriptome of a divergent or novel species. Three main features affect the quality of assembled transcripts: a) type of transcript: presence of repeats, polymorphisms, splicing events, and complexity of the organism, e.g., ploidy level and GC content; b) sequencing technology: library preparation and sequencing accuracy; c) bioinformatics workflow: assembly algorithms and annotation. Currently available *de novo* assemblers differ in their sensitivity and specificity of transcript identification, are error-prone, and can produce fused transcripts, splicing errors, and gaps [66]. To enhance sensitivity and specificity, one can take a combined approach that pairs a *de novo* assembly method with a reference-guided approach.
