*2.2.2.1. De novo assembly approaches*

Several algorithms are available for *de novo* transcriptome assembly (Table 3). In *de novo* transcriptome assembly, contigs or transfragments are built from overlapping reads. The assembly process relies either on *de Bruijn* graphs constructed from k-mers, for short reads, or on the overlap-layout-consensus (OLC) approach, for long reads [67].


**Table 3.** A list containing different *de novo* transcriptome assemblers

#### Overlap-Layout-Consensus (OLC) approach:

The OLC approach was initially developed to reconstruct genomes from Sanger sequence and EST (expressed sequence tag) data. As the name suggests, the read data are searched for overlapping sequences, which are then merged to create longer sequences. Depending on the volume of data and the complexity of the genome (e.g., repeats), the OLC approach is computation-intensive. Some of the OLC-based assemblers are MIRA [68], Newbler (from Roche/454 Life Sciences), and CAP3 [69]. Assemblers using the OLC approach are better suited to small volumes of data, are not sensitive in detecting and resolving repeat regions, and cannot handle the high-depth short-read data generated by sequencers such as Illumina. The Eulerian path assemblers, which are based on *de Bruijn* graph algorithms [70], are more suitable for high-depth short-read data and are discussed in detail below.
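The overlap-and-merge idea described above can be sketched in a few lines of Python. This is a minimal illustration and not the algorithm of MIRA, Newbler, or CAP3: it assumes error-free reads on a single strand, uses exact suffix-prefix matches only, and skips the consensus step. The names `overlap_length`, `greedy_olc`, and `min_overlap` are hypothetical.

```python
def overlap_length(a, b, min_overlap=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_olc(reads, min_overlap=3):
    """Repeatedly merge the two sequences that share the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        # All-against-all comparison: the costly step for large read sets.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    olen = overlap_length(a, b, min_overlap)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_olc(reads))
```

Every round compares all pairs of sequences, which hints at why the cost of the approach grows quickly with the number of reads.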

#### *De Bruijn-*graph-based approach:

A *de Bruijn* graph is a directed graph whose nodes are substrings of letters (here nucleotides) of length k. Two nodes are connected if shifting one substring by a single nucleotide produces the other, that is, if they share an exact overlap of k-1 nucleotides [70]. A *de Bruijn* graph can be built for small as well as large sequences. For a chosen k-mer length (a k-mer is a nucleotide substring of length k), each read is broken into all of its overlapping substrings of length k. In the resulting graph, each unique k-mer is a node (or vertex), and two nodes are connected when the last k-1 nucleotides of the first match the first k-1 nucleotides of the second [71]. Identical k-mers are merged into one node and counted while the graph is built. Where the reads differ, the graph branches; where they become identical again, the branches rejoin. A single-nucleotide difference between reads therefore gives rise to a bubble in the graph. In RNA-Seq data, large bubbles and open-ended branches suggest alternative splicing and alternative transcription start and end sites, whereas small bubbles can result from single-nucleotide variation or sequencing errors [72].

In most *de Bruijn*-graph-based assemblers, k is usually set to an odd number so that no k-mer can be identical to its own reverse complement. The chosen k-mer size has a great impact on the assembly process: a large k-mer makes overlaps more specific and can yield a less ambiguous *de Bruijn* graph, but is computationally intensive, whereas a small k-mer can result in a fragmented assembly. Previous studies have found that smaller k-mers can give a more accurate transcriptome assembly for lowly expressed genes, whereas larger k-mers perform better for abundant transcripts [73-75].
It is therefore essential to identify the optimal k-mer for the sequence being assembled; the optimum depends to a large extent on the read length, sequencing depth, sequencing error rate, and the complexity of the genome. Additionally, by using the directionality of reads from paired-end data, assemblers can generate a more accurate assembly than with single-end data [76]. Some of the most commonly used *de Bruijn*-graph-based assemblers are Velvet/Oases [74, 77], Trinity [78], Trans-ABySS [79], and SOAPdenovo-Trans [80].
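The graph construction described above can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the implementation of any of the assemblers listed: nodes are k-mers, edges are added only between k-mers that are adjacent in a read, reads are treated as single-stranded, and no error correction is done. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product


def kmers(read, k):
    """Yield every substring of length k in the read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Return (k-mer counts, successor sets) for a node-centric graph.

    An edge u -> v is added when v directly follows u in a read, so the
    last k-1 bases of u equal the first k-1 bases of v.
    """
    counts = Counter()
    edges = defaultdict(set)
    for read in reads:
        previous = None
        for kmer in kmers(read, k):
            counts[kmer] += 1  # identical k-mers collapse into one node
            if previous is not None:
                edges[previous].add(kmer)
            previous = kmer
    return counts, edges


def branch_nodes(edges):
    """Nodes with more than one successor: where the graph diverges."""
    return sorted(n for n, nxt in edges.items() if len(nxt) > 1)


def merge_nodes(edges):
    """Nodes with more than one predecessor: where branches rejoin."""
    indegree = Counter(v for nxt in edges.values() for v in nxt)
    return sorted(n for n, deg in indegree.items() if deg > 1)


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_self_reverse_complement(k):
    """Count k-mers that are identical to their own reverse complement."""
    return sum(1 for p in product("ACGT", repeat=k)
               if "".join(p) == reverse_complement("".join(p)))


# Two reads that differ by a single nucleotide (C versus A at position 7).
reads = ["ACGTTGCAGGTCA", "ACGTTGAAGGTCA"]
counts, edges = build_de_bruijn(reads, k=5)
print(branch_nodes(edges))  # ['CGTTG']: the bubble opens here
print(merge_nodes(edges))   # ['AGGTC']: the bubble closes here
print(count_self_reverse_complement(4), count_self_reverse_complement(5))
```

In this example the single-nucleotide difference produces two parallel paths of five k-mers each between `CGTTG` and `AGGTC`, which is the bubble described in the text. The last line prints `16 0`: with an even k some k-mers equal their own reverse complement, with an odd k none do.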

Oases: Oases is a set of algorithms that post-process the assemblies generated by Velvet at different k-mers. These include dynamic filtering of noise, resolution of alternatively spliced transcripts, and merging of the multiple assemblies generated with different k-mers into one complete assembly (www.ebi.ac.uk/~zerbino/oases/). Oases works well for error correction and repeat resolution in the case of paired-end data.

Trinity: Trinity produces a transcriptome assembly in three steps: Inchworm, Chrysalis, and Butterfly. Inchworm builds an initial set of contigs using k-mer graphs. Chrysalis groups these contigs and builds *de Bruijn* graphs from them. Butterfly simplifies and resolves the graphs to generate the final set of transcripts, including splice variants and isoforms.
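As a rough illustration of how contigs can be built from k-mers, the sketch below grows one contig by greedy extension: it starts from the most abundant k-mer and repeatedly appends the most abundant unused k-mer that overlaps by k-1 bases. This is only loosely in the spirit of how Inchworm is usually described; it is not Trinity's code, it ignores reverse complements and error filtering, and the name `greedy_contig` is hypothetical.

```python
from collections import Counter


def greedy_contig(reads, k):
    """Grow a single contig from the most abundant k-mer, in both directions."""
    counts = Counter(read[i:i + k]
                     for read in reads
                     for i in range(len(read) - k + 1))
    seed, _ = counts.most_common(1)[0]
    used = {seed}  # each k-mer is used once, so the loops always terminate
    contig = seed

    # Extend to the right, one base at a time.
    while True:
        options = [contig[-(k - 1):] + base for base in "ACGT"]
        options = [o for o in options if o in counts and o not in used]
        if not options:
            break
        best = max(options, key=lambda o: counts[o])
        used.add(best)
        contig += best[-1]

    # Extend to the left in the same way.
    while True:
        options = [base + contig[:k - 1] for base in "ACGT"]
        options = [o for o in options if o in counts and o not in used]
        if not options:
            break
        best = max(options, key=lambda o: counts[o])
        used.add(best)
        contig = best[0] + contig

    return contig


# Three copies of one transcript fragment plus one read with an error.
reads = ["ATGGCGTACCTGA"] * 3 + ["ATGGCGTTCC"]
print(greedy_contig(reads, k=5))
```

At the point where the erroneous read diverges, the better-supported k-mer is chosen, so the low-coverage branch is left out of the contig.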

Trans-ABySS: Trans-ABySS considers multiple assemblies generated by ABySS to optimize the final assembly and handles varying transcript coverage very well.

SOAPdenovo-Trans: SOAPdenovo-Trans is derived from the genome assembler SOAPdenovo2 [81] and is known to construct transcriptomes faster than the assemblers mentioned above.
