*2.2.2.2. Choosing the transcriptome assembler*

| Tool name | Algorithm | Read type | Reference |
| --- | --- | --- | --- |
| Trinity | *de Bruijn* graph | Single and Paired end | [78] |
| Velvet-Oases | *de Bruijn* graph | Single and Paired end | [74, 77] |
| SOAPdenovo-Trans | *de Bruijn* graph | Single and Paired end | [80] |
| IDBA-tran | *de Bruijn* graph | Paired end | [178] |
| Trans-ABySS | *de Bruijn* graph | Single and Paired end | [79] |
| EBARDenovo | Extension, Bridging, and Repeat-sensing *de novo* | Paired end | [179] |
| Bayesembler | Bayesian model | Paired end | [180] |
| MIRA | Overlap graph | Single and Paired end | [68] |

**Table 3.** A list of different *de novo* transcriptome assemblers

Overlap-Layout-Consensus (OLC) approach:

The OLC approach was initially developed to reconstruct genomes from Sanger sequence and EST (expressed sequence tag) data. As the name suggests, in the OLC approach the reads are searched for overlapping sequences, which are then merged to create longer sequences. Depending on the volume of data and the complexity of the genome (e.g., repeats), the OLC approach is computationally intensive. Some of the OLC-based assemblers are MIRA [68], Newbler (from Roche/454 Life Sciences), and CAP3 [69]. Assemblers using the OLC approach are better suited to small volumes of data, are not sensitive to repeat-region detection and resolution, and cannot handle the high-depth short-read data generated by sequencers such as Illumina. The Eulerian path assemblers, which are based on *de Bruijn* graph algorithms [70], are more suitable for the high-depth short-read data and are discussed in detail below.

*De Bruijn-*graph-based approach:

A *de Bruijn* graph is a mathematical graph whose nodes are substrings of letters (here, nucleotides) of length k. Nodes are connected if shifting a substring by one nucleotide creates an exact k-1 overlap between the nodes [70]. A *de Bruijn* graph can be created for both small and large sequences. Based on the chosen k-mer length (a k-mer is a nucleotide substring of length k), reads are broken into substrings of length k. From these substrings, a *de Bruijn* graph is generated in which each unique substring represents a node (or vertex), and nodes are connected where the last k-1 nucleotides of the previous sequence overlap the first k-1 nucleotides of the subsequent sequence [71]. Identical k-mer overlaps are merged and counted while the graph is built. Where the assembler finds differences between nodes, the graph branches; where the sequences subsequently become identical and overlap again, the branches rejoin. A single-nucleotide difference between sequences therefore gives rise to a bubble in the graph. In the case of RNASeq data, large bubbles and open-ended branches in the graph suggest alternative splicing and alternative transcription start and end sites. Small bubbles can be due to single-nucleotide variation or sequencing errors [72]. In most *de Bruijn-*graph-based assemblers, k is usually chosen to be an odd number so that no k-mer is identical to its own reverse complement. The chosen size of k-mer has


Choosing an assembly algorithm is difficult because it depends on several factors, such as read type, read length, and the complexity of the genome. Some instrument vendors, such as Roche, provide their own assembly algorithms, e.g., Newbler, which can handle long-read data and the homopolymer issue frequently observed in data generated on the 454 platform. A recent study using peanut plant RNASeq data suggests that Trinity performs better than Trans-ABySS and SOAPdenovo-Trans when raw reads are mapped back to the reconstructed assembly of the polyploid transcriptome [82]. Another study suggested using multiple k-mer lengths, clustering the resulting k-mer assemblies while also identifying the contigs unique to each assembly, in order to extract biological information from a transcriptome assembly more effectively [83].
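As a rough illustration of why assemblies built with different k-mer lengths can differ, the following minimal sketch builds a toy *de Bruijn* graph at several values of k and lists the nodes at which the graph forks. The function names `build_de_bruijn` and `branching_nodes` and the two toy reads are hypothetical and chosen only for this example; the sketch ignores reverse complements, sequencing errors, and coverage filtering, and it is not how Trinity or any other assembler named in this section is implemented.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph whose nodes are k-mers.

    Two k-mers are linked when they occur consecutively in a read, i.e. the
    last k-1 bases of the first equal the first k-1 bases of the second.
    Returns (k-mer counts, successor sets). Illustrative assumption only.
    """
    counts = defaultdict(int)
    edges = defaultdict(set)
    for read in reads:
        # Break the read into all substrings of length k.
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            counts[kmer] += 1  # identical k-mers are merged and counted
        for left, right in zip(kmers, kmers[1:]):
            edges[left].add(right)
    return dict(counts), dict(edges)


def branching_nodes(edges):
    """Return nodes with more than one successor (where the graph forks)."""
    return sorted(node for node, nxt in edges.items() if len(nxt) > 1)


# Made-up reads: two sequences that differ at a single base (T vs. A).
reads = ["ATGGCGTACGT", "ATGGCGAACGT"]

for k in (3, 5, 7):
    counts, edges = build_de_bruijn(reads, k)
    print(k, len(counts), branching_nodes(edges))
```

With these toy reads, k = 3 and k = 5 each give a graph in which the two sequences share a common path and fork at a single node, whereas at k = 7 every k-mer spans the differing base, so the two reads share no node at all. This is one way to see why results from several k-mer lengths may be worth comparing.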
