**3.2. Sequence assembler algorithms**

There are two major types of sequence assembly methods: Overlap-Layout-Consensus assembly and De Bruijn graph assembly. Current efficient and successful sequence assembly programs, including the one employed for the *Xiphophorus* genome assemblies (i.e., ALLPATHS), use the De Bruijn graph as their central data processing structure (De Bruijn-based assemblers are summarized in Table 2).


**Table 2.** De Bruijn-based sequence assembler

66 Next Generation Sequencing - Advances, Applications and Challenges

**Figure 1.** Outline of Illumina genome analyzer sequencing process. (1) Adaptors are annealed to the ends of sequence fragments. (2) Fragments bind to the primer-loaded flow cell, and bridge PCR reactions amplify each bound fragment to produce clusters of fragments. (3) During each sequencing cycle, one fluorophore-attached nucleotide is added to the growing strands. A laser excites the fluorophores in all the fragments that are being sequenced, and an optic scanner collects the signals from each fragment cluster. Then the sequencing terminator is removed and the next sequencing cycle starts.

two adaptor sequences is defined as "insertion size." When the desired sequencing length is longer than the insertion size, the short sequencing read can contain adaptor sequence. This artificial sequence must be trimmed off so that it does not introduce significant sequence errors into the assemblies. Another contaminant, the low-quality base call, has many sources, from equipment to sequencing glitches. The quality of a base call is defined as the Phred quality score (*Q*<sub>Phred</sub> score). If we assign *P* as the base-calling error probability [26], then

$$Q_{\mathrm{Phred}} = -10 \log_{10} P$$

To retain the most usable high-quality sequencing reads, the adaptor sequences are first clipped off, low-quality base calls at the ends of sequencing reads are then trimmed, and finally reads in which a certain percentage of base calls fall below a defined *Q*<sub>Phred</sub> score are filtered out. Several software packages are available to perform these read filtering steps (e.g., fastx\_toolkit: http://hannonlab.cshl.edu/fastx\_toolkit/).
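The Phred relation and the three read-cleaning steps described above (adaptor clipping, trimming of low-quality ends, and filtering by the fraction of good base calls) can be sketched in a few lines of Python. This is only an illustrative sketch, not the fastx\_toolkit implementation: the function names, the exact-match adaptor search, the thresholds (`min_q=20`, `min_fraction=0.8`), and the placeholder adaptor string are all assumptions made for the example. Quality strings are assumed to use the common Phred+33 ASCII encoding.

```python
import math

def phred_to_error_prob(q):
    """Invert Q = -10 * log10(P) to get the base-calling error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality score for a base-calling error probability p."""
    return -10 * math.log10(p)

def decode_quality(qual_string, offset=33):
    """Decode an ASCII quality string (assumed Phred+33) into integer scores."""
    return [ord(c) - offset for c in qual_string]

def clip_adaptor(seq, quals, adaptor):
    """Cut the read at the first exact occurrence of the adaptor (assumption:
    real tools also tolerate mismatches and partial adaptors)."""
    idx = seq.find(adaptor)
    if idx == -1:
        return seq, quals
    return seq[:idx], quals[:idx]

def trim_low_quality_tail(seq, quals, min_q=20):
    """Remove base calls below min_q from the 3' end of the read."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

def passes_filter(quals, min_q=20, min_fraction=0.8):
    """Keep a read only if at least min_fraction of its bases reach min_q."""
    if not quals:
        return False
    good = sum(1 for q in quals if q >= min_q)
    return good / len(quals) >= min_fraction

if __name__ == "__main__":
    # Hypothetical read: 4 genomic bases followed by a placeholder adaptor.
    seq, quals = "ACGTAGATCGG", [40, 40, 30, 20, 12, 10, 8, 5, 2, 2, 2]
    seq, quals = clip_adaptor(seq, quals, "AGATCGG")
    seq, quals = trim_low_quality_tail(seq, quals)
    print(seq, quals, passes_filter(quals))
```

A score of 20 corresponds to an error probability of 1 in 100 and a score of 30 to 1 in 1,000, which is why thresholds in this range are commonly chosen.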

A De Bruijn graph-based assembler begins the assembly process by breaking the sequencing reads into *k*-mers, where a *k*-mer is defined as a sequence of *k* consecutive bases. To build a De Bruijn graph, each *k*-mer is split into two parts, the left (*k*– 1) bases *x* and the right (*k*– 1) bases *y*, and each *x* is joined to its *y* by a directed edge (*x* → *y*). A De Bruijn graph is obtained by taking the *x* and the *y* as nodes and the adjacencies as edges. Each edge corresponds to one *k*-mer, and the two nodes it connects overlap by (*k*– 2) bases. In DNA sequencing, each node can have up to 8 connections: 4 incoming from the upstream sequence and 4 outgoing to the downstream sequence, one for each possible nucleotide. Only the connections actually observed in the sequencing data are recorded in memory. As the sequencing data run through the graph-building algorithm, discrete seed graphs are joined as the reads connecting them are identified. In Figure 2, we present a simplified assembly and a sequence feature that can lead to problems in the sequence assembly process.
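The graph construction described above can be illustrated with a minimal Python sketch. It is a simplified illustration under stated assumptions, not the data structure used by ALLPATHS or any other assembler: the function names and the two toy reads are hypothetical, reads are assumed to be error-free and from a single strand, and each distinct *k*-mer contributes one edge (multiplicities are not stored).

```python
from collections import defaultdict

def kmers(read, k):
    """All k consecutive-base substrings of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of nodes it points to.

    Each k-mer is split into its left (k-1) bases x and right (k-1) bases y,
    and the directed edge x -> y is recorded only when it is observed.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            x, y = kmer[:-1], kmer[1:]
            graph[x].add(y)
            graph.setdefault(y, set())  # make sure sink nodes appear too
    return dict(graph)

if __name__ == "__main__":
    # Two hypothetical overlapping reads; k = 4 gives 3-base nodes.
    graph = build_de_bruijn(["ACGTAC", "GTACGG"], k=4)
    for node in sorted(graph):
        print(node, "->", sorted(graph[node]))
```

Because the successors of a node differ only in their last base, a node can never have more than four outgoing edges, and by the same argument no more than four incoming edges.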

In Figure 2, four short DNA fragments obtained from a randomly sheared 21-nt genome are sequenced. A *k*-mer length of 5 was chosen for this assembly. In the De Bruijn graph, there are 11 balanced nodes, where the indegree equals the outdegree, and two semibalanced nodes, where the indegree differs from the outdegree by one. This graph is directed and connected, and it is considered Eulerian because it has at most two semibalanced nodes. The semibalanced node with more outgoing than incoming edges is taken as the starting site of the assembly, while the other semibalanced node is the end of the assembly. The cyclic edge at the end of the graph illustrates the problem that short-read assemblers face when they encounter repetitive sequence regions. De Bruijn algorithms cannot resolve this problem and will simply ignore it, resulting in gaps in the assembled contigs. In practice, long repeats in the genome are a constant source of assembly issues. A detailed solution will be discussed later in this chapter.

**Figure 2.** Outline of De Bruijn graph construction during the sequence assembly process. A short model genome is sequenced. Four short reads were generated from the template. A *k*-mer length of 5 was chosen for the sequence assembly. For each *k*-mer, the left *k*– 1 and right *k*– 1 bases are represented as nodes in the De Bruijn graph, and all left parts are connected to possible right parts by directed edges. The red digit shows the number of occurrences of each node. The cyclic edge at the rightmost end of the graph causes a gap in the contig assembly. Thus, the final assembly does not fully represent the "repeat" in the genome sequence.
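The notions of balanced and semibalanced nodes, and the way a repeat collapses into a cyclic edge, can be made concrete with a small Python sketch. It does not reproduce the genome of Figure 2 (which is not given in the text); the toy reads, the function names, and the use of Hierholzer's algorithm for the Eulerian walk are assumptions made for illustration, and real assemblers handle repeats in more elaborate ways. As a further assumption, each distinct *k*-mer contributes exactly one edge.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """One directed edge (left k-1 bases, right k-1 bases) per distinct k-mer."""
    kmer_set = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    return sorted((km[:-1], km[1:]) for km in kmer_set)

def classify_nodes(edges):
    """Return (balanced nodes, start candidates, end candidates)."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for x, y in edges:
        outdeg[x] += 1
        indeg[y] += 1
    nodes = set(indeg) | set(outdeg)
    balanced = {n for n in nodes if indeg[n] == outdeg[n]}
    start = sorted(n for n in nodes if outdeg[n] - indeg[n] == 1)
    end = sorted(n for n in nodes if indeg[n] - outdeg[n] == 1)
    return balanced, start, end

def eulerian_walk(edges, start):
    """Hierholzer's algorithm: visit every edge exactly once from start."""
    adj = defaultdict(list)
    for x, y in edges:
        adj[x].append(y)
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if adj[node]:
            stack.append(adj[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """Reconstruct the sequence spelled by a walk over (k-1)-mer nodes."""
    return path[0] + "".join(node[-1] for node in path[1:])

if __name__ == "__main__":
    # Repeat-free toy genome AACGTCTGA, covered by three hypothetical reads.
    edges = de_bruijn_edges(["AACGTC", "CGTCTG", "TCTGA"], k=4)
    balanced, start, end = classify_nodes(edges)
    print(len(balanced), start, end)
    print(spell(eulerian_walk(edges, start[0])))

    # Toy genome ending in a homopolymer repeat: the repeat becomes a
    # self-loop, and the walk spells a shorter sequence than the genome.
    repeat_edges = de_bruijn_edges(["ACGTTTTTT"], k=4)
    print(spell(eulerian_walk(repeat_edges, "ACG")))
```

In the first example the walk recovers the nine-base toy genome. In the second, the run of six T bases is traversed through a single self-loop, so the spelled sequence contains only four of them: the graph alone cannot tell how many times the cycle should be followed.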

Taking ALLPATHS as an example, memory use is estimated at roughly 1.7 bytes per read base, which amounts to about 102 GB of RAM for a 1-Gb genome sequenced at 60× coverage. This level of RAM can readily be provided nowadays. Alternatively, the RAM requirement can be met by sharing memory across computer nodes, or by distributing the workload to different nodes within a computer cluster, which is normally accessible at most universities and research institutions. In addition, the development of cloud computing allows one to gain access to high-speed computer clusters in a pay-as-you-go manner, and there are several recently developed cloud-based sequence assemblers (summarized in Table 3).
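The memory figure quoted above follows from simple arithmetic: a 1-Gb genome at 60× coverage yields 60 × 10<sup>9</sup> read bases, and at 1.7 bytes per read base this is 1.02 × 10<sup>11</sup> bytes. A short Python sketch of the same estimate is given below; the function name is hypothetical, and 1 GB is assumed to mean 10<sup>9</sup> bytes.

```python
def estimate_ram_gb(genome_size_bases, coverage, bytes_per_read_base=1.7):
    """Rough RAM estimate in GB (10**9 bytes, by assumption).

    The default of 1.7 bytes per read base is the ALLPATHS estimate
    quoted in the text; other assemblers will differ.
    """
    total_read_bases = genome_size_bases * coverage
    return total_read_bases * bytes_per_read_base / 1e9

if __name__ == "__main__":
    # 1-Gb genome at 60x coverage, as in the text.
    print(f"{estimate_ram_gb(1e9, 60):.0f} GB")
```

The same function can be used to gauge how the requirement scales, for example with a lower coverage or a larger genome, before deciding between a single large-memory machine and a cluster or cloud resource.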


**Table 3.** Cloud computing-based sequence assemblers
