**1. Introduction**

Ongoing advances in sequencing technologies and a drastic drop in the cost of sequencing allow us to obtain genome-wide genetic information for organisms from virtually all kingdoms of life.

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In particular, making large-scale DNA sequencing more affordable and accessible to small-scale laboratories has greatly promoted genomic research on non-model organisms associated with a specific biological question of interest [1, 2]. Despite considerable effort, *de novo* sequencing of an entire genome is still not an easy task, which makes RNA sequencing (hereafter, RNA-Seq)-based transcriptomic analysis appealing for non-model organisms, generally described as having no or limited genomic resources, transcriptomic datasets and molecular tools [3–6]. Among the '-omics' disciplines, RNA-Seq is a high-throughput experimental method that is widely used for identifying functional elements in the genome. In other words, RNA-Seq data are derived directly from functional genomic elements, mostly protein-coding genes. Therefore, analysing the expressed part of the genome by RNA-Seq gives substantial information about the genome-wide structure, profile and dynamics of the transcriptome of a non-model organism. Currently, large-scale sequencing efforts such as 'Fish-T1K (Transcriptomes of 1000 fishes)', '1KITE (1K insect transcriptome evolution)' and '1KP (1000 Plants Project)' have been initiated to serve as valuable sources of information on transcriptome composition and dynamics. Despite the immense potential of RNA-Seq-based methods, particularly in recovering full-length transcripts and spliced isoforms from short reads, accurate results can only be obtained when the procedures are carried out in a step-by-step manner.

Transcriptome Analysis for Non-Model Organism: Current Status and Best-Practices

http://dx.doi.org/10.5772/intechopen.68983

**2.** *De novo* **transcriptome assembly methods and mining transcriptome data for non-model organism**

**2.1. Quality check and pre-processing of raw reads**

Following the sequencing reaction and initial processing, next-generation sequencing instruments generate raw image files that are automatically processed by the instrument's base-calling software to output a massive quantity of raw sequence data in ".fastq" format. The ".fastq" format is a text format containing both the sequence reads and the base-calling information encoded in ASCII characters. The quality at each base of a read (the quality score) can be obtained by converting the ASCII characters into Phred scores (*Q*), which indicate the probability of an erroneous base call. Compelling evidence shows that the minimum Phred score threshold for assembly and alignment is 20 (equivalent to a 99% probability of being correct) for each base in a raw read. Despite remarkable progress in sequencing chemistry and base detection approaches, the instruments can still produce incomplete, erroneous and ambiguous reads. Therefore, a pre-processing step (quality checking and read filtering) is considered an essential prerequisite for *de novo* transcriptome assembly because erroneous and ambiguous bases can often lead to fragmented and misassembled transcripts.

Quality checking and visualization of raw reads (in fastq format) start with the FastQC tool (a stand-alone Java program available at http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). FastQC generates an HTML report containing a number of graphical illustrations that summarize the number and length of raw reads and the duplication rate, but two main components of the report, (i) *per base sequence content* and (ii) *per base sequence quality*, are particularly useful in guiding the pre-processing step. The most popular pre-processing tools are FASTX-Toolkit [15], Trimmomatic [16], Cutadapt [17], NGS QC Toolkit [18] and Qtrim [19]. Regardless of the tool used, common pre-processing steps include: (i) removing adapter sequences, (ii) discarding low-quality reads (*Q* ≤ 20) and ambiguous nucleotides (Ns), (iii) removing short reads (length below 50 base pairs (bp)) and (iv) trimming low-quality bases at both ends of reads (generally the first 10 bp) (**Figure 1**) [20]. After pre-processing, the resulting high-quality reads are ready for downstream analysis: *de novo* transcriptome assembly.

**2.2. A brief glance at** *de novo* **transcript assemblers**

Currently, the length of sequence reads from NGS instruments (e.g. sequencing by synthesis on Illumina HiSeq models) ranges from 150 to 250 base pairs (bp) and, following the quality checking and filtering step, the high-quality sequence reads have to be *de novo* assembled for

Compelling evidence shows that a number of factors affect the *de novo* transcript construction procedure, such as the error-prone and biased (e.g. GC%) nature of sequencing technologies, limitations of assembler algorithms and multi-k-mer approaches [7–9], read length [10], coverage depth of reads [11], pre-processing options of raw reads [12, 13] and transcript complexity of the organism (e.g. sequence variations at terminal regions, alternative splicing, antisense transcription, overlapping genes) [14]. Therefore, the state-of-the-art advancements in methodologies and applications for transcriptome assembly should be meticulously considered while planning a project. As no consensus procedure exists, researchers, mainly in the fields of ecology and evolution, use many different approaches and tools from sequence pre-processing to functional annotation (**Figure 1**). In this context, establishing a guideline that facilitates and standardizes the transcriptome assembly and post-assembly analysis provides a good starting point.

**Figure 1.** An overview of *de novo* transcriptome analysis pipelines, from quality checking and pre-processing to assembly and transcript quantification.
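As an illustration of the Phred-based quality filtering described in Section 2.1, the following minimal Python sketch decodes a fastq quality string into Phred scores and applies the thresholds quoted there (*Q* of 20, minimum length of 50 bp, no Ns). It rests on several assumptions that are not stated in the text: the quality string uses the Phred+33 ASCII offset, the error probability follows the standard relation P = 10^(−Q/10), and the quality of a whole read is summarized by its mean Phred score. All function names are hypothetical, adapter removal is omitted, and the sketch does not reproduce the behaviour of any of the cited tools.

```python
# Illustrative sketch only; names and the Phred+33 offset are assumptions.
PHRED_OFFSET = 33  # assumed encoding (Sanger / Illumina 1.8+ style)


def phred_scores(quality_string, offset=PHRED_OFFSET):
    """Convert an ASCII-encoded quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(q):
    """Probability of an erroneous base call for Phred score q."""
    return 10 ** (-q / 10)


def trim_ends(seq, quals, min_q=20):
    """Trim bases with a Phred score below min_q from both ends of a read."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def keep_read(seq, quals, min_q=20, min_len=50):
    """Return True if a (trimmed) read passes length, N and quality filters.

    Using the mean Phred score as the read-level quality is a simplifying
    assumption; reads with mean Q <= min_q are discarded.
    """
    if len(seq) < min_len:
        return False
    if "N" in seq.upper():
        return False
    mean_q = sum(quals) / len(quals)
    return mean_q > min_q


if __name__ == "__main__":
    # Synthetic 60 bp read: three low-quality bases ('#', Q2) at each end,
    # 54 high-quality bases ('I', Q40) in the middle.
    seq = "ACGT" * 15
    qual = "#" * 3 + "I" * 54 + "#" * 3
    trimmed_seq, trimmed_quals = trim_ends(seq, phred_scores(qual))
    print(len(trimmed_seq), keep_read(trimmed_seq, trimmed_quals))
```

In this sketch, trimming is applied before filtering so that a read is judged on the bases that would actually reach the assembler; a real project would normally rely on one of the dedicated tools listed in Section 2.1 rather than on hand-written code.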
