## **2. Glimpse of protocol and tools used for transcriptome analysis**

#### **2.1 Data collection and processing**

Data collection is a key process for protocol development, decision making, planning and research. Effective collection of transcriptome sequences yields the required outcome and enables prediction of future trends. The quality of the reads plays a major role in the analysis. Quality control of the raw reads involves analyzing sequence quality, GC content, adaptor presence, overrepresented k-mers, and duplicate reads in order to identify sequencing errors, PCR artifacts, or contamination. Acceptable duplication, k-mer, or GC content levels are organism- and experiment-specific, although these values should be uniform among samples in the same experiment. When samples disagree by more than 30%, we advise eliminating the outliers. FastQC [11] is a popular tool for carrying out these analyses on Illumina reads, whereas NGSQC [10] can be used on any platform. In general, read quality declines at the 3′ end of reads, and if it drops too low, bases should be trimmed to improve mapping quality. Low-quality reads can be removed using Trimmomatic or the FASTX-Toolkit [12, 13].
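As a toy illustration of these checks, the sketch below decodes Phred+33 quality strings, computes GC content, and trims low-quality bases from the 3′ end of a read. The function names and the quality threshold are our own, not taken from FastQC or Trimmomatic; real tools implement far more sophisticated strategies (e.g. sliding-window trimming).

```python
# Toy read-level QC in the spirit of FastQC/Trimmomatic checks.
# All names and thresholds here are illustrative, not from the tools.

def phred_scores(qual_string, offset=33):
    """Decode an ASCII (Phred+33) quality string into integer scores."""
    return [ord(c) - offset for c in qual_string]

def gc_content(seq):
    """Fraction of G/C bases in a read."""
    return sum(seq.count(b) for b in "GC") / len(seq)

def trim_3prime(seq, qual, min_q=20):
    """Trim low-quality bases from the 3' end, where quality typically
    declines: keep the longest prefix whose final base scores >= min_q."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]

read, qual = "ACGTGGCCTA", "IIIIIIII#!"   # last two bases are very low quality
print(trim_3prime(read, qual)[0])          # ACGTGGCC
print(gc_content(read))                    # 0.6
```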

#### **2.2 De novo assembly**

One or more contigs are created from partially or completely overlapping reads, and one or more scaffolds are created by joining groups of contigs that may or may not overlap. A single chromosome is then created by joining together groups of overlapping or non-overlapping scaffolds. During contig assembly, reads must share exact subsequences of a certain length, called k-mers, before they can be merged together. Contigs can be connected to one another during scaffold assembly without necessarily overlapping; this is made possible by paired-end sequencing. Scaffolds are connected via a gap-filling, gap-closing, or genome-finishing procedure during the chromosome-construction stage. Using only short-read technology, it can be challenging and occasionally impossible to finish this last stage; despite considerable improvement in this area, the presence of repetitive sequences can prevent gap-filling using only short reads [14]. The ability to generate large amounts of RNA-sequencing data has led to the creation of a variety of reference-based and de novo transcriptome assemblers, each with unique benefits and drawbacks. Reference-based methodologies can only be applied to organisms with complete, well-annotated genomes, even though many transcriptome investigations routinely use them. De novo transcriptome reconstruction from short reads is challenging, and this challenge is exacerbated by alternative splicing, paralogous genes, and the diversity of gene expression levels. rnaSPAdes and Trinity [15, 16] are among the robust tools for performing de novo assembly.
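The overlap requirement behind contig assembly can be sketched with a naive greedy merger. This is purely illustrative (all names are hypothetical): real assemblers such as rnaSPAdes and Trinity build de Bruijn graphs rather than pairwise-merging reads.

```python
def overlap_len(a, b, min_k):
    """Length of the longest suffix of a that matches a prefix of b,
    requiring at least min_k matching bases (the k-mer-style threshold)."""
    for n in range(min(len(a), len(b)), min_k - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_k=3):
    """Repeatedly merge the pair of reads with the largest overlap
    until no pair overlaps by at least min_k bases."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap_len(a, b, min_k)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:          # no sufficient overlap left: stop merging
            break
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

# three overlapping reads chain into a single contig
print(greedy_assemble(["ATGGCGTA", "GCGTACGT", "ACGTTACC"], min_k=4))
# ['ATGGCGTACGTTACC']
```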

#### **2.3 Read alignment**

Typically, reads are mapped to either a genome or a transcriptome. The percentage of reads that are mapped, which serves as a general indicator of sequencing accuracy and of the presence of contaminating DNA, is a crucial measure of mapping quality. In this regard, we expect 70–90 percent of typical RNA-seq reads to map onto the human genome [17], with a sizeable portion of reads mapping equally well to a small number of identical regions (referred to as "multi-mapping reads"). The uniformity of read coverage across exons and the mapped strand are also crucial parameters. A predominant accumulation of reads at the 3′ end of transcripts in poly(A)-selected samples can suggest suboptimal RNA quality in the starting material. A standard tool for alignment is HISAT (hierarchical indexing for spliced alignment of transcripts), an extremely efficient approach for aligning reads from RNA-sequencing studies. HISAT utilizes two types of index for alignment: a comprehensive whole-genome FM index for anchoring initial alignments, and numerous localized FM indexes for rapid extension of these alignments. This indexing methodology is grounded in the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index. Each of the 48,000 local FM indexes in HISAT's hierarchical index for the human genome covers a genomic region of about 64,000 base pairs [18].
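The Burrows-Wheeler/FM-index machinery that HISAT builds on can be illustrated with a minimal exact-match backward search. This is a textbook sketch under simplifying assumptions (uncompressed tables, naive occurrence counting), not HISAT's implementation, which uses compressed, hierarchical indexes.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations of text + '$'."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def fm_index(bw):
    """Build the two tables used by FM backward search:
    C[c]  = number of characters in the text strictly smaller than c,
    occ(c, i) = occurrences of c in bw[:i]."""
    C = {c: sum(1 for x in bw if x < c) for c in sorted(set(bw))}
    def occ(c, i):
        return bw[:i].count(c)
    return C, occ

def count_matches(bw, pattern):
    """Backward search: number of occurrences of pattern in the text."""
    C, occ = fm_index(bw)
    lo, hi = 0, len(bw)
    for c in reversed(pattern):      # process the pattern right-to-left
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

genome = "ACGTACGTTACG"
bw = bwt(genome)
print(count_matches(bw, "ACG"))   # 3
```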

#### **2.4 Quantification**

The primary use of RNA-seq is assessing transcript and gene expression levels. This application relies heavily on counting the reads that align to each transcript sequence, although alternatives such as Sailfish count k-mers in reads without the need for mapping [19]. Gene-level quantification, which quantifies genes rather than transcripts, usually disregards multireads and uses a gene transfer format (GTF) file containing the genomic coordinates of exons and genes. Raw read counts cannot be used to compare expression levels between samples, since they are influenced by factors such as transcript length, total number of reads, and sequencing biases. The within-sample normalization method RPKM (reads per kilobase of exon model per million reads) [20] removes the effects of feature length and library size. This metric and its subsequent modifications, such as FPKM (fragments per kilobase of exon model per million mapped reads), a measure of within-sample normalized transcript expression analogous to RPKM, and TPM (transcripts per million), are the most frequently reported RNA-seq gene expression values. Notably, for single-end reads FPKM and RPKM are interchangeable, and TPM can be derived from FPKM with a straightforward rescaling.
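These definitions can be made concrete with a small calculation on invented counts (for single-end data the `rpkm` values below equal FPKM, so the TPM conversion applies directly):

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of feature per million mapped reads.
    For single-end data this is identical to FPKM."""
    total = sum(counts)
    return [c / (L / 1e3) / (total / 1e6) for c, L in zip(counts, lengths_bp)]

def tpm_from_fpkm(fpkm):
    """TPM is FPKM rescaled so that the values sum to one million."""
    s = sum(fpkm)
    return [f / s * 1e6 for f in fpkm]

counts  = [500, 1500, 8000]     # reads mapped to each gene (illustrative)
lengths = [1000, 3000, 2000]    # gene lengths in bp
vals = rpkm(counts, lengths)
print([round(v) for v in vals])                  # [50000, 50000, 400000]
print([round(t) for t in tpm_from_fpkm(vals)])   # [100000, 100000, 800000]
```

Note that the TPM values sum to one million by construction, which is what makes them comparable across samples with different library compositions.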

#### **2.5 Differential gene expression**

*Recent Advancement on In-Silico Tools for Whole Transcriptome Analysis DOI: http://dx.doi.org/10.5772/intechopen.114077*

#### **Figure 1.**

*Protocol for computational transcriptomics: the figure is inclusive of all the steps in transcriptomics; dataset collection, alignment, assembly, differential gene expression and pathway analysis.*

Conducting differential expression analysis necessitates comparing gene expression values across samples. RPKM, FPKM, and TPM account for sequencing depth, a crucial factor for inter-sample comparisons, either directly or through the effective number of transcripts, which may vary substantially between samples; in other words, they normalize by total or effective counts. Notably, they perform poorly when samples have heterogeneous transcript distributions, because highly and differentially expressed features can distort the count distribution [21]. One important parameter for evaluating differential gene expression is the log2 fold change: a negative value indicates downregulation of a gene, and a positive value indicates upregulation. Among the best-studied tools for differential gene expression are DESeq2, Ballgown, Cuffdiff, and Cufflinks [22–25]. **Figure 1** shows the complete protocol involved in computational transcriptomics.
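The log2 fold change criterion can be illustrated with a minimal calculation. The pseudocount below is our own guard against zero counts, not a feature of any particular tool; DESeq2 and its peers estimate fold changes with full statistical models.

```python
import math

def log2_fold_change(mean_treated, mean_control, pseudocount=1.0):
    """log2 fold change between condition means, with a pseudocount
    added to both sides to avoid division by zero."""
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))

print(log2_fold_change(399, 99))   # 2.0  -> positive: upregulated (4-fold up)
print(log2_fold_change(49, 199))   # -2.0 -> negative: downregulated (4-fold down)
```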

## **3. New era for transcriptome analysis**

#### **3.1 Dataset collection**

#### *3.1.1 Gene Expression Omnibus*

The Gene Expression Omnibus (GEO), managed by the National Center for Biotechnology Information (NCBI), is a leading repository for high-throughput genomics data. It serves as a hub for gene expression studies and is fully accessible to researchers everywhere. Anyone can contribute, keeping the database up to date and rich with varied experimental data. Searching GEO is straightforward thanks to its user-friendly interface, and any dataset, along with detailed metadata, can be downloaded easily. GEO holds more than gene expression data: the repository also stores other omics data and follows community standards, which promotes easy sharing and reuse of information. GEO is critical to progress in genomics research, providing learning tools, supporting research work, and aiding scientific collaboration and discovery.

The Gene Expression Omnibus [26] is an international public repository that provides free access to high-throughput gene expression and functional genomics datasets. It is maintained by the National Center for Biotechnology Information (NCBI) and is supported by the National Library of Medicine (NLM). The raw files in FASTQ format obtained after sequencing are deposited along with descriptions, the experimental design, attributes, and information on the study protocol. The database allows direct retrieval of a specific GEO record and quick access to datasets via appropriate keywords.

The GEO database offers two separate search engines: (i) GEO DataSets and (ii) GEO Profiles. GEO DataSets is used for accessing the datasets of a specific study. The submitter's platform, sample, and series entries comprise the database, which is supplemented with curated Gene Expression DataSet records. Every record has an accession code, title, synopsis, species information, and links to relevant data, enabling thorough and meaningful retrieval.

GEO Profiles simplifies the task of finding gene expression profiles. These profiles are derived from the curated DataSet records. Users can access the name of the gene, the name of the dataset, and a thumbnail chart showing the gene's expression level across the samples of each dataset. This allows quick recognition of whether a gene shows differential expression under different experimental set-ups.

#### *3.1.2 TACITuS*

Researchers have access to numerous RNA-Seq and microarray datasets in several sizable public repositories, of which NCBI GEO and ArrayExpress are among the most popular. These repositories contain large numbers of files, requiring substantial bandwidth and specialized tools for identifying relevant subsets for research purposes; hence, it is not easy to import or modify data from such sources. The web-based TACITuS platform provides quick querying of microarray and NGS archive data, with modules for handling large files, storing them in the cloud, and efficiently extracting data subsets. Additionally, the platform facilitates importing data into Galaxy for further analysis. High-throughput microarray and NGS data analysis often involves processing extensive data from publicly accessible libraries; TACITuS streamlines this pre-processing by automating several modules, enabling efficient and agile management of large data files. It also works in a Galaxy setting and has a user-friendly interface for data analysis [27].

TACITuS is developed using the Laravel framework and employs two databases: MariaDB for swift indexing of available datasets and MongoDB for storing both data and metadata. The data processing pipeline optimizes performance by leveraging R, C++, and PHP. The platform seamlessly integrates information from prominent sources such as NCBI GEO and ArrayExpress with user data. TACITuS offers five essential functionalities: (i) data import, (ii) data selection, (iii) identifier mapping, (iv) data integration, and (v) Galaxy export. A detailed, step-by-step tutorial accessible through the web interface provides in-depth insights into the implementation of these modules. The "Dataset submission" panel within TACITuS lets users import datasets from diverse public transcriptomics resources: users can select a data source such as NCBI GEO, ArrayExpress, or a custom source, specify the dataset accession number, and establish whether the dataset should be classified as public or private. Following submission, the request joins a priority queue, triggering computation once resources become available.

Upon initial acquisition, the files are promptly downloaded, and pertinent data are archived within MongoDB databases, with two crucial indexes: one housing the metadata attributes and the other mapping samples to their respective positions in the expression matrix through unique codes. Furthermore, a Lucene-backed full-text index is constructed for each metadata attribute, enriching the dataset's accessibility for end-users.

For NCBI GEO datasets, TACITuS retrieves the platform descriptor and builds reference links between probes and their counterparts in other annotation systems such as Entrez or Ensembl Gene IDs. The "Selections" panel provides services such as "Map Identifiers," which maps probe identifiers to standard formats such as Entrez, allowing data to be integrated across platforms. The system uses ComBat, gene standardization, and XPN to integrate selected items into one complete dataset; this entails z-score normalization, empirical Bayesian modeling, gene normalization, and similarity-analysis normalization.

**Figure 2** illustrates the principle and working of the tool.

#### *3.1.3 ArrayExpress*

One of the most important online databases for functional genomics datasets is ArrayExpress. Genome-wide gene expression data collected using microarray or next-generation sequencing (NGS) platforms make up most of the data, and ArrayExpress also offers a variety of DNA assays, including ChIP-seq and genotyping. Several assays from a study are typically combined into one experiment, and the definition of an assay depends on the type of investigation: in microarray investigations, an assay corresponds to one hybridization (of biological sample material to an array), while for NGS investigations a read-out (sequencing) of one library constitutes an assay [28].

ArrayExpress is one of the repositories that follow the MIAME (minimum information about a microarray experiment) standards and serves as the main database for published data and data obtained through joint projects. The MGED Society also recommends ArrayExpress, together with GEO and CiBEX, for confidential storage prior to publication; once published, data are openly accessible. In the last 2 years, the ArrayExpress repository has expanded to over 50,000 hybridisations across 1650 experiments. More than 90% of these experiments concern gene expression profiling, and the rest are array-based chromatin immunoprecipitation or comparative genomics studies. The database comprises more than 200 species, with considerable representation of human, mouse, Arabidopsis, yeast and rat.

Three methods are used for data submission to the ArrayExpress repository: web-based submission, spreadsheet submission, and automated export. Firstly, web-based submissions are recommended for up to ∼20 hybridizations, using the MIAMExpress online tool; a batch-loader for larger experiments via this route is in testing and slated for release in 2006. Secondly, experiments of varying types and sizes can be submitted as spreadsheets, employing a template generation system for user convenience. Thirdly, laboratories with local databases can use MAGE-ML or MAGE-TAB formats for automated data export directly to ArrayExpress. The curation process involves meticulous checking for MIAME compliance, including the presence of raw and processed data, the accuracy and completeness of biological information, and data consistency. Journals have asked ArrayExpress to provide a MIAME assessment service; legacy data are set to display MIAME scores in the user interface and to be accessible to reviewers supporting publications. MIAME scoring is currently used internally for Data Warehouse selection [28].

#### **3.2 Alignment**

#### *3.2.1 STAR*

Due to the non-contiguous transcript structure, relatively short read lengths, and continuously rising throughput of sequencing technologies, accurate alignment of high-throughput RNA-seq data is a difficult and not fully resolved problem. Read-length limitations, high mapping error rates, slow mapping speeds, and mapping biases are all problems with many RNA-seq aligners. The Spliced Transcripts Alignment to a Reference (STAR) software was developed around a previously undescribed RNA-seq alignment algorithm that uses a sequential maximum mappable seed search in uncompressed suffix arrays, followed by a seed clustering and stitching procedure, to align the large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset. On a modest 12-core server, STAR outperforms other aligners by a factor of >50 in mapping speed, aligning 550 million 2 × 76 bp paired-end reads to the human genome per hour while also improving alignment sensitivity and precision. STAR is capable of mapping full-length RNA sequences, detecting chimeric (fusion) transcripts, and performing unbiased de novo identification of both canonical and non-canonical splice junctions. The authors experimentally confirmed 1960 novel intergenic splice junctions with an 80–90% success rate using Roche 454 sequencing of reverse-transcription polymerase chain reaction amplicons, confirming the high accuracy of the STAR mapping strategy.

While many RNA-seq aligners extend contiguous short-read mappers, STAR takes a unique approach by directly aligning non-contiguous sequences to the reference genome. The STAR algorithm involves two main steps: seed searching and clustering/stitching/scoring. In the seed-search phase, STAR identifies Maximal Mappable Prefixes (MMPs), akin to concepts in large-scale genome alignment tools. An MMP is the longest substring of a read that exactly matches one or more substrings of the reference genome. STAR's sequential MMP search efficiently detects splice junctions, making it notably faster than comparable tools. This approach, implemented with uncompressed suffix arrays, allows precise splice-junction detection in a single pass, without prior knowledge of junction loci. The binary nature of suffix-array search gives efficient, logarithmic scaling of search times, which is particularly advantageous for large genomes and facilitates accurate alignment of multi-mapping reads [29].
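The MMP idea can be sketched with a naive suffix-array version. This is illustrative only (STAR performs a single descending search over the whole genome rather than the incremental probing below), but it shows how a read that crosses a splice junction yields a prefix that maps and a remainder that restarts the search:

```python
from bisect import bisect_left

def suffix_array(genome):
    """Sorted list of all suffixes (toy version; real tools use
    memory-efficient suffix-array or FM-index structures)."""
    return sorted(genome[i:] for i in range(len(genome)))

def mmp_length(sa, read):
    """Maximal Mappable Prefix: length of the longest prefix of `read`
    that occurs somewhere in the genome, via suffix-array binary search."""
    best = 0
    for n in range(1, len(read) + 1):
        prefix = read[:n]
        i = bisect_left(sa, prefix)
        if i < len(sa) and sa[i].startswith(prefix):
            best = n
        else:
            break
    return best

genome = "ACGTAAGGCTAACGT"   # toy reference
sa = suffix_array(genome)
# a read whose first part maps but whose tail does not continue in place,
# as happens at a splice junction
read = "GGCTATTTT"
print(mmp_length(sa, read))  # 5 -> a new seed search would start at read[5:]
```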

#### *3.2.2 TopHat2*

By sequencing the transcribed RNA molecules in cells, RNA-seq allows a more comprehensive understanding of transcriptional activity. Analyses of RNA-seq data determine which genes are expressed and their abundance in a cell. The first stage is mapping the RNA-seq reads back onto the reference genome, which has its own specific alignment challenges. The gene structures of eukaryotic genomes involve intronic sequences, so RNA-seq alignment software must be able to perform gapped (spliced) alignment with varying intron lengths. In addition, the human genome contains many processed pseudogenes that can cause alignment issues for reads spanning exons. Mature mRNAs have an average size of about 2227 bp per transcript, with an average of 235 bp per exon, and a typical transcript contains around 9.5 exons. The shorter the read length, the higher the alignment complexity, and about twenty percent of junction-spanning reads prove problematic because of very short 'anchors' reaching into the exonic part of a gene. This makes correct alignment difficult, especially for algorithms based on k-mer initial mapping.

One popular spliced aligner used in RNA-sequencing (RNA-seq) research is TopHat. TopHat2, the latest version, can align reads obtained with the most modern sequencing technologies to any reference genome, allowing for variable-length indels. In addition to de novo spliced alignment, TopHat2 is also capable of aligning reads across fusion breakpoints, as may occur after genomic translocation. TopHat2 produces sensitive and precise alignments, even for highly repetitive genomes or in the presence of pseudogenes, by combining the capacity to find novel splice sites with direct mapping to known transcripts [30].

#### **3.3 Assembly**

#### *3.3.1 BinPacker*

Introduced by Heber et al., the splicing graph serves as a foundational concept in BinPacker. BinPacker constructs directed acyclic splicing graphs in which nodes represent exons and edges represent splicing junctions. Although nodes represent contiguous genomic sequences that exclude alternative splicing events, they may not correspond exactly to actual exons, owing to factors such as sequencing errors and low gene expression. The tool builds a splicing graph for each expressed gene by incorporating RNA-seq data, then seeks an optimal cover of the graph's edges by paths, solving each splicing graph through iterations of a bin-packing problem; this provides sufficient evidence for all splicing events and allows full-length recovery of the transcripts. Unlike other assemblers, which assemble transcripts from a de Bruijn graph, BinPacker works on the splicing graph together with coverage information. Under the assumption that every splicing graph corresponds to a single expressed gene, each splicing graph represents all the alternatively spliced transcripts at that locus. Using overlapping sequence reads and junctions, the bin-packing method attempts to optimize the edge-path cover of each splicing graph to recover the set of transcripts that can be assembled [31].
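The idea of recovering transcripts as coverage-weighted paths through a splicing graph can be caricatured with a greedy extractor. This toy is not BinPacker's actual bin-packing formulation, and all names are hypothetical; it only shows how repeatedly peeling off the heaviest path from a splicing DAG yields candidate isoforms.

```python
def extract_transcripts(edges):
    """Toy iterative path extraction over a splicing DAG.
    edges: {(u, v): coverage}. Each round walks from a source node,
    always taking the heaviest remaining out-edge, then subtracts the
    path's bottleneck coverage from its edges; repeats until empty."""
    edges = dict(edges)
    paths = []
    while edges:
        targets = {v for _, v in edges}
        sources = [u for u, _ in edges if u not in targets]
        start = max(sources,
                    key=lambda u: max(c for (a, _), c in edges.items() if a == u))
        node, path, bottleneck = start, [start], float("inf")
        while True:
            outs = [(v, c) for (u, v), c in edges.items() if u == node]
            if not outs:
                break
            node, c = max(outs, key=lambda t: t[1])
            bottleneck = min(bottleneck, c)
            path.append(node)
        for u, v in zip(path, path[1:]):       # consume the path's coverage
            edges[(u, v)] -= bottleneck
            if edges[(u, v)] <= 0:
                del edges[(u, v)]
        paths.append((path, bottleneck))
    return paths

# exons E1..E4; coverages suggest isoforms E1-E2-E3 (7x) and E1-E2-E4 (3x)
graph = {("E1", "E2"): 10, ("E2", "E3"): 7, ("E2", "E4"): 3}
print(extract_transcripts(graph))
# [(['E1', 'E2', 'E3'], 7), (['E1', 'E2', 'E4'], 3)]
```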

#### **3.4 Gene count generation**

#### *3.4.1 featureCounts*

Next-generation sequencing technologies produce vast amounts of short reads that are typically aligned to a reference genome. One crucial aspect of downstream analysis is determining the number of reads associated with each genomic feature, such as exons or genes. This process, known as read summarization, is essential for various genomic analyses but has not been extensively explored in the literature. A notable tool for read summarization is featureCounts, designed to count reads from both RNA and genomic DNA sequencing experiments. The program employs efficient chromosome hashing and feature blocking techniques, resulting in significantly faster performance (approximately tenfold for gene-level summarization) and reduced memory requirements compared to existing methods. featureCounts is versatile, accommodating single- or paired-end reads and offering a range of options tailored to different sequencing applications [32].

featureCounts takes aligned reads in SAM or BAM format and genomic features in GFF or SAF format as input. The read input format is detected automatically, and both the read alignments and the feature annotation must refer to the same reference genome. SAM or BAM files provide detailed alignment information, including chromosome mapping and alignment specifics. Genomic features, specified in GFF or SAF format, include the feature identifier, chromosome name, start and end positions, and strand. The tool supports strand-specific read counting when strand information is provided. It accommodates any number of reference sequences, counting either individual features or groups of features (meta-features), such as genes. Paired or unpaired reads are supported; with paired reads, fragments are counted.

featureCounts ensures precise read assignment by evaluating the mapping location of each base in a read or fragment against genomic features, accounting for gaps such as insertions, deletions, and exon–exon junctions. A hit is registered for any overlap of 1 bp or more between the read or fragment and a feature; meta-features receive hits if any of their component features overlap. Multi-overlap reads, which overlap multiple features or meta-features, can be excluded or counted depending on the experiment type: for RNA-seq, excluding multi-overlap reads is suggested, while counting them is recommended for ChIP-seq, given potential regulatory effects on overlapping genes. Chromosome hashing enables quick matching of reference sequence names for efficient analysis [32].
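The 1-bp overlap rule and the multi-overlap option can be sketched as interval logic alone (a heavy simplification of featureCounts' behavior: no SAM/GFF parsing, no strand handling, and invented gene names):

```python
def overlaps(read, feature):
    """True if the read and feature intervals overlap by >= 1 bp.
    Intervals are (start, end), 1-based inclusive."""
    return read[0] <= feature[1] and feature[0] <= read[1]

def count_features(reads, features, count_multi_overlap=False):
    """Assign each read to overlapping meta-features (e.g. genes).
    features: {gene: [(start, end), ...]} lists of exon intervals.
    Reads hitting several genes are skipped unless count_multi_overlap."""
    counts = {g: 0 for g in features}
    for read in reads:
        hits = [g for g, exons in features.items()
                if any(overlaps(read, e) for e in exons)]
        if len(hits) == 1 or (hits and count_multi_overlap):
            for g in (hits if count_multi_overlap else hits[:1]):
                counts[g] += 1
    return counts

genes = {"geneA": [(100, 200), (300, 400)],
         "geneB": [(380, 500)]}
reads = [(150, 175),   # inside a geneA exon
         (390, 410),   # overlaps both genes -> ambiguous
         (450, 470)]   # inside geneB
print(count_features(reads, genes))          # {'geneA': 1, 'geneB': 1}
print(count_features(reads, genes, True))    # {'geneA': 2, 'geneB': 2}
```

The two calls mirror the RNA-seq recommendation (drop ambiguous reads) versus the ChIP-seq recommendation (count them toward every overlapping gene).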

#### **3.5 Differential gene expression**

#### *3.5.1 NOISeq*

NOISeq is a non-parametric method for analyzing differential expression in RNA-seq data. By contrasting the number of reads for each gene in samples taken under the same condition, NOISeq generates a null, or noise, distribution of count changes. The change in count number between two conditions for a given gene is then evaluated against this reference distribution to determine whether it is most likely noise or genuine differential expression. The method is implemented in two ways: NOISeq-real uses replicates to compute the noise distribution when they are available, while NOISeq-sim simulates them when replication is not possible [33].
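The noise-distribution idea can be sketched in a few lines. This is a heavily simplified caricature of the NOISeq-real scheme: the pseudocount and the single |log-ratio| statistic are our own illustrative choices (NOISeq combines a fold-change and an absolute-difference statistic).

```python
import math

def noise_distribution(replicates):
    """Null distribution of |log2 ratios| between replicates of the
    same condition: within-condition changes are treated as noise.
    replicates: list of per-gene count vectors for one condition."""
    noise = []
    genes = range(len(replicates[0]))
    for i in range(len(replicates)):
        for j in range(i + 1, len(replicates)):
            for g in genes:
                a, b = replicates[i][g] + 1, replicates[j][g] + 1
                noise.append(abs(math.log2(a / b)))
    return noise

def de_probability(count_a, count_b, noise):
    """Fraction of noise changes smaller than the observed change:
    values near 1 suggest genuine differential expression."""
    m = abs(math.log2((count_a + 1) / (count_b + 1)))
    return sum(1 for n in noise if n < m) / len(noise)

# two replicates of the same condition (counts for four genes)
reps = [[100, 50, 7, 203], [110, 48, 9, 199]]
noise = noise_distribution(reps)
print(de_probability(400, 52, noise))   # 1.0  -> change exceeds all noise
print(de_probability(50, 52, noise))    # 0.25 -> change looks like noise
```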

#### **3.6 Automated pipelines**

#### *3.6.1 TALON*

Alternative splicing is well known to regulate gene expression and plays a significant role in both healthy development and disease states. Despite increasingly powerful computational techniques, short-read RNA-seq cannot determine full-length transcript isoforms, even though it is accurate and cost-effective for quantification. Long-read sequencing platforms, such as those from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), avoid the difficulties of short-read transcript reconstruction. TALON, the ENCODE4 pipeline for platform-independent analysis of long-read transcriptomes, tracks both known and novel transcript models as well as their expression levels across datasets, for straightforward investigations as well as bigger initiatives. These characteristics enable TALON users to overcome the limits of short-read data and carry out isoform detection and quantification uniformly on current and upcoming long-read platforms [34].

#### *3.6.2 TCC-GUI*

Differential expression (DE) analysis of RNA-Seq count data is a critical stage in the process, for which the TCC R/Bioconductor package was previously developed. Although this package has the distinctive ability to include a reliable normalization mechanism, only R users have been able to utilize it thus far; hence, for non-R users, a DE analysis alternative to TCC is needed. TCC-GUI is developed in R and packaged as a Shiny application. It includes all of the key TCC features, such as robust normalization for DE pipelines and the creation of simulation data under varied scenarios. Additionally, it includes (i) tools for exploratory analysis, such as the average silhouette score, (ii) visualization tools such as the volcano plot and heatmap with hierarchical clustering, and (iii) a reporting tool that uses R Markdown [35]. **Table 2** gives comprehensive information on traditional and current tools, including their run time and memory usage.

## **4. Concluding remarks**

An essential technique for functional genomics and related fields today is computational transcriptomics. The success of this mature discipline, which uses high-throughput methods such as cDNA microarrays and RNA sequencing, depends significantly on these methods. The field of bioinformatics offers a variety of databases, programmes, and automated pipelines for the processing and statistical analysis of high-throughput data. To help extract biological meaning from experimental data, this chapter presents an overview of key computational components used for RNA-seq data processing. The chapter also gives a thorough overview of the tools and databases that are often utilized in transcriptomics research. It can be concluded that recent advancements in transcriptomics analysis have created a new boom in the next-generation sequencing field, prompting researchers to explore transcriptomics and its applications further in solving complex biological problems.

*Population Genetics – From DNA to Evolutionary Biology*

**Table 2.**
*Comparison of different transcriptome analysis tools.*
