**1. Introduction**

A vast portion of the mammalian transcriptome is composed of non-protein-coding transcripts, or non-coding RNA (ncRNA). Some ncRNAs are processed into functionally important transcripts such as microRNA (miRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), small nucleolar RNA (snoRNA), small nuclear RNA (snRNA), small interfering RNA (siRNA), PIWI-interacting RNA (piRNA), circular RNA (circRNA), long non-coding RNA (lncRNA) and several classes with limited information about their functions. In addition to the well-described ncRNA classes, clusters of ncRNA (22–200 nucleotides (nt)) were detected at the 5′ and 3′ ends of human and mouse genes, and named promoter-associated short RNAs (PASRs) and termini-associated short RNAs (TASRs) [1]. Mercer et al. [2] described a class of ncRNA, about 50–200 nt, that are processed from the 3′UTRs of protein-coding genes (uaRNAs). The uaRNAs are in sense direction to the protein-coding gene and show stage-, sex- and subcellular-specific expression. Another class of ncRNA is derived from tRNA precursors and named tRNA-derived RNA fragments (tRFs) or tRNA-derived small RNAs (tsRNAs); some appear to be processed by Dicer while others are processed independently of Dicer [3, 4]. Small nucleolar RNAs (snoRNAs) can also be processed into small miRNA-like molecules called sno-derived RNAs or sdRNAs [5, 6], which play roles in guiding enzymes to target RNAs for modification [7]. In this chapter, only the main classes of functional ncRNAs (miRNA, snoRNA, siRNA, piRNA and lncRNA) will be discussed further; the translation-related ncRNAs (rRNA and tRNA) are not considered. NcRNAs have been implicated in many biological processes including transcriptional interference, translational modifications, mRNA cleavage, epigenetic modifications, regulation of structural organization, modulation of alternative splicing, provision of small RNA precursors, and endogenous or secondary siRNA generation [7–10].

© 2016 The Author(s). Licensee InTech. © 2017 The Author(s). Licensee InTech. © 2017 Her Majesty the Queen in Right of Canada, as represented by the Minister of Agriculture and Agri-Food Canada. Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Transcriptome Analysis of Non‐Coding RNAs in Livestock Species: Elucidating the Ambiguity

http://dx.doi.org/10.5772/intechopen.69872


#### **2. Transcriptome analysis of non‐coding RNA**

#### **2.1. Platforms for transcriptome analysis of non‐coding RNA**

Transcriptome analysis reached a turning point in its history with the arrival of high-throughput next-generation sequencing technologies like RNA-Sequencing (RNA-Seq) [11, 12]. Before this time, microarray was the gold standard for transcript profiling, or simultaneous measurement of the expression level of thousands of genes in a given sample [13, 14]. Microarray technology, however, has major drawbacks such as non-specific probe hybridization signals and errors in background level measurements [15], as well as limited gene diversity, since probes are designed to represent only a set of preselected genes. The unique hybridization properties of each probe may affect their dynamic range and thus create bias in data processing algorithms [16]. The flexibility offered by RNA-Seq technology enables detection of unknown splice junctions [17], novel transcripts [18], new single nucleotide polymorphisms (SNPs) [19] and many other features, all in the same assay. RNA-Seq technology has taken the possibility of fine-tuning our knowledge of the transcriptome to a much higher level. In recent years, RNA-Seq has proved its worth as a technology that will replace microarray in whole-genome transcript profiling [20–22]. Comparison of RNA-Seq to RNA-Seq differential gene expression data resulted in better overlap than comparison of RNA-Seq to microarray data [23, 24], supporting RNA-Seq as the preferred method to analyze the transcriptome. Moreover, correlation of transcriptome quantification by the two methods versus transcript level measured by shotgun mass spectrometry showed better estimation with RNA-Seq analysis [25]. As RNA-Seq technology has evolved, new aspects have been included, such as allele-specific transcriptome analysis. Moreover, since the RNA-Seq procedure does not rely on known genome annotation, but rather on all the information available in a given sample, there is clear opportunity to make discoveries at a rate never expected before.

A diversity of platforms offers a wide range of RNA-sequencing possibilities [12]. For example, Illumina HiSeq and MiSeq technologies offer short sequence reads (36–300 base pairs (bp)) while Oxford Nanopore can reach sequence lengths of greater than 150 kilobase pairs (kb) [26]. Some sequencing techniques are DNA-polymerase dependent (i.e. sequencing-by-synthesis, e.g. Illumina MiSeq/HiSeq) while others, like PacBio and Oxford Nanopore, are single-molecule sequencers. The sequencing error rate ranges from 0.1% (Illumina MiSeq/HiSeq) to about 13% (PacBio RS II single pass). An overview of sequencing platforms and their characteristics is shown in **Table 1**. Because the error rate varies between platforms [27], it is important to take it into account, especially when the goal is to sequence short transcripts like miRNA.
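To illustrate why the error rate matters most for short transcripts, the following minimal sketch estimates how many sequencing errors to expect in a single read and how likely a read is to be error-free. It assumes independent errors at a uniform per-base rate, which is a simplification; the 22-nt read length is an illustrative value for a mature miRNA, the two rates are the approximate extremes listed in Table 1, and the function names are hypothetical.

```python
def expected_errors(read_length: int, per_base_error: float) -> float:
    """Expected number of miscalled bases in one read."""
    return read_length * per_base_error


def prob_error_free(read_length: int, per_base_error: float) -> float:
    """Probability that a read contains no error at all.

    Assumes errors are independent and equally likely at every position
    (a simplification of real platform error profiles).
    """
    return (1.0 - per_base_error) ** read_length


if __name__ == "__main__":
    mirna_length = 22  # illustrative length of a mature miRNA, in nt
    # Approximate per-base error rates taken from Table 1.
    for label, rate in (("~0.1% per base", 0.001), ("~13% per base", 0.13)):
        print(
            label,
            "expected errors:", round(expected_errors(mirna_length, rate), 3),
            "P(error-free read):", round(prob_error_free(mirna_length, rate), 3),
        )
```

Under these assumptions almost all 22-nt reads are error-free at a 0.1% rate, whereas only a small minority are at a 13% rate, which is why a single miscalled base can be enough to confuse closely related miRNAs on high-error platforms.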

The challenges of managing RNA-Seq data are considerable in terms of data storage and analysis as well as algorithm development. Since the technology is not yet fully mature, shortcomings exist at every step of sequence analysis. Various tools are available for alignment of reads, transcript construction, quantification, differential gene expression, pathway and correlation analyses [28] (**Tables 2** and **3**). Nonetheless, the use and specificity of these software tools differ greatly from one type of analysis to another, and the hardest part is making sure that the right tool is chosen at every step. A review of best practices for RNA-Seq data analysis was published recently [29]. The gap between the rapid evolution of RNA-Seq technology and the development of data analysis tools is hindering wide application in livestock species. Most data analysis tools are developed for use with the genomes of human and common model organisms (mouse, rat) and require tweaking before use with livestock genomes. For example, when performing target prediction analysis for newly discovered transcripts, it is common practice to use human/mouse databases because they bring a lot of power to the analysis. However, considerable bias arises from the assumption that livestock biological systems are identical to those of human or mouse.

#### **2.2. Generation of ncRNA sequence data and pre‐mapping quality control**

#### *2.2.1. Generation of ncRNA sequence data*


Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health


The choice of the sequencing platform is critical to attain the goals of a study. Numerous protocols and commercial kits to generate cDNA libraries from RNA samples are available and they are mostly based on the same principles (e.g. fragmentation, reverse transcription, adapter ligation and amplification). The steps in library preparation for lncRNA are the same as for mRNA since they share similar biogenesis pathways. The starting material for lncRNA library preparation is total RNA. The majority of lncRNA transcripts have poly-A tails while a small proportion do not. Library preparation methods based on poly-A tail selection are cheaper but less robust, since transcripts without poly-A tails are lost. An ideal but more expensive method involves depletion of rRNA (which constitutes ~90% of total RNA). Library preparation with rRNA-depleted total RNA is robust as it allows quantification of all other RNA transcripts, including lowly expressed transcripts. Thus, the first step in lncRNA library preparation is to consider whether to perform poly-A tail selection or to deplete rRNA (**Figure 1**). The next dilemma is deciding whether or not to preserve strand information during library preparation. As lncRNA annotation is still in the initial phase, it is crucial to preserve strand information to enable correct genome localization of novel transcripts. Paired-end sequencing should be preferred over single-end sequencing for lncRNA characterization to facilitate construction of transcripts with clear-cut exon boundaries. Paired-end sequencing also allows accurate detection of splicing positions. Sequencing long fragments (>100 bp) is also desirable to get adequate coverage of the genome and, consequently, better transcript construction. The number of multiplexed samples on each sequencing lane affects lncRNA sequence depth. Reducing cost by multiplexing more samples than necessary reduces the quality of the results obtained. It has been demonstrated that the required depth of sequencing depends on the nature of the expected results [30, 31]. To accomplish lncRNA discovery with confidence, a minimum of 100 million reads per sample is suggested to enable *de novo* transcript assembly.

| Platform | Read length¹ (base pair) | Throughput² | Number of reads³ | Error profile |
|---|---|---|---|---|
| Illumina MiniSeq (high output) | 75 (SE) | 1.6–1.8 Gb | 22–25 M | <1%, substitution |
| | 75 (PE) | 3.3–7.5 Gb | 44–50 M | <1%, substitution |
| | 150 (PE) | 6.6–7.5 Gb | 44–50 M | |
| Illumina MiniSeq (mid output) | 75 (SE) | 2.1–2.4 Gb | 14–16 M | <1%, substitution |
| Illumina MiSeq v2 | 36 (SE) | 540–610 Mb | 12–15 M | <0.1%, substitution |
| | 25 (PE) | 750–850 Mb | 24–30 M | <0.1%, substitution |
| | 150 (PE) | 4.5–5.1 Gb | 24–30 M | <0.1%, substitution |
| | 250 (PE) | 7.5–8.5 Gb | 24–30 M | <0.1%, substitution |
| Illumina MiSeq v3 | 75 (PE) | 3–4 Gb | 44–50 M | <0.1%, substitution |
| | 300 (PE) | 13–15 Gb | 44–50 M | <0.1%, substitution |
| Illumina NextSeq 500/550 (high output) | 75 (SE) | 25–30 Gb | 400 M | <1%, substitution |
| | 75 (PE) | 50–60 Gb | 800 M | <1%, substitution |
| | 150 (PE) | 100–120 Gb | 800 M | <1%, substitution |
| Illumina NextSeq 500/550 (mid output) | 75 (PE) | 16–20 Gb | ~260 M | <1%, substitution |
| | 150 (PE) | 32–40 Gb | ~260 M | <1%, substitution |
| Illumina HiSeq 2500 v2 Rapid run | 36 (SE) | 9–11 Gb | 300 M | 0.1%, substitution |
| | 50 (PE) | 25–30 Gb | 600 M | 0.1%, substitution |
| | 100 (PE) | 50–60 Gb | | 0.1%, substitution |
| | 150 (PE) | 75–90 Gb | | 0.1%, substitution |
| | 250 (PE) | 125–150 Gb | | 0.1%, substitution |
| Illumina HiSeq 2500 v3 | 36 (SE) | 47–52 Gb | 1.5 B | 0.1%, substitution |
| | 50 (PE) | 135–150 Gb | 3 B | 0.1%, substitution |
| | 100 (PE) | 270–300 Gb | | 0.1%, substitution |
| Illumina HiSeq 2500 v4 | 36 (SE) | 64–72 Gb | 2 B | 0.1%, substitution |
| | 50 (PE) | 180–200 Gb | 4 B | 0.1%, substitution |
| | 100 (PE) | 360–400 Gb | | 0.1%, substitution |
| | 125 (PE) | 450–500 Gb | | 0.1%, substitution |
| Illumina HiSeq 3000/4000 | 50 (SE) | 105–125 Gb | 2.5 B | 0.1%, substitution |
| | 75 (PE) | 325–375 Gb | | 0.1%, substitution |
| | 150 (PE) | 650–750 Gb | | 0.1%, substitution |
| Illumina HiSeq X | 150 (PE) | 800–900 Gb | 2.6–3 B | 0.1%, substitution |
| | 150 (PE) | 167 Gb–6 Tb | 1.6–20 B | |
| Ion Proton | 200 (SE) | Up to 10 Gb | 60 M | 1%, indel |
| Ion PGM 318 | 200 or 400 (SE) | 0.6–2 Gb | 4–5.5 M | 1%, indel |
| Ion PGM 316 | 200 or 400 (SE) | 0.3–1 Gb | 2–3 M | 1%, indel |
| Ion PGM 314 | 200 or 400 (SE) | 30–100 Mb | 0.4–0.5 M | 1%, indel |
| PacBio Sequel | 8–12 kb (SE) | 3.5–7 Gb | >100,000 | N/A |
| PacBio RS II | ~20 kb | 0.5–1 Gb | ~55,000 | ~13%, indel |
| 454 GS Junior | ~400 (SE, PE) | 35 Mb | ~0.1 M | 1%, indel |
| 454 GS Junior+ | ~700 (SE, PE) | 70 Mb | ~0.1 M | 1%, indel |
| 454 GS FLX Titanium XLR70 | Up to 600; 450 mode (SE, PE) | 450 Mb | ~1 M | 1%, indel |
| 454 GS FLX Titanium XL+ | Up to 1000; 700 mode (SE, PE) | 700 Mb | ~1 M | 1%, indel |
| SOLiD 5500 xl | 50 or 75 (SE) | 160–320 Gb | ~1.4 B | ≤0.1%, AT bias |
| SOLiD 5500 Wildfire | 50 or 75 (SE) | 80–160 Gb | 700 M | ≤0.1%, AT bias |
| Oxford Nanopore MK1 MinION | Up to 200 Kb | ~1.5 Gb | | ~12%, indel |
| Oxford Nanopore GridION X5 | ~Hundreds of Kb | 100 Gb | | |
| Oxford Nanopore PromethION | | ~4 Tb | | |

¹ SE: single end, PE: paired end, Kb: kilobase pair.

² Mb: megabase, Gb: gigabase, Tb: terabase.

³ M: million, B: billion.

**Table 1.** Overview of some sequencing platforms for transcriptome analysis and their characteristics.


| Step | Tool | Application | Web link |
|---|---|---|---|
| Trimming\* | Trimmomatic | Illumina single-end and paired-end quality and adapter trimming. | http://www.usadellab.org/cms/?page=trimmomatic |
| | PEAT | Specific for paired-end sequencing quality and adapter trimming. | https://github.com/jhhung/PEAT |
| | Trim Galore | Quality and adapter trimming with some extra functionality for Bisulfite-Seq. | https://www.bioinformatics.babraham.ac.uk/projects/trim\_galore |
| | Skewer | Adapter trimming; can take indels into account. | https://github.com/relipmoc/skewer |
| | AlienTrimmer | Detects and removes alien k-mers at both ends of sequence reads. | ftp://ftp.pasteur.fr/pub/gensoft/projects/AlienTrimmer/ |
| | Cutadapt | Finds and removes adapters, primers, poly-A and other types of unwanted sequences. | https://github.com/marcelm/cutadapt |
| | NxTrim | Discards as little sequence as possible from Illumina Nextera Mate Pair reads, single-end and paired-end reads. | https://github.com/sequencing/NxTrim |
| | SeqPurge | Can detect very short adapter sequences. | https://github.com/imgag/ngs-bits/blob/master/doc/tools/SeqPurge.md |
| Alignment\*\* | STAR | Aligns RNA-Seq reads to a reference genome and detects splice junctions. | https://github.com/alexdobin/STAR |
| | Bowtie / Bowtie2 | Aligns short DNA sequences to genomes with a Burrows-Wheeler index. | bowtie-bio.sourceforge.net/bowtie2 |
| | BWA | Maps low-divergent sequences against a large reference genome. | bio-bwa.sourceforge.net |
| | TopHat2 | Uses Bowtie for alignment. TopHat analyzes the results to identify splice junctions. | https://ccb.jhu.edu/software/tophat |
| | Rockhopper | Specific for bacterial RNA-Seq data. It supports de novo and reference-based transcript assembly. | cs.wellesley.edu/~btjaden/Rockhopper |
| | SpliceMap | De novo splice junction discovery and alignment tool. | https://web.stanford.edu/group/wonglab/SpliceMap |
| | Trinity | De novo reconstruction of transcriptomes from RNA-Seq data. | https://github.com/trinityrnaseq/trinityrnaseq/wiki |
| | StringTie | De novo transcript assembly. Quantitation of full-length transcripts representing multiple splice variants for each gene locus. | https://ccb.jhu.edu/software/stringtie |

References for the tools listed: [39, 45, 47, 50–62].

\* Further trimming tools are available at: https://omictools.com/adapter-trimming-category/

\*\* Further alignment tools are available at: https://omictools.com/read-alignment-category/

**Table 2.** Frequently used tools for trimming and alignment.

Tools listed in the Names column: miRDeep, mirTools, UEA sRNA Workbench, sRNAtoolbox, MIReNA, miRExpress and DARIO (miRNA identification); TargetScan, DIANA-microT-CDS, miRanda, miRDB, miRTar, mirWIP, MMIA, PITA, psRNATarget, RNA22 and RNAhybrid (target prediction).

| Reference | Major purpose¹ | Known miRNA annotation² | Novel miRNA discovery | DE analyses | Target prediction | Pathway enrichment | Livestock species |
|---|---|---|---|---|---|---|---|
| [71] | miRNA identification | + | + | + | + | − | + |
| [74] | miRNA identification | + | + | − | − | − | + |
| [76] | miRNA identification | + | + | + | + | − | + |
| [77] | miRNA identification | + | + | + | + | − | + |
| [81] | miRNA identification | + | − | − | − | − | + |
| [93] | miRNA identification | − | + | − | − | − | + |
| [94] | miRNA identification | + | + | − | + | − | − |
| [95] | Target prediction | − | − | − | + | − | + |
| [96] | Target prediction | − | − | − | + | + | − |
| [97] | Target prediction | − | − | − | + | − | + |
| [98] | Target prediction | − | − | − | + | − | + |
| [99] | Target prediction | − | − | − | + | − | − |
| [100] | Target prediction | − | − | − | + | − | − |
| [101] | Target prediction | − | − | − | + | + | − |
| [102] | Target prediction | − | − | − | + | − | + |
| [103] | Target prediction | − | − | − | + | − | − |
| [104] | Target prediction | − | − | − | + | − | + |
| [105] | Target prediction | − | − | − | + | − | + |

**Table 3.** Overview of tools used for the analysis of miRNA sequence data.

The procedure for the generation of miRNA sequence data differs slightly from the procedure for lncRNA analysis. First of all, miRNAs are small (18–24 nt) and do not require RNA fragmentation prior to library construction. Total RNA is the recommended starting material for miRNA library preparation (**Figure 1**). Although some commercial kits provide the option to enrich the miRNA fraction prior to library preparation, there is evidence that some small RNA species are lost during enrichment [32]. The protocols for miRNA library preparation are generally similar to those for lncRNA and include an adapter ligation step, reverse transcription and amplification, followed by size selection and purification of the cDNA. Fifty-bp single-end sequencing is sufficient for miRNA libraries since miRNAs are generally small. Thus, Illumina platforms are well suited for sequencing miRNA libraries. Studies showed that approximately 2 million reads are sufficient for differential expression analysis while 8 million reads are sufficient for discovery analysis [33, 34]. Considering that over 150 million reads are available per lane on HiSeq machines, sample multiplexing can be as high as 18 to 20 libraries per lane.
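The multiplexing figure quoted for miRNA libraries follows from dividing the lane yield by the depth needed per library. The sketch below works through that arithmetic using the numbers given in the text (about 150 million reads per lane, 2 million reads for differential expression, 8 million for discovery, and 100 million for lncRNA discovery); the function name is hypothetical, and real runs lose some reads to quality filtering and uneven pooling, so the result is an upper bound rather than a recommendation.

```python
def max_libraries_per_lane(reads_per_lane: int, reads_per_library: int) -> int:
    """Largest number of libraries that can share a lane at the target depth.

    Assumes reads are split evenly between libraries and that every read
    is usable, so the value is an upper bound.
    """
    if reads_per_library <= 0:
        raise ValueError("reads_per_library must be positive")
    return reads_per_lane // reads_per_library


if __name__ == "__main__":
    lane_yield = 150_000_000  # reads per HiSeq lane, as quoted in the text
    targets = {
        "miRNA differential expression": 2_000_000,
        "miRNA discovery": 8_000_000,
        "lncRNA discovery": 100_000_000,
    }
    for goal, depth in targets.items():
        print(goal, "->", max_libraries_per_lane(lane_yield, depth), "libraries per lane")
```

At the discovery depth this gives 18 libraries for a lane yielding exactly 150 million reads, which is consistent with the 18 to 20 libraries per lane mentioned above when the lane yields somewhat more.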


**Figure 1.** Starting material and sequencing method considerations according to RNA species to be analyzed.


#### *2.2.2. Common data processing steps*

**Table 3.** (continued) Overview of tools used for the analysis of miRNA sequence data.

Tools listed in the Names column: TargetRank, DIANA-mirPath, miRGator v3, MAGIA, miRNet, miRSystem, miRNAMap, miRTarBase, TransmiR, PicTar, miRWalk, MiRecords, multiMiR, miRconnX, DIANA-mirExTra and TarBase.

| Reference | Major purpose¹ | Known miRNA annotation² | Novel miRNA discovery | DE analyses | Target prediction | Pathway enrichment | Livestock species |
|---|---|---|---|---|---|---|---|
| [106] | Target prediction | − | − | − | + | − | − |
| [107] | Down-stream miRNA analyses | − | − | − | + | + | + |
| [108] | Integrated tools | − | − | + | + | + | − |
| [109] | Down-stream miRNA analyses | − | − | − | − | + | − |
| [110] | Down-stream miRNA analyses | − | − | − | − | + | + |
| [111] | Down-stream miRNA analyses | − | − | − | − | + | + |
| [112] | Integrated tools | + | + | − | + | + | + |
| [113] | Integrated tools | + | + | + | + | + | + |
| [114] | Down-stream miRNA analyses | − | − | − | − | + | + |
| [115] | Target prediction | − | − | − | + | − | + |
| [116] | Integrated tools | + | + | + | − | − | + |
| [117] | Integrated tools | + | + | + | − | − | + |
| [118] | Integrated tools | + | + | + | − | − | + |
| [119] | Integrated tools | + | + | + | + | + | − |
| [120] | Down-stream miRNA analyses | − | − | − | − | + | − |
| [121] | Database | + | + | + | − | + | + |

¹ Further tools for miRNA annotation are available at: https://tools4mirs.org/software/known\_mirna\_identification/; further tools for novel miRNA discovery and miRNA precursor prediction are available at: https://tools4mirs.org/software/precursor\_prediction/; further tools for miRNA target prediction are available at: https://tools4mirs.org/software/target\_prediction and https://omictools.com/mirna-target-prediction-category

² "+" Function is included, "−" Function is not included.

Once sequence data are available, many bioinformatics tools are used in the analytical procedures. Some processing steps are optional but strongly recommended, while others are required before the next step can be performed. Many pipelines have been developed to answer specific questions, but the software tools used can be very different. Global views of the general processing steps and frequently used tools for lncRNA and miRNA sequence data analyses are presented in **Figures 2** and **3**, respectively. These processing steps can be modified to include desired or specific tools depending on the research question.

#### *2.2.3. Raw data quality control*

Sequence data generated by Illumina and most other platforms are in FASTQ format. FASTQ is a text-based format consisting of the nucleic acid sequence (read) and a base-calling accuracy score (Phred score) attributed to each base of the sequence. FastQC [35], Picard tools (https://broadinstitute.github.io/picard/) and NGS QC toolkit [36] are often used to assess the quality of raw sequence reads. This step is necessary to determine whether the sequencing outcome is as expected. These tools report the total number of reads, the overall base-call quality by position, GC percentage and other features. Care should be taken when interpreting the results because GC content is species specific and some software tools evaluate GC content according to the human genome. In order to avoid bias in the mapping step, quality trimming is necessary to get rid of low-quality bases and remaining adapter sequences. A recent study showed that incorrect trimming can generate short reads that impair the capacity to correctly predict differences in expression [37]. Several trimming tools are available [38] (https://omictools.com/adapter-trimming-category), including Trimmomatic [39], FASTX-Toolkit [40] and Cutadapt [41] (**Table 2**). Following trimming, filtering of reads is necessary to get rid of very short and overall low-quality reads, keeping the level of bias as low as possible.
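The quality-control steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a replacement for FastQC, Trimmomatic or Cutadapt: it assumes well-formed four-line FASTQ records with Phred+33 quality encoding, uses simple trimming from the 3′ end rather than the sliding-window strategies of dedicated tools, and the function names and thresholds (minimum quality 20, minimum length 18) are hypothetical choices made for the example.

```python
from typing import Iterable, Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 encoding of the quality string


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    records = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(records) - 3, 4):
        header, seq, _separator, qual = records[i:i + 4]
        yield header[1:], seq, qual  # drop the leading '@' of the header


def phred_scores(qual: str) -> List[int]:
    """Decode a quality string into per-base Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in qual]


def error_probability(q: int) -> float:
    """Base-calling error probability implied by a Phred score Q."""
    return 10 ** (-q / 10)


def gc_percent(seq: str) -> float:
    """GC content of a read, in percent."""
    if not seq:
        return 0.0
    gc = sum(1 for base in seq.upper() if base in "GC")
    return 100.0 * gc / len(seq)


def trim_3prime(seq: str, qual: str, min_q: int = 20) -> Tuple[str, str]:
    """Remove bases from the 3' end until one reaches min_q."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def passes_filter(seq: str, qual: str, min_len: int = 18,
                  min_mean_q: float = 20.0) -> bool:
    """Keep reads that are long enough and of adequate mean quality."""
    if not seq or len(seq) < min_len:
        return False
    scores = phred_scores(qual)
    return sum(scores) / len(scores) >= min_mean_q


if __name__ == "__main__":
    # One made-up read whose last two bases have very low quality.
    example = ["@read1\n", "ACGT" * 5 + "AC\n", "+\n", "I" * 20 + "#!\n"]
    for read_id, seq, qual in parse_fastq(example):
        trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
        print(read_id, len(seq), "->", len(trimmed_seq), "nt after trimming;",
              "GC%:", gc_percent(trimmed_seq),
              "kept:", passes_filter(trimmed_seq, trimmed_qual))
```

A Phred score of 20 corresponds to an error probability of 1 in 100, which is why it is a common, though not universal, trimming threshold; the appropriate minimum length depends on the RNA class, since miRNA reads are themselves only about 18–24 nt long.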

Transcriptome Analysis of Non‐Coding RNAs in Livestock Species: Elucidating the Ambiguity

http://dx.doi.org/10.5772/intechopen.69872

**Figure 2.** General processing steps and tools used in lncRNA sequence analysis.

**Figure 3.** General processing steps and tools used in miRNA sequence analysis.

#### *2.2.4. Alignment*

After trimming and filtering, reads are ready for alignment or *de novo* assembly. Alignment consists of mapping reads to a reference genome. Various alignment tools have been developed [42, 43] (https://omictools.com/read-alignment-category), including frequently used tools like TopHat [44], STAR [45], Bowtie [46], StringTie [47], etc. (**Table 2**). Each software has its own specifications, so it is important to understand the utility of each tool and the options it offers. It has been observed that the choice of aligner and of specific options can affect the results of differential gene expression analysis [48]. Aligners can be grouped into two types: gapped (also known as split, e.g. STAR, BWA, etc.) and ungapped (e.g. Bowtie, etc.). Bowtie (ungapped group) can easily map reads to a genome but is less effective at finding splice junctions. Aligners in the gapped group are able to align reads across splice junctions and detect splice variants. In the absence of a reference genome, *de novo* assemblers (e.g. Trinity [49]) can be used. For lncRNA read alignment, gapped aligners are preferred since not all transcripts are annotated, and one portion of a read may align to one position of the genome and the remainder to another position. Alignment is one of the longest steps in RNA-Seq sequence analysis, and selecting the right tool can therefore have a significant impact on the analysis. It is also important to perform mapping quality control following alignment. Quality checks include the percentages of mapped and unmapped reads, the location of the reads (intronic and exonic) and the 5′–3′ coverage.
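As an illustration of the mapping quality control described above, the following Python sketch summarises a SAM alignment file: it counts primary records, the percentage that are mapped, and how many mapped records span a splice junction. It relies on two conventions of the SAM format: flag bit 0x4 marks an unmapped read, and the CIGAR operation `N` marks a region skipped on the reference (an intron in RNA-Seq data). The function names and the toy records are hypothetical, and for paired-end data each mate is counted as a separate record.

```python
import re
from typing import Dict, Iterable

# One CIGAR element: a length followed by an operation code.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")

FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def is_spliced(cigar: str) -> bool:
    """Return True if the CIGAR string contains a skipped region ('N')."""
    return any(op == "N" for _, op in CIGAR_ELEMENT.findall(cigar))


def mapping_summary(sam_lines: Iterable[str]) -> Dict[str, float]:
    """Count primary, mapped and spliced records in SAM-formatted text."""
    total = mapped = spliced = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, cigar = int(fields[1]), fields[5]
        if flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        total += 1
        if flag & FLAG_UNMAPPED:
            continue
        mapped += 1
        if is_spliced(cigar):
            spliced += 1
    percent = 100.0 * mapped / total if total else 0.0
    return {"total": total, "mapped": mapped,
            "spliced": spliced, "percent_mapped": percent}


if __name__ == "__main__":
    toy_sam = [
        "@HD\tVN:1.6",
        "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
        "r2\t0\tchr1\t200\t60\t20M300N30M\t*\t0\t0\t*\t*",  # spliced read
        "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",                 # unmapped read
        "r2\t256\tchr2\t500\t0\t50M\t*\t0\t0\t*\t*",        # secondary hit
    ]
    print(mapping_summary(toy_sam))
```

An ungapped aligner never produces the `N` operation, which is why reads crossing exon boundaries are typically lost or misplaced with that group of tools.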

#### *2.2.5. Transcript construction and quantification*

RNA-Seq transcript construction and the alignment steps can demand considerable computing time. Many transcript construction tools are available (https://omictools.com/transcript-quantification-category), including commonly used tools like Cufflinks [63], iReckon [64], StringTie [47], etc. This step requires paired-end data and high sequence coverage to reconstruct lowly expressed transcripts. With the assumption that transcripts are species specific, raw data or alignment files from all samples from the same population can be merged to increase coverage [65]. This modification helps clarify transcript boundaries in the case of *de novo* transcript assembly. Particular considerations for lncRNA transcript construction include sample pooling according to species and tissue type, since lncRNA expression is known to demonstrate tissue specificity [66–68].

#### *2.2.6. miRNA processing steps*

Overall, the procedures for miRNA identification and discovery are less time consuming and include fewer steps than those for mRNA and lncRNA identification. The global process includes quality and adapter trimming, with quality checkpoints before and after each step. A size selection to keep sequences between 17 and 30 nt (sometimes up to 35 nt) is often performed right after the quality and adapter trimming step. This is followed by read mapping and filtering out of other RNA sequences (rRNA, tRNA, snRNA, mRNA, lncRNA, etc.). The reads thought to represent miRNAs are analyzed with miRNA prediction tools like miRDeep2 [69], miRanalyzer [70], mirTools 2.0 [71], etc. (**Table 3**). Subsequent interrogation of the miRBase database enables classification of retained miRNAs as known or novel. A tool like miRDeep2 has a quantifier module that generates a read count table for each miRNA using precursor and mature sequence files as input. An overview of tools for miRNA identification is presented in **Table 3** and further discussed in the next section.

**3. Tools for ncRNA identification**

**3.1. Tools for miRNA identification**

The identification of miRNAs can be either annotation of known miRNAs or discovery of novel miRNAs. A variety of algorithms and bioinformatics tools are applied to annotate known miRNAs as well as to discover new miRNAs from sequence data. These tools can use several features, such as sequence conservation among species and structural features like the hairpin structure and minimal folding free energy [72]. Many tools are available for miRNA annotation (https://tools4mirs.org/software/known\_mirna\_identification/) [73], including frequently used tools like miRDeep [74], miRanalyzer [75], mirTools 2.0 [71], UEA sRNA Workbench [76], sRNAtoolbox [77], and SeqBuster [78] (**Table 3**). Many more tools have been developed for novel miRNA discovery and miRNA precursor prediction (https://tools4mirs.org/software/precursor\_prediction/) [73], including frequently used tools like MiPred [79], miRanalyzer [75], miR-Abela [80], MiReNA [81], UEA sRNA Workbench [76] and miRDeep [74] (**Table 3**). Major features of miRNA discovery tools have been reviewed [82–84]. Regarding livestock species, the choice of methods for miRNA discovery and novel miRNA annotation varies among studies and species. For example, De Vliegher et al. [85] used miRBase [86] and UNAFold [87] for miRNA annotation and discovery in bovine mammary gland tissues, while Peng et al. [88] used miRBase [86] and RNAfold [89] for these purposes in porcine mammary glands. In our own studies, miRBase [86] and miRDeep2 [74] were used to identify miRNAs in various tissues including bovine mammary gland tissues [90], milk fat [90–92], milk whey and cells [90].

**3.2. Tools for lncRNA identification**

To date, a large number of lncRNA genes have been identified in the genomes of human (141,353), cow (23,896) and chicken (13,085) (http://www.bioinfo.org/noncode/analysis.php, accessed on 24-03-2017). Several methods, such as Coding Potential Calculator (CPC) [122], PhyloCSF [123], Coding-Non-Coding Index (CNCI) [124], Coding Potential Assessment Tool (CPAT) [125], Predictor of Long non-coding RNAs and mRNAs based on an improved k-mer scheme (PLEK) [126] and Flexible Extraction of LncRNAs (FEELnc) [127], have been described to distinguish lncRNAs from mRNAs and have been successfully applied to livestock species. The FEELnc program, developed by the Functional Annotation of Animal Genomes (FAANG) consortium [128], is recommended as a standardized protocol for lncRNA analyses in animal species. In order to distinguish lncRNAs from mRNAs, the FEELnc program uses a machine-learning method to estimate a protein-coding score according to the
