2. Methods

with chronic disease conditions including osteoarthritis, breast and ovarian cancer, diabetes, and cardiovascular disease. The purpose of this study is to explore the relationships between RNA-RNA interactions and to devise a related measure of resilience from network properties

80 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

At various levels, RNA is processed by alternate mechanisms [1], suggesting a biological framework that supports important system network features such as resilience. Trafficking of RNAs is essential for cellular function and homeostasis, but only recently it has become possible to visualize molecular events in vivo. Analysis of RNA motion within the cell nucleus has been particularly intriguing as they have revealed an unanticipated degree of dynamics within the organelle [2]. Single-molecule RNA imaging methods have revealed that the intranuclear and cytoplasmic trafficking occurs largely by energy-independent mechanisms and is driven by diffusion. RNA molecules undergo constrained diffusion, largely limited by the spatial constraint imposed by chromatin and chromatin-binding proteins if in the nucleus as demonstrated in numerous studies. In the cell, transcripts move by a stop-and-go mechanism, where free diffusion is interrupted by random association with cellular structures [3]. The ability and mode of motion of RNAs has implications for how they find nuclear targets on chromatin or cellular sub-compartments and how macromolecular complexes are assembled in vivo. Most importantly, the dynamic nature of RNAs is emerging as a means to control physiological cellular responses and pathways [4]. For example, unexpectedly complicated nuclear egress and nuclear import of small RNAs is more common than previously

Much attention has been focused on noncoding RNAs and their physiological/pathological implications [6]. This focus in RNA research is ultimately directed toward understanding the regulation of protein-coding gene networks, but ncRNAs also form well-orchestrated regulatory interaction networks [7]. For example, computational prediction of miRNA target sites suggests a widespread network of miRNA-lncRNA interaction [8]. Others suggest the possibility of widespread interaction networks involving competitive endogenous RNAs (ceRNAs) where ncRNAs could modulate regulatory RNA by binding and titration of binding sites on protein coding messengers [9]. Cellular uptake and trafficking of RNA could be widespread [10]. As the number of experiments increases rapidly, and transcriptional units are better annotated, databases indexing RNA properties and function will become essential tools

Much of bioinformaticians sequence analyses focuses on methodologies based on string alignment algorithms. However, such approaches fail to discover genomic aspects of systemic nature regarding dynamics or resilience. An alternative framework is based on alignment-free methods of genome analysis, where global properties of genomes are investigated [11]. A key concept of informational analysis is that of probability distributions. A genomic, or in our case transcriptomic, distribution associates to discrete values defined on transcripts, the number of

to understand physiologic processes in the transcriptome.

1.2. Biological-omic information theory

of the whole transcriptome.

1.1. RNA physiology

appreciated [5].

We formally define the transcriptome as an information structure, and then construct several simple models as examples. The most realistic model is used to examine real datasets of partitioned RNAs for validation of framework.

#### 2.1. Transcriptome information theory structure

RNA sequence is abstractly represented as a string over the nucleotide alphabet R ¼ {A, C, G, U}. This can be extended to modified nucleotides with an extended alphabet R ffi {A, C, G, U, N}, such that symbol N represents a modified nucleotide. Wk denotes a set of alphabet letters of length k, called k-mers and Wdenotes the set of all possible nonempty strings over the alphabet R. Given a transcript string S ¼ s1, s2, … to sn, of length n, S[i, j] with 1 ≤ i ≤ j ≤ n is the substring of S from position i to position j (included). The length of S is |S| ¼ n. Substrings of S of length k are called k-words or simply words of S. In the following, the entire transcriptome is denoted by W based on k-mer dictionaries and entropies, which are aimed at defining and computing informational indexes for representative sets of transcriptomes. We assume that the complexity of a transcriptome increases with its distance from randomness, as identified by suitable comparison between transcriptomes of the same length. This framework provides clues about the appropriate k length to consider for analysis of transcriptome properties.

#### 2.2. Spatial transcriptome information cloud (STIC) model construction

We hypothesized that miRNA localization in cellular compartments is an emergent property from Brownian motion interactions of a cloud of RNA sequences and RNA-binding proteins that can be analyzed in W [15]. There k-mer words of miRNA functional size were added to a dictionary from sliding windows of transcript sequences S. A prediction from this cloud model is that anomalous diffusion can occur if random-walk transcripts interact with their surrounding scaffold as a stochastic semantic cloud, and if the cloud relaxation time is a longer time frame than transit [16]. We showed that RNAs with sequences similar to the whole transcriptome exhibit modified or enhanced transport compared to RNA sequences without similar sequences [17]. Thus, RNAs were found to partition into different cellular compartments based on a semantic similarity of word compositions within Wk. We determine the frequency of all k-mers in the transcriptome as a matrix composed of RNA sequences and their word copy levels. For each transcript, we count the number of k-mer words in common within Wk or dictionary as a semantic similarity measure to the transcriptome, and we can also able to compare such counts to randomized Wk sequence words.

from this model with a large set of partial differential equations modeling RNA mobility, but was described as computationally prohibitive [15] for any realistic sized transcriptome. Each sequence S would be dynamically modeled with a neighborhood sized number of RNA-RNA interactions. Instead, we pursue a thermodynamic approach based on the Fokker-Planck equation to quantify stochastic processes in liquid medium [20]. RNA interactions are assumed to function over sequence lengths encompassing small single-stranded and solvent-accessible

Models of RNA Interaction from Experimental Datasets: Framework of Resilience

http://dx.doi.org/10.5772/intechopen.69452

83

Assuming smallest word in the cloud is 3 nt long, corresponding to the lower limit of size for a seed sequence in miRNA [18]. The upper limit for word size is set at 22 nt, corresponding to the size of a typical mature miRNA. Again, this is the same as the miRNA response elements (MRE) size in the simpler related ceRNA hypothesis by Salmena [9]. Instead, we determine the frequency of all words in the transcriptome as a matrix composed of RNA sequences and copy levels from RNA-seq datasets. For each transcript, count number of words in common with the cloud dictionary as a similarity measure to the transcriptome (tCount), and also count reverse complement words (rcCount) for RNA-RNA interactions. These raw counts can be multiplied by the frequencies of repetitive words in W to yield tWord ¼ tCount \* tFreq and rcWord ¼ rcCount \* rcFreq. As shown by Seffens [17], miRNAs with greater similarity to the transcriptome, i.e., greater tCount and tWord, are suggested to diffuse differentially based on spatial partitioning. In addition, greater intramolecular RNA-RNA interaction would be expected to hinder diffusion. This work proposes a general RNA sequence function that combines the influence of similarity with native (NAM) and reverse complementarity (RCM) measures as a cloud interaction function: C[W, NAM, RCM], such that cloud interactions increase with RCM, and decrease with NAM. Transcripts with low C would have "ideal solution" diffusion coefficients and found in cytoplasmic compartment, and those with greater C would be slowed by RNA-RNA interactions

For each component word of a transcript, determine whether it is expected to be in a singlestranded and solvent-accessible state (state "A"), or double-stranded or buried within the RNA molecule and is inaccessible (state "I"). For model calculations and preliminary studies (Model-1 W1 discussed later), we assume all words are accessible in state "A," and the transcriptome is uniform within the cell (i.e., ignore distance r from transcription site). Construct a matrix Wk(T, fA, fI, r) for each word size k, and populate the respective matrix with the component words of the transcriptome from RNA-seq reads such that S is the actual word sequence, f<sup>A</sup> is the frequency or number of accessible words of that sequence, and f<sup>I</sup> is the frequency or number of inaccessible words. Matrix Wk then contains information of all transcript sub-sequences and is a representation of the spatial transcriptome information cloud in some volume elements of the cell fraction. Let the diffusion coefficient for a transcript be described [24]as DRNA [21]. Then the

, r , S <sup>¼</sup> <sup>D</sup>RNA

where C is the cloud interaction term for molecule RNA exhibiting probabilities of RNA-RNA interactions as a function of the STIC represented by matrix Wk at some position r in the cell.

ideal � RCM <sup>þ</sup> NAM <sup>ð</sup>1<sup>Þ</sup>

and hence enriched in nuclear or perinuclear compartments.

effect of interaction of that transcript with the cloud would yield

ideal � <sup>C</sup> Wk S, f <sup>A</sup>, f <sup>I</sup>

2.2.1. Accessibility of an RNA sequence word

DRNA ¼ DRNA

regions.

Model assumptions are: (1) RNA diffuses away from point of transcription creating a cloud of k-mer sequences. (2) All RNAs comprise the transcriptome, and each transcript is affected by local RNAs with effective interaction windows of some sequence word length k nucleotides (nt). We assume significant k-mer word size to be 3–22 nt, which is equal to the functional miRNA size at the high side, and down to below the size of the "core" sequence [18]. (3) The diffusion rate of individual RNAs depends on degree of (a) sequence similarity and (b) reverse complementarity of RNA words at that location in the STIC (Figure 1). (4) Cloud dictionaries (collection of transcriptome word sets in Wk) change as function of distance from transcriptome site and cell state. (5) The cloud affects anomalous RNA diffusion that can give rise to an emergent and patterned behavior in the cell [19].

We model the RNA sequence word content of the transcriptome cloud as a function of distance from transcription site at the chromosome. RNA molecule diffusion in nuclear compartments would lead to cytoplasmic and extracellular localization of RNA if the transcript half-life is greater than its transit time. Calculations at arbitrary transit distances could be determined

Figure 1. K-mer words classified as solvent-accessible or inaccessible. Note that in this framework, k-mer words are generated from a sliding window and not from a contiguous word segments as could be interpreted by the figures adjacent to word blocks. Transcriptome is a dictionary of N þ n words and their associated frequencies. Interaction of transcript with transcriptome affects diffusion from solvent-accessible bases (words). Transcript itself is a part of the transcriptome.

from this model with a large set of partial differential equations modeling RNA mobility, but was described as computationally prohibitive [15] for any realistic sized transcriptome. Each sequence S would be dynamically modeled with a neighborhood sized number of RNA-RNA interactions. Instead, we pursue a thermodynamic approach based on the Fokker-Planck equation to quantify stochastic processes in liquid medium [20]. RNA interactions are assumed to function over sequence lengths encompassing small single-stranded and solvent-accessible regions.

Assuming smallest word in the cloud is 3 nt long, corresponding to the lower limit of size for a seed sequence in miRNA [18]. The upper limit for word size is set at 22 nt, corresponding to the size of a typical mature miRNA. Again, this is the same as the miRNA response elements (MRE) size in the simpler related ceRNA hypothesis by Salmena [9]. Instead, we determine the frequency of all words in the transcriptome as a matrix composed of RNA sequences and copy levels from RNA-seq datasets. For each transcript, count number of words in common with the cloud dictionary as a similarity measure to the transcriptome (tCount), and also count reverse complement words (rcCount) for RNA-RNA interactions. These raw counts can be multiplied by the frequencies of repetitive words in W to yield tWord ¼ tCount \* tFreq and rcWord ¼ rcCount \* rcFreq. As shown by Seffens [17], miRNAs with greater similarity to the transcriptome, i.e., greater tCount and tWord, are suggested to diffuse differentially based on spatial partitioning. In addition, greater intramolecular RNA-RNA interaction would be expected to hinder diffusion. This work proposes a general RNA sequence function that combines the influence of similarity with native (NAM) and reverse complementarity (RCM) measures as a cloud interaction function: C[W, NAM, RCM], such that cloud interactions increase with RCM, and decrease with NAM. Transcripts with low C would have "ideal solution" diffusion coefficients and found in cytoplasmic compartment, and those with greater C would be slowed by RNA-RNA interactions and hence enriched in nuclear or perinuclear compartments.

#### 2.2.1. Accessibility of an RNA sequence word

similar sequences [17]. Thus, RNAs were found to partition into different cellular compartments based on a semantic similarity of word compositions within Wk. We determine the frequency of all k-mers in the transcriptome as a matrix composed of RNA sequences and their word copy levels. For each transcript, we count the number of k-mer words in common within Wk or dictionary as a semantic similarity measure to the transcriptome, and we can also able to

Model assumptions are: (1) RNA diffuses away from point of transcription creating a cloud of k-mer sequences. (2) All RNAs comprise the transcriptome, and each transcript is affected by local RNAs with effective interaction windows of some sequence word length k nucleotides (nt). We assume significant k-mer word size to be 3–22 nt, which is equal to the functional miRNA size at the high side, and down to below the size of the "core" sequence [18]. (3) The diffusion rate of individual RNAs depends on degree of (a) sequence similarity and (b) reverse complementarity of RNA words at that location in the STIC (Figure 1). (4) Cloud dictionaries (collection of transcriptome word sets in Wk) change as function of distance from transcriptome site and cell state. (5) The cloud affects anomalous RNA diffusion that can give rise to an

We model the RNA sequence word content of the transcriptome cloud as a function of distance from transcription site at the chromosome. RNA molecule diffusion in nuclear compartments would lead to cytoplasmic and extracellular localization of RNA if the transcript half-life is greater than its transit time. Calculations at arbitrary transit distances could be determined

Figure 1. K-mer words classified as solvent-accessible or inaccessible. Note that in this framework, k-mer words are generated from a sliding window and not from a contiguous word segments as could be interpreted by the figures adjacent to word blocks. Transcriptome is a dictionary of N þ n words and their associated frequencies. Interaction of transcript with transcriptome affects diffusion from solvent-accessible bases (words). Transcript itself is a part of the

compare such counts to randomized Wk sequence words.

82 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

emergent and patterned behavior in the cell [19].

transcriptome.

For each component word of a transcript, determine whether it is expected to be in a singlestranded and solvent-accessible state (state "A"), or double-stranded or buried within the RNA molecule and is inaccessible (state "I"). For model calculations and preliminary studies (Model-1 W1 discussed later), we assume all words are accessible in state "A," and the transcriptome is uniform within the cell (i.e., ignore distance r from transcription site). Construct a matrix Wk(T, fA, fI, r) for each word size k, and populate the respective matrix with the component words of the transcriptome from RNA-seq reads such that S is the actual word sequence, f<sup>A</sup> is the frequency or number of accessible words of that sequence, and f<sup>I</sup> is the frequency or number of inaccessible words. Matrix Wk then contains information of all transcript sub-sequences and is a representation of the spatial transcriptome information cloud in some volume elements of the cell fraction. Let the diffusion coefficient for a transcript be described [24]as DRNA [21]. Then the effect of interaction of that transcript with the cloud would yield

$$D\_{\rm RNA} = D\_{\rm RNA}{}^{\rm ideal} - \mathfrak{C} \left[ W\_k \left( \mathcal{S}\_\prime f\_{\mathcal{A}'} f\_{\mathcal{I}'} r \right), \mathcal{S} \right] = D\_{\rm RNA}{}^{\rm ideal} - \rm R\mathcal{C}M + \rm NAM \tag{1}$$

where C is the cloud interaction term for molecule RNA exhibiting probabilities of RNA-RNA interactions as a function of the STIC represented by matrix Wk at some position r in the cell. For RNA expression data from the whole cell, r is ignored. In experiments from purified nuclei, r ranges from 0 to the radius of the cell's nucleus, rN. In experiments derived from the cytosol, r ranges from r<sup>N</sup> to the cellular membrane radius rC. Experimental data from extracellular vesicles will have r > r<sup>C</sup> and RNA half-life becomes important to consider as a factor. As a first approximation for the C function, we assume that the deviation from ideal DRNA scales as the number of reverse complement words (rcCount) in common with transcriptome Wk and is measured by difference to the number tCount of words in common with Wk, which normalizes for transcript size. We could also compare to ranCount, number of words in common with a randomized Wk. Putting together, we have

$$\mathfrak{C}[W\_k, S] \, := \alpha \, \text{rc} \mathbf{C} \text{ount}/4^k \,\tag{2}$$

Refs. [25, 26]) in a pooling predictor using machine learning [27]. Additionally, nucleotide solvent-accessibility in RNA structures could be estimated by the neural network method of Singh [22] using models of window size 3 nt, which could be expanded to 5–9 nt windows for k length. Alternatively, accessible surface area can be calculated by a publically available program NACCESS [28] to refine the STIC transcriptome words to those populated from singlestranded regions only, along with confidence measures. Solvent accessibility estimates for each transcript word partition the frequency entries in the transcriptome matrix Wk(S, fA, fI, r) by reducing f<sup>A</sup> in the amount that f<sup>I</sup> increases. Shifts of f<sup>A</sup> to f<sup>I</sup> could be caused by RNA-binding factors (RNA or protein) that cover a word in the transcript or the word in the transcriptome, or indirectly by binding to some other region of the RNA causing a cis-type of structural alteration leading to solvent inaccessibility (Figure 2). Transcriptome cloud or nebula regulation is introduced here and is proposed to occur as an indirect result of some factor that changes fA/f<sup>I</sup> for some word that then alters a different interacting transcript's diffusion coefficient. Conversely, f<sup>I</sup> to f<sup>A</sup> shifts could be caused by the release of binding factors or conformational change leading to exposure of the particular word and to nearby target RNAs. Dictionaries with this dynamic accounting of the transcriptome are labeled T, instead of the

Models of RNA Interaction from Experimental Datasets: Framework of Resilience

http://dx.doi.org/10.5772/intechopen.69452

85

Figure 2. Words classified as solvent-accessible or -inaccessible in dynamic transitions. Interaction of RNA binding-

proteins and other RNA affects balance of solvent-accessible words in transcriptome and transcript.

simpler W word matrix.

to normalize number of words, or alternately,

$$\mathfrak{C}[W\_{\text{k}}, \mathrm{S}] = a \begin{pmatrix} \mathrm{rcCount} \; \mathrm{\*} \; \mathrm{rcFreq} - \mathrm{tCount} \; \mathrm{\*} \; \mathrm{tFreq} \end{pmatrix} \\ = \mathrm{RCM} \; \mathrm{for} \; a = 1.0 \tag{3}$$

where α is a scaling factor and is dimensionless. The reverse complement measure (RCM), which factors rcCount word frequencies by rcFreq, then subtracting the count of words in common with transcriptome (tCount) by the corresponding tFreq, is one of several possible measures for correlation to measured compartmentalization of individual transcripts from RNA-expression datasets. The content of Wk(r) changes as a function of r due to changing concentrations of transcripts in the cell. Boundary condition on whole cell measurement from microarray or RNA-seq experiments would be

$$\mathcal{W}\_k = \int\_0^{r\_0} \mathcal{W}\_k(r) \, dr \tag{4}$$

assuming no export from the cell. If there are no reverse complement words in common between transcript S and Wk, then C [Wk, S] is zero and the diffusion of that molecule is ideal. As a first approximation for the C function, we assume that the deviation from ideal DRNA scales as the number (rcCount) of reverse complement words in common with Wk and is compared by a difference to the number tCount of words in common with Wk to normalize for transcript size. We could also compare to ranCount, number of words in common with a randomized Wk. Reverse complement measure (RCM) factors word frequencies to assess transcript-cloud interactions that correlate to measured compartmentalization of individual transcripts [17].

#### 2.2.2. Words that are solvent-accessible

The above model treatment assumed that all RNA sequences are available for reverse complementarity interactions. RNAs except for miRNAs typically have regions that are solventinaccessible and/or double-stranded, preventing intramolecular interactions [22]. mRNAs have more secondary structures or intra-strand base pairing than expected by chance [23]. We have determined the secondary structure of all RefSeq transcripts to predict single-stranded regions using RNAfold [24], while others have used RNA structure predictors (RNAplfold in Refs. [25, 26]) in a pooling predictor using machine learning [27]. Additionally, nucleotide solvent-accessibility in RNA structures could be estimated by the neural network method of Singh [22] using models of window size 3 nt, which could be expanded to 5–9 nt windows for k length. Alternatively, accessible surface area can be calculated by a publically available program NACCESS [28] to refine the STIC transcriptome words to those populated from singlestranded regions only, along with confidence measures. Solvent accessibility estimates for each transcript word partition the frequency entries in the transcriptome matrix Wk(S, fA, fI, r) by reducing f<sup>A</sup> in the amount that f<sup>I</sup> increases. Shifts of f<sup>A</sup> to f<sup>I</sup> could be caused by RNA-binding factors (RNA or protein) that cover a word in the transcript or the word in the transcriptome, or indirectly by binding to some other region of the RNA causing a cis-type of structural alteration leading to solvent inaccessibility (Figure 2). Transcriptome cloud or nebula regulation is introduced here and is proposed to occur as an indirect result of some factor that changes fA/f<sup>I</sup> for some word that then alters a different interacting transcript's diffusion coefficient. Conversely, f<sup>I</sup> to f<sup>A</sup> shifts could be caused by the release of binding factors or conformational change leading to exposure of the particular word and to nearby target RNAs. Dictionaries with this dynamic accounting of the transcriptome are labeled T, instead of the simpler W word matrix.

For RNA expression data from the whole cell, r is ignored. In experiments from purified nuclei, r ranges from 0 to the radius of the cell's nucleus, rN. In experiments derived from the cytosol, r ranges from r<sup>N</sup> to the cellular membrane radius rC. Experimental data from extracellular vesicles will have r > r<sup>C</sup> and RNA half-life becomes important to consider as a factor. As a first approximation for the C function, we assume that the deviation from ideal DRNA scales as the number of reverse complement words (rcCount) in common with transcriptome Wk and is measured by difference to the number tCount of words in common with Wk, which normalizes for transcript size. We could also compare to ranCount, number of words in common with a

84 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

<sup>C</sup>½ �¼ Wk, S <sup>α</sup> rcCount=4<sup>k</sup> <sup>ð</sup>2<sup>Þ</sup>

WkðrÞ dr ð4Þ

<sup>C</sup> Wk ½ �¼ , S <sup>α</sup> rcCount � rcFreq – tCount � tFreq � � <sup>¼</sup> RCM for <sup>α</sup> <sup>¼</sup> <sup>1</sup>:<sup>0</sup> <sup>ð</sup>3<sup>Þ</sup>

where α is a scaling factor and is dimensionless. The reverse complement measure (RCM), which factors rcCount word frequencies by rcFreq, then subtracting the count of words in common with transcriptome (tCount) by the corresponding tFreq, is one of several possible measures for correlation to measured compartmentalization of individual transcripts from RNA-expression datasets. The content of Wk(r) changes as a function of r due to changing concentrations of transcripts in the cell. Boundary condition on whole cell measurement from

Wk ¼

ðrC 0

assuming no export from the cell. If there are no reverse complement words in common between transcript S and Wk, then C [Wk, S] is zero and the diffusion of that molecule is ideal. As a first approximation for the C function, we assume that the deviation from ideal DRNA scales as the number (rcCount) of reverse complement words in common with Wk and is compared by a difference to the number tCount of words in common with Wk to normalize for transcript size. We could also compare to ranCount, number of words in common with a randomized Wk. Reverse complement measure (RCM) factors word frequencies to assess transcript-cloud interactions that correlate to measured compartmentalization of individual

The above model treatment assumed that all RNA sequences are available for reverse complementarity interactions. RNAs except for miRNAs typically have regions that are solventinaccessible and/or double-stranded, preventing intramolecular interactions [22]. mRNAs have more secondary structures or intra-strand base pairing than expected by chance [23]. We have determined the secondary structure of all RefSeq transcripts to predict single-stranded regions using RNAfold [24], while others have used RNA structure predictors (RNAplfold in

randomized Wk. Putting together, we have

to normalize number of words, or alternately,

microarray or RNA-seq experiments would be

transcripts [17].

2.2.2. Words that are solvent-accessible

Figure 2. Words classified as solvent-accessible or -inaccessible in dynamic transitions. Interaction of RNA bindingproteins and other RNA affects balance of solvent-accessible words in transcriptome and transcript.

#### 2.2.3. Genome-wide profiling of in vivo RNA structure

Recent transcriptome-wide RNA structure profiling through application of structure-probing enzymes or chemicals combined with high-throughput sequencing promises in vivo RNA structural information availability. Resultant datasets provide opportunity to investigate RNA structural information on global scale for STIC development. The analysis of high-throughput RNA structure profiling data requires considerable computational effort and currently dataset are not readily available. StructureFold processes and performs analysis of raw high-throughput RNA structure profiling data [29], incorporating wet-bench structural information from chemical probes and ribonucleases to restrain RNA structure prediction via RNAstructure and ViennaRNA package algorithms. StructureFold is deployed via the Galaxy platform. Alternatively, structure-seq is a recent quantitative and high-throughput method that provides genome-wide information on RNA structure with single-nucleotide resolution [30]. The methodology can perform both in vitro and in vivo RNA structure-function determinations, with insights to RNA regulation of gene expression and RNA processing. Implementation of structure-seq begins with chemical RNA structure probing under single-hit kinetics conditions. Modified RNA is then subjected to reverse transcription using random hexamer primers, then reverse transcription executed until it is blocked by chemically modified residues. Resultant cDNAs are amplified by adapter-based polymerase chain reaction (PCR) which are subjected to high-throughput sequencing, subsequently allowing retrieval of structural information on a genome-wide scale. A single structure-seq experiment can provide information on tens of thousands of RNA structures in a matter of weeks. Ding et al. [31] used RNAstructure calculated for each of thousands of mRNAs a positive predictive value (PPV), which they use to compare relative frequencies of base pairing in vivo constrained RNA structures to in silico predicted RNA structure. They found that most mRNAs do not fold in vivo as to in silicopredicted structures, as evident from a broad PPV distribution. Interestingly, mRNAs of cold and metal ion stress-response genes folded in vivo significantly different from their unconstrained in silico predictions. These stressors are known to affect RNA structure and thermo-stability like melting temperature Tm. Instead, genes involved with basic biological functions such as gene expression, protein maturation or processing, and peptide metabolic processes show little change in their in vivo-constrained and in silico-predicted RNA secondary structures. Ding speculated mRNAs related to cell maintenance and showing high PPV may have evolved to resist large conformational changes in order to maintain homeostasis, an idea suggesting RNA resilience. This bias may be detectable in our transcriptome model W. As genome-wide profiling of in vivo RNA structure datasets becomes easily available [32], the information can be incorporated into the STIC model by adding custom transform and load tools for each dataset.

ribosomes and tRNAs are all "A" for adenosine, and the mRNAs are also all "A." We can construct transcriptome model W0 using transcript values found in Table 1 from Seffens [17], with sizes n and respective abundances. The number of nonunique words (of size k) would be n – k þ 1 for each transcript. Using sizes and respective abundance values in Table 1 [17], gives 2.8�1010 words, composed of all "A"s. Assume mRNAs are 2 kb "A"s, then there are no reverse complement interactions, so RCM should be zero, while NAM would be maximal (RCM and NAM defined in Section 2.2). The diffusion coefficient for these mRNAs would be ideal since there are no base-pairing interactions within W0 and transcript similarity to the transcriptome is maximal. Now assume the mRNAs are all "U"s, they now strongly interact

Table 1. Transcriptome construction from different RNA biotypes. Transcriptome model W1 is a subset of 8 human

Class Biotype Network system function

ncRNA rRNA Most abundant; constant character ncRNA tRNA Most frequent; constant character Protein coding mRNA Act as gene; bigger than coded protein ncRNA miRNA Block some mRNA; smallest func. RNA Processed transcripts lncRNA Block or regulate some other RNA

(while NAM would be zero). This would be the expected maximal value for RCM with mRNAs of 2 kb size, yielding diffusion coefficients smaller than ideal values. These RNAs would exhibit larger intramolecular RNA-RNA interactions, and they do not look (NAM ¼ 0)

ideal) as the rest of the transcriptomes.

Now assume that the transcriptome W0 is composed of completely random sequences of A, C, <sup>G</sup>, <sup>U</sup>s-labeled W0-R. How many of the 2.8�1010 words of length <sup>k</sup> composed of four different letters would be unique (not identical) in the model? Combination of all possible k-mer words would be 422 <sup>¼</sup> 1.76�10<sup>13</sup> since there are four possible nucleotide letters at each of the <sup>k</sup> positions. Since there are about 1000 times more possible combinations than there are k ¼ 22 words, we could assume that all 22-mer words are unique. Smaller values of k will result in repeats or duplicate words increasing frequency values in Wk. These calculations give an

The next more realistic model is composed of eight real human RNA transcripts comprising a simple representation of the transcriptome in a cell (Table 1 and in Ref. [17]). It is assembled

. RCM becomes 2000�2.8�1010 or 5.6�1013 for each mRNA transcript

Models of RNA Interaction from Experimental Datasets: Framework of Resilience

http://dx.doi.org/10.5772/intechopen.69452

87

with the majority of W0

Transcriptome construction

Transcriptome <sup>¼</sup> <sup>Σ</sup> <sup>W</sup>Biotype

rRNA and tRNA types.

Notes: Sequence words summed into dictionary over all transcript types.

and behave (DRNA ≪ DRNA

2.3.2. Transcriptome model 0-R (W0-R)

2.3.3. Transcriptome model 1 (W1

expected value for RCM based on no biases in the sequences.

)

#### 2.3. Simplest models of transcriptomes

We consider simplest models of the transcriptome to examine some limits on the model parameters and functions. Component of a transcriptome are listed in Table 1.

#### 2.3.1. Transcriptome model 0 (W0 )

Simplest model considers a spatially uniform transcriptome composed of ribosomes, tRNA, and a minor number of mRNAs. Here, the distribution of transcripts is assumed uniform and the transcriptome model lacks spatial elements. Further, consider a transcriptome where


#### Transcriptome construction

2.2.3. Genome-wide profiling of in vivo RNA structure

86 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

model by adding custom transform and load tools for each dataset.

)

We consider simplest models of the transcriptome to examine some limits on the model

Simplest model considers a spatially uniform transcriptome composed of ribosomes, tRNA, and a minor number of mRNAs. Here, the distribution of transcripts is assumed uniform and the transcriptome model lacks spatial elements. Further, consider a transcriptome where

parameters and functions. Component of a transcriptome are listed in Table 1.

2.3. Simplest models of transcriptomes

2.3.1. Transcriptome model 0 (W0

Recent transcriptome-wide RNA structure profiling through application of structure-probing enzymes or chemicals combined with high-throughput sequencing promises in vivo RNA structural information availability. Resultant datasets provide opportunity to investigate RNA structural information on global scale for STIC development. The analysis of high-throughput RNA structure profiling data requires considerable computational effort and currently dataset are not readily available. StructureFold processes and performs analysis of raw high-throughput RNA structure profiling data [29], incorporating wet-bench structural information from chemical probes and ribonucleases to restrain RNA structure prediction via RNAstructure and ViennaRNA package algorithms. StructureFold is deployed via the Galaxy platform. Alternatively, structure-seq is a recent quantitative and high-throughput method that provides genome-wide information on RNA structure with single-nucleotide resolution [30]. The methodology can perform both in vitro and in vivo RNA structure-function determinations, with insights to RNA regulation of gene expression and RNA processing. Implementation of structure-seq begins with chemical RNA structure probing under single-hit kinetics conditions. Modified RNA is then subjected to reverse transcription using random hexamer primers, then reverse transcription executed until it is blocked by chemically modified residues. Resultant cDNAs are amplified by adapter-based polymerase chain reaction (PCR) which are subjected to high-throughput sequencing, subsequently allowing retrieval of structural information on a genome-wide scale. A single structure-seq experiment can provide information on tens of thousands of RNA structures in a matter of weeks. Ding et al. [31] used RNAstructure calculated for each of thousands of mRNAs a positive predictive value (PPV), which they use to compare relative frequencies of base pairing in vivo constrained RNA structures to in silico predicted RNA structure. They found that most mRNAs do not fold in vivo as to in silicopredicted structures, as evident from a broad PPV distribution. Interestingly, mRNAs of cold and metal ion stress-response genes folded in vivo significantly different from their unconstrained in silico predictions. These stressors are known to affect RNA structure and thermo-stability like melting temperature Tm. Instead, genes involved with basic biological functions such as gene expression, protein maturation or processing, and peptide metabolic processes show little change in their in vivo-constrained and in silico-predicted RNA secondary structures. Ding speculated mRNAs related to cell maintenance and showing high PPV may have evolved to resist large conformational changes in order to maintain homeostasis, an idea suggesting RNA resilience. This bias may be detectable in our transcriptome model W. As genome-wide profiling of in vivo RNA structure datasets becomes easily available [32], the information can be incorporated into the STIC

Notes: Sequence words summed into dictionary over all transcript types.

Table 1. Transcriptome construction from different RNA biotypes. Transcriptome model W1 is a subset of 8 human rRNA and tRNA types.

ribosomes and tRNAs are all "A" for adenosine, and the mRNAs are also all "A." We can construct transcriptome model W0 using transcript values found in Table 1 from Seffens [17], with sizes n and respective abundances. The number of nonunique words (of size k) would be n – k þ 1 for each transcript. Using sizes and respective abundance values in Table 1 [17], gives 2.8�1010 words, composed of all "A"s. Assume mRNAs are 2 kb "A"s, then there are no reverse complement interactions, so RCM should be zero, while NAM would be maximal (RCM and NAM defined in Section 2.2). The diffusion coefficient for these mRNAs would be ideal since there are no base-pairing interactions within W0 and transcript similarity to the transcriptome is maximal. Now assume the mRNAs are all "U"s, they now strongly interact with the majority of W0 . RCM becomes 2000�2.8�1010 or 5.6�1013 for each mRNA transcript (while NAM would be zero). This would be the expected maximal value for RCM with mRNAs of 2 kb size, yielding diffusion coefficients smaller than ideal values. These RNAs would exhibit larger intramolecular RNA-RNA interactions, and they do not look (NAM ¼ 0) and behave (DRNA ≪ DRNA ideal) as the rest of the transcriptomes.

#### 2.3.2. Transcriptome model 0-R (W0-R)

Now assume that the transcriptome W0 is composed of completely random sequences of A, C, <sup>G</sup>, <sup>U</sup>s-labeled W0-R. How many of the 2.8�1010 words of length <sup>k</sup> composed of four different letters would be unique (not identical) in the model? Combination of all possible k-mer words would be 422 <sup>¼</sup> 1.76�10<sup>13</sup> since there are four possible nucleotide letters at each of the <sup>k</sup> positions. Since there are about 1000 times more possible combinations than there are k ¼ 22 words, we could assume that all 22-mer words are unique. Smaller values of k will result in repeats or duplicate words increasing frequency values in Wk. These calculations give an expected value for RCM based on no biases in the sequences.

#### 2.3.3. Transcriptome model 1 (W1 )

The next more realistic model is composed of eight real human RNA transcripts comprising a simple representation of the transcriptome in a cell (Table 1 and in Ref. [17]). It is assembled from four of the most prevalent human tRNAs with lengths of n ¼ 71–73 nt, and four of the major subunits of the eukaryotic ribosome with sizes from n ¼ 121 to 5034 nt, with the total number of nucleotides N being the sum of the nucleotides in each transcript, or N ¼ 7470 nt. Then the frequency of words with length k that are contained in each transcript is a subset of the number of possible <sup>k</sup>-mer words which is <sup>n</sup> – <sup>k</sup> <sup>þ</sup> 1. In Model-1 labeled as W<sup>1</sup> , for each word length from k ¼ 3 to 22, word count was calculated along with the sum of the frequencies of those words extracted from the simple eight RNA transcripts. The intermediate output from program TIC-generator (for transcriptome information cloud generator, described in Ref. [17]) listed all k-mer words contained in each transcript, together with their frequency of occurrence. These lists from the eight rRNA and tRNA transcripts were combined, and then duplicate words resolved to form dictionary Wk 1 . With the total possible number of words of length k ¼ 4k , the fraction of all the words actually present in Wk <sup>1</sup> decreased for increasing word size [17]. It is interesting that the peak in unique and total duplicate (blue diamonds in Ref. [17]) words is maximal at the same size as the miRNA "seed" sequence as defined in Ref. [18].

is normalized for sequence length. The first five studies examined miRNA, while the Chen [33] and Friedel [34] studies measured mRNAs. Description of data sources that were grouped into high and low study parameter sets, with mean values and t-tests calculated detailed in sections

Models of RNA Interaction from Experimental Datasets: Framework of Resilience

http://dx.doi.org/10.5772/intechopen.69452

89

The Villarroya-Beltri [35] work reports on microarray datasets of exosome and cellular fractions from activated and resting human T lymphocyte cultures. They differentially assessed whether RNAs are specifically enriched within exosomes by performing microarray analysis of activation-induced variations in mRNA and miRNA profiles from primary T lymphoblast and their secreted exosomes. Data found in their supplementary data and also data publicly available at gene expression omnibus as Gene Expression Omnibus (GEO) Series accession number GSE50972 were used for Table 2. They showed that for most cases, miRNAs modulated upon activation are differentially found in cells and exosomes for either upregulated or downregulated miRNAs. This suggests that mRNA and miRNA loading into exosomes is not a simple passive process. Specific miRNAs were more highly expressed in exosomes than found in the cells, and in most cases this difference is preserved under cellular resting or activated conditions. Similarly, most miRNAs that are preferentially found in cells than in exosomes also keep this tendency regardless of the activation state of the cell. As such, Villarroya-Beltri classified some miRNAs as specifically sorted into exosomes (labeled EXOmiRNAs), whereas others are specifically retained in cells (as CLmiRNAs). We calculated tCount and rcCount as a count (Table 2 in Ref. [17]), and tWord and rcWord, the latter which factor the expression level of that word. Other measures compared counts and words to a randomized transcriptome (RAN). We used a word size k ¼ 7 roughly equal to the miRNA seed sequence length [17]. Values of rcWord (mean 10.31) were lower than tWord (mean 12.45), and hence RCM and RCC were more negative for exosomes compared to cytoplasmic miRNAs. This supports the STIC model since exosome transcripts must diffuse further than cytoplasmic (CL) RNA, so avoid reverse complementarity. In summary, all trial measures calculated from this dataset showed

2.4.1. Model 1 validation with miRNA from exosome datasets

significant support for the transcriptome model except for Z-RCM.

Park et al. [36] study compared microarray analysis of cytoplasmic and in this case nuclear fractions of hct116 colon cancer cells. They identified various miRNAs that exist in isolated nuclei from miRNA profiles correlated between cytoplasmic and nuclear fractions from multiple microarray analyses. Nuclear confinement of the mature form of miRNAs was validated by controlling reverse transcriptase RT-PCR conditions excluding the presence of precipitate forms of miRNA (e.g., as pri-miRNA or pre-miRNA). They found that elevated levels of representative miRNAs in purified nuclei support the idea that significant numbers of mature miRNAs survive not only in the cytoplasm but also in the nucleus. We sorted their data by N/C ratio and \*partitioned these data into two groups: N/C > 0.47, which was nuclear-enriched (45 samples), and N/C < 0.47, which was preferentially found in the cytoplasm (33 samples). We found that tCount was 4.02 for nuclear-enriched, and 5.00 for cytoplasmic, with a t-test p-value of 0.116 between the groups; while tWord was 4.73 for nuclear and 10.58 for cytoplasmic

2.4.2. Model 1 validation with nuclear-enriched miRNAs

below.

### 2.3.4. Randomized transcriptome of Model-1 (W1-R)

We ran TIC-generator with shuffled-sequence transcripts labeled Model 1-R. Base composition of Model-1 transcriptome is 1341 "A," 2320 "C," 2519 "G," and 1291 "T," or 18% "A," 31% "C," 34% "G," and 17% "T." Using a random letter generator, we assembled four random transcriptomes with the same transcript length for the eight sequences and equal Model-1 base composition. We examined mostly word lengths k of 7 and 8 in preliminary studies shown below.

#### 2.4. Real model validation

As a validation of this transcriptome model framework, we utilized the simple transcriptome model version (simple model W1 ) that used real highly expressed genes, and for comparison separately, randomized sequences of that transcriptome (W1-R). This simple realistic model is composed of only eight real human RNA transcripts as a basic representation of the transcriptome in a cell. Experimental validation of the basic model transcriptome for k-mers considered various trial functions of semantic word similarity and reverse complementarity, which were calculated using published data sets. For example, trial functions evaluated include tWord for transcriptome words in common with target multiplied by respective word frequency in Wk. A total of seven RNA studies, with data sources grouped into high and low study parameter sets, were statistically analyzed by mean values and t-test calculated as two-tail t-test under two-sample equal variance assumption models (Table 2). Validation for the STIC model examined various functions of reverse complementarity using these published data sets. Here, we assume that appearance in exosomes or microparticles requires greater mobility and hence larger diffusion coefficients than cytoplasmic or nuclear RNAs [17]. Several functions tested include tWord for transcriptome words in common with target multiplied by word frequency in the transcriptome, rcWord (reverse complement k-mer words in common times frequency), RCM ¼ rcWord – tWord, reverse complement count (RCC) measure ¼ tCount – rcCount, Z-RCC as a zscore of RCC compared to four randomized transcriptomes Model 1-R, Z-RCM as the z-score of RCM, RCC-Ran which subtracts the value computed from 1-R and finally (RCC-Ran)/Len which is normalized for sequence length. The first five studies examined miRNA, while the Chen [33] and Friedel [34] studies measured mRNAs. Description of data sources that were grouped into high and low study parameter sets, with mean values and t-tests calculated detailed in sections below.

#### 2.4.1. Model 1 validation with miRNA from exosome datasets

from four of the most prevalent human tRNAs with lengths of n ¼ 71–73 nt, and four of the major subunits of the eukaryotic ribosome with sizes from n ¼ 121 to 5034 nt, with the total number of nucleotides N being the sum of the nucleotides in each transcript, or N ¼ 7470 nt. Then the frequency of words with length k that are contained in each transcript is a subset of

word length from k ¼ 3 to 22, word count was calculated along with the sum of the frequencies of those words extracted from the simple eight RNA transcripts. The intermediate output from program TIC-generator (for transcriptome information cloud generator, described in Ref. [17]) listed all k-mer words contained in each transcript, together with their frequency of occurrence. These lists from the eight rRNA and tRNA transcripts were combined, and then duplicate

It is interesting that the peak in unique and total duplicate (blue diamonds in Ref. [17]) words

We ran TIC-generator with shuffled-sequence transcripts labeled Model 1-R. Base composition of Model-1 transcriptome is 1341 "A," 2320 "C," 2519 "G," and 1291 "T," or 18% "A," 31% "C," 34% "G," and 17% "T." Using a random letter generator, we assembled four random transcriptomes with the same transcript length for the eight sequences and equal Model-1 base composition. We examined mostly word lengths k of 7 and 8 in preliminary studies shown

As a validation of this transcriptome model framework, we utilized the simple transcriptome

separately, randomized sequences of that transcriptome (W1-R). This simple realistic model is composed of only eight real human RNA transcripts as a basic representation of the transcriptome in a cell. Experimental validation of the basic model transcriptome for k-mers considered various trial functions of semantic word similarity and reverse complementarity, which were calculated using published data sets. For example, trial functions evaluated include tWord for transcriptome words in common with target multiplied by respective word frequency in Wk. A total of seven RNA studies, with data sources grouped into high and low study parameter sets, were statistically analyzed by mean values and t-test calculated as two-tail t-test under two-sample equal variance assumption models (Table 2). Validation for the STIC model examined various functions of reverse complementarity using these published data sets. Here, we assume that appearance in exosomes or microparticles requires greater mobility and hence larger diffusion coefficients than cytoplasmic or nuclear RNAs [17]. Several functions tested include tWord for transcriptome words in common with target multiplied by word frequency in the transcriptome, rcWord (reverse complement k-mer words in common times frequency), RCM ¼ rcWord – tWord, reverse complement count (RCC) measure ¼ tCount – rcCount, Z-RCC as a zscore of RCC compared to four randomized transcriptomes Model 1-R, Z-RCM as the z-score of RCM, RCC-Ran which subtracts the value computed from 1-R and finally (RCC-Ran)/Len which

. With the total possible number of words of length k ¼

) that used real highly expressed genes, and for comparison

<sup>1</sup> decreased for increasing word size [17].

, for each

the number of possible <sup>k</sup>-mer words which is <sup>n</sup> – <sup>k</sup> <sup>þ</sup> 1. In Model-1 labeled as W<sup>1</sup>

1

88 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

is maximal at the same size as the miRNA "seed" sequence as defined in Ref. [18].

words resolved to form dictionary Wk

, the fraction of all the words actually present in Wk

2.3.4. Randomized transcriptome of Model-1 (W1-R)

4k

below.

2.4. Real model validation

model version (simple model W1

The Villarroya-Beltri [35] work reports on microarray datasets of exosome and cellular fractions from activated and resting human T lymphocyte cultures. They differentially assessed whether RNAs are specifically enriched within exosomes by performing microarray analysis of activation-induced variations in mRNA and miRNA profiles from primary T lymphoblast and their secreted exosomes. Data found in their supplementary data and also data publicly available at gene expression omnibus as Gene Expression Omnibus (GEO) Series accession number GSE50972 were used for Table 2. They showed that for most cases, miRNAs modulated upon activation are differentially found in cells and exosomes for either upregulated or downregulated miRNAs. This suggests that mRNA and miRNA loading into exosomes is not a simple passive process. Specific miRNAs were more highly expressed in exosomes than found in the cells, and in most cases this difference is preserved under cellular resting or activated conditions. Similarly, most miRNAs that are preferentially found in cells than in exosomes also keep this tendency regardless of the activation state of the cell. As such, Villarroya-Beltri classified some miRNAs as specifically sorted into exosomes (labeled EXOmiRNAs), whereas others are specifically retained in cells (as CLmiRNAs). We calculated tCount and rcCount as a count (Table 2 in Ref. [17]), and tWord and rcWord, the latter which factor the expression level of that word. Other measures compared counts and words to a randomized transcriptome (RAN). We used a word size k ¼ 7 roughly equal to the miRNA seed sequence length [17]. Values of rcWord (mean 10.31) were lower than tWord (mean 12.45), and hence RCM and RCC were more negative for exosomes compared to cytoplasmic miRNAs. This supports the STIC model since exosome transcripts must diffuse further than cytoplasmic (CL) RNA, so avoid reverse complementarity. In summary, all trial measures calculated from this dataset showed significant support for the transcriptome model except for Z-RCM.

#### 2.4.2. Model 1 validation with nuclear-enriched miRNAs

Park et al. [36] study compared microarray analysis of cytoplasmic and in this case nuclear fractions of hct116 colon cancer cells. They identified various miRNAs that exist in isolated nuclei from miRNA profiles correlated between cytoplasmic and nuclear fractions from multiple microarray analyses. Nuclear confinement of the mature form of miRNAs was validated by controlling reverse transcriptase RT-PCR conditions excluding the presence of precipitate forms of miRNA (e.g., as pri-miRNA or pre-miRNA). They found that elevated levels of representative miRNAs in purified nuclei support the idea that significant numbers of mature miRNAs survive not only in the cytoplasm but also in the nucleus. We sorted their data by N/C ratio and \*partitioned these data into two groups: N/C > 0.47, which was nuclear-enriched (45 samples), and N/C < 0.47, which was preferentially found in the cytoplasm (33 samples). We found that tCount was 4.02 for nuclear-enriched, and 5.00 for cytoplasmic, with a t-test p-value of 0.116 between the groups; while tWord was 4.73 for nuclear and 10.58 for cytoplasmic


values of 4.56 and 6.00 compared to 5.58 and 8.80, respectively, for low abundance transcripts. This set of exosome miRNAs had RCM and RCC values of �1.54 and 3.8 compared to 0.36 and 5.8 for low abundance transcripts, again supporting the STIC model. Several of the trial

Models of RNA Interaction from Experimental Datasets: Framework of Resilience

http://dx.doi.org/10.5772/intechopen.69452

91

Pursuing in-depth understanding of the mechanism supporting selective exportation of miRNAs to extracellular vesicles (EVs), Guduric-Fuchs [39] employed next generation sequencing to discriminate global expression patterns of small RNAs in HEK293T cells and the EVs that they released. Enrichment of overexpressed miRNA in EVs was measured by RTqPCR in HEK293T cells, mesenchymal stem cells, macrophages, and immune cells. We sorted data from Guduric-Fuchs by EV/cell ratio, then compared the top 10 (exosome-enriched) and bottom (cytoplasmic enriched) miRNAs by evaluating the measures listed in Table 2. Only trial functions Z-RCC and RCC-RAN were significant from this dataset. Overall from using EV/cell in various measures examined across the studies, tWord and tCount (from Ref. [17]), along with their difference (tW–tC), have values that progress from lower for nuclear, higher for cytoplasmic, and highest for exosomal miRNAs. Therefore, we consider under transitivity, EXO > CL > NUC for these transcriptome measures of similarity. This supports the notion that miRNAs with sequence similarity to the overall transcriptome can random-walk furthest from their points of transcription if the secretion mechanism requires a great distance to travel. These conclusions on trial functions are most significant with the tCount measure, with a pvalue close to zero for the Villarroya-Beltri study, and 0.016 for the Guduric-Fuchs study, while the Park study showed little difference (p-value ¼ 0.122) for tCount between nuclear and

Normalization is a crucial step in the analysis of RNA-seq data and has a strong impact on the detection of differentially expressed genes sought to validate the STIC model. Several normalization strategies have been proposed to correct for between-sample distributional differences in read counts, such as differences in total counts (i.e., sequencing depths), and within-sample gene-specific effects, such as gene length or GC-content effects [40]. Global-scaling normalization adjusts gene-level counts by a single factor per sample, such as the per-sample total read count, or reads per kilobase of exon model per million mapped reads (RPKM), or some housekeeping gene count. Statistical corrections by a quantile per-sample count distribution or other robust summaries obtained by relating each sample to a reference sample (e.g., trimmed mean of M values (TMM) and methods of Anders and Huber [41]). Although there have been efforts to systematically compare normalization methods [42], this important aspect of RNA-seq analysis is still not fully resolved. When data arise from complex experiments as in Section 2 above, involving cell fractionation, low-input RNA or different batches and read lengths, there may be more to correct for than differences in sequencing depth, referred to as unknown nuisance technical variation error. One methodology correction is the addition of spike-in controls within the normalization procedure [43]. Control designs have been successfully employed in microarray normalization, for miRNA and mRNA arrays [44]. Negative controls in the normalization procedure test the assumption that the majority of genes are not differentially expressed between study conditions. This assumption can be violated when a

functions in Table 2 were significant measures for data sets in that study.

cytoplasmic enrichment.

2.4.4. Word count normalization from RNA-seq datasets

Notes: Double-asterisk cells have significance below 0.05, while single-asterisk cells have significance below 0.10 but above 0.05. Cells with "nc" were not calculated from randomized transcriptome.

#### Table 2. t-Tests of case studies with STIC model parameters.

miRNAs, with a significant t-test p-value of 0.023 between nuclear and cytoplasmic groups. We also found nuclear-enriched miRNAs have higher rcWord values compared to cytoplasmic miRNA (p-value ¼ 0.021 in Table 2), suggesting those transcripts have greater potential to interact with other transcriptome RNAs and hence may have lower than expected diffusion coefficients. The other evaluated measures did not show significance between groups.

#### 2.4.3. Model 1 validation with additional RNA studies

Huang et al. [37] study utilized RNA-seq with exosomes from human plasma. We found that the top 100 abundant miRNAs in exosomes had tCount (mean 4.80) and tWord (mean 6.72) measures compared to those lower 100 with low "rcmm" reads (mean 4.64 and 7.41, respectively). In support of the STIC model, exosome transcripts have more similarity to the simple model transcriptome. Exosome abundant miRNAs had negative RCM (mean �0.87) and RCC (mean �0.27) measures compared to those with low rcmm reads (mean 1.37 and 0.55, respectively). The most significant trial function was RCM (p-value ¼ 0.002) followed by rcWord (pvalue ¼ 0.02) measure. From these data, we find similarity that exosome transcripts have less reverse complementarity to the simple Model-1 transcriptome. Again, these results are supported by Cheng et al. [38] study of exosomes in human blood. From 50 most abundant miRNAs in exosome samples labeled "Plasma UC Exo," we find mean tCount and tWord values of 4.56 and 6.00 compared to 5.58 and 8.80, respectively, for low abundance transcripts. This set of exosome miRNAs had RCM and RCC values of �1.54 and 3.8 compared to 0.36 and 5.8 for low abundance transcripts, again supporting the STIC model. Several of the trial functions in Table 2 were significant measures for data sets in that study.

Pursuing in-depth understanding of the mechanism supporting selective exportation of miRNAs to extracellular vesicles (EVs), Guduric-Fuchs [39] employed next generation sequencing to discriminate global expression patterns of small RNAs in HEK293T cells and the EVs that they released. Enrichment of overexpressed miRNA in EVs was measured by RTqPCR in HEK293T cells, mesenchymal stem cells, macrophages, and immune cells. We sorted data from Guduric-Fuchs by EV/cell ratio, then compared the top 10 (exosome-enriched) and bottom (cytoplasmic enriched) miRNAs by evaluating the measures listed in Table 2. Only trial functions Z-RCC and RCC-RAN were significant from this dataset. Overall from using EV/cell in various measures examined across the studies, tWord and tCount (from Ref. [17]), along with their difference (tW–tC), have values that progress from lower for nuclear, higher for cytoplasmic, and highest for exosomal miRNAs. Therefore, we consider under transitivity, EXO > CL > NUC for these transcriptome measures of similarity. This supports the notion that miRNAs with sequence similarity to the overall transcriptome can random-walk furthest from their points of transcription if the secretion mechanism requires a great distance to travel. These conclusions on trial functions are most significant with the tCount measure, with a pvalue close to zero for the Villarroya-Beltri study, and 0.016 for the Guduric-Fuchs study, while the Park study showed little difference (p-value ¼ 0.122) for tCount between nuclear and cytoplasmic enrichment.

#### 2.4.4. Word count normalization from RNA-seq datasets

miRNAs, with a significant t-test p-value of 0.023 between nuclear and cytoplasmic groups. We also found nuclear-enriched miRNAs have higher rcWord values compared to cytoplasmic miRNA (p-value ¼ 0.021 in Table 2), suggesting those transcripts have greater potential to interact with other transcriptome RNAs and hence may have lower than expected diffusion

Notes: Double-asterisk cells have significance below 0.05, while single-asterisk cells have significance below 0.10 but above

Source experiment N tWord rcWord RCM RCC Z-RCC Z-RCM RCC-Ran (RCC-Ran)/Len

N/C > 0.471 nuclear 43 0.024\*\* 0.021\*\* 0.62 0.76 0.77 0.31 0.41 0.42

Top-low rcmm 100 0.522 0.02\*\* 0.002 \*\* 0.042 \*\* 0.83 0.16 0.078\* 0.072\*

Top-low 50 0.128 0.002 \*\* 0.035 \*\* 0.002 \*\* 0.062\* 0.25 0.132 nc

Ratio EV/cell top-low 10 0.093\* 0.39 0.3 0.075\* 0.046\*\* 0.178 0.03\*\* nc EV RPMM top-low 10 0.79 0.973 0.736 0.96 0.268 0.816 0.306 nc

Perinuclear-cell 6 0.62 0.76 0.24 0.14 0.15 0.18 0.076\* 0.095\*

mRNA half-life 15 0.017\*\* 0.025\*\* 0.86 nc nc 0.44 nc nc

\*\* 0.08\* 0.023\*\* 0.029\*\* 0.603 0.04\*\* 0.038\*\*

\*\* 0.206 0.008 \*\* 0.033\*\* 0.503 0.032\*\* 0.028\*\*

\*\* 2�10�<sup>5</sup>

90 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

\*\* 1�10�<sup>5</sup>

Huang et al. [37] study utilized RNA-seq with exosomes from human plasma. We found that the top 100 abundant miRNAs in exosomes had tCount (mean 4.80) and tWord (mean 6.72) measures compared to those lower 100 with low "rcmm" reads (mean 4.64 and 7.41, respectively). In support of the STIC model, exosome transcripts have more similarity to the simple model transcriptome. Exosome abundant miRNAs had negative RCM (mean �0.87) and RCC (mean �0.27) measures compared to those with low rcmm reads (mean 1.37 and 0.55, respectively). The most significant trial function was RCM (p-value ¼ 0.002) followed by rcWord (pvalue ¼ 0.02) measure. From these data, we find similarity that exosome transcripts have less reverse complementarity to the simple Model-1 transcriptome. Again, these results are supported by Cheng et al. [38] study of exosomes in human blood. From 50 most abundant miRNAs in exosome samples labeled "Plasma UC Exo," we find mean tCount and tWord

coefficients. The other evaluated measures did not show significance between groups.

2.4.3. Model 1 validation with additional RNA studies

0.05. Cells with "nc" were not calculated from randomized transcriptome.

Table 2. t-Tests of case studies with STIC model parameters.

Villarroya-Beltri

Park paper

Huang paper

Cheng paper

Chen paper

Friedel

Guduric-Fuchs paper

EXO-CL resting 75 4�10�<sup>7</sup>

EXO-CL activated 67 4�10�<sup>7</sup>

Normalization is a crucial step in the analysis of RNA-seq data and has a strong impact on the detection of differentially expressed genes sought to validate the STIC model. Several normalization strategies have been proposed to correct for between-sample distributional differences in read counts, such as differences in total counts (i.e., sequencing depths), and within-sample gene-specific effects, such as gene length or GC-content effects [40]. Global-scaling normalization adjusts gene-level counts by a single factor per sample, such as the per-sample total read count, or reads per kilobase of exon model per million mapped reads (RPKM), or some housekeeping gene count. Statistical corrections by a quantile per-sample count distribution or other robust summaries obtained by relating each sample to a reference sample (e.g., trimmed mean of M values (TMM) and methods of Anders and Huber [41]). Although there have been efforts to systematically compare normalization methods [42], this important aspect of RNA-seq analysis is still not fully resolved. When data arise from complex experiments as in Section 2 above, involving cell fractionation, low-input RNA or different batches and read lengths, there may be more to correct for than differences in sequencing depth, referred to as unknown nuisance technical variation error. One methodology correction is the addition of spike-in controls within the normalization procedure [43]. Control designs have been successfully employed in microarray normalization, for miRNA and mRNA arrays [44]. Negative controls in the normalization procedure test the assumption that the majority of genes are not differentially expressed between study conditions. This assumption can be violated when a global shift in expression occurs between conditions, such that control-based normalization may be necessary for technical variation, and a global mean read for global differences in RNA levels.

They identified these readout sequences with complementary FISH probes via rounds of hybridization and imaging; each round using a different readout probe. To increase the sig-

Models of RNA Interaction from Experimental Datasets: Framework of Resilience

http://dx.doi.org/10.5772/intechopen.69452

93

Brownian effects are ubiquitous in numerous examples of soft condensed matter physics [20] in which the system can be modeled as a set of interacting degrees of freedom in contact with a heat reservoir. Brownian motion plays an important role when one infers macroscopic behaviors from mesoscopic levels of description, frequently a desire in the study of complex systems. Dynamics at the mesoscopic level is governed by a set of Langevin processes or equivalently by the corresponding N-particle Fokker–Planck equation. This scheme applies nonequilibrium thermodynamics to derive the kinetic equations describing the evolution of an N-particle probability distribution function [20]. One then considers a system of N Brownian particles diluted in a solvent, which acts as a thermal reservoir. Particle velocities are then modeled as internal thermodynamic variables and permit an analysis in the phase space of the Brownian particles. A local equilibrium hypothesis constrains the phase space level and from it one derives the thermodynamic entropy balance equation. Entropy production accounts for irreversible processes taking place in the phase space, then quantifying fluxes and forces can be done in a similar manner as in the thermodynamics of irreversible processes [20]. A general thermodynamic treatment of systems of N interacting Brownian motion particles as described

4. Resilience as a systems biology measure from transcriptome model

4.1. Formalization of metric for resilience in biological systems using STIC metrics

scales, consistent with the established multiscale nature of biological systems.

Insight into structural determinants of robustness and resilience can guide the understanding of systems that go through transitions. Systems engineering research has developed methodologies to measure the functionality and complexity of engineered systems for designing and assessing system resilience. While system functions, resilience, functionality, and complexity are widely used concepts in systems engineering, there is significant diversity in definitions and no unified approach to measurement in the systems biology area [51]. One method for measuring impacts on functionality in dynamic engineered systems is based on changes in kinetic energy [52]. This metric can be applied at particular levels of abstraction and system

Development of a resilience measure from transcriptome RNAs could improve basic knowledge of the transcriptome and responses to stress. Transcriptome size and overall variation have been documented across cell cycle stages, tissue types, developmental stages, diurnal cycles, sexes, and environment [50]. Despite the ubiquity of transcriptome size variation, its potential to introduce systematic bias into expression profiling has been largely overlooked

nal-to-background ratio, each cellular RNA is labeled with 192 encoding probes.

by Fokker-Planck equations is detailed by Savel'ev et al. [16].

and this study uncovers responses of the transcriptome to stress.

3.2. RNA diffusion
