**3. Predictive model: Random Forest**

The prediction of RPIs was done after training and testing the Random Forest among other classifiers. The Random Forest (RF) is a supervised machine learning algorithm built from an ensemble of decision trees, first developed by Tin Kam Ho in 1995 [33, 34], and is used to solve classification and regression problems. The random forest establishes its result from the aggregated predictions of all the decision trees (majority vote for classification, mean for regression). A decision tree consists of a root node, decision nodes, and leaf nodes. The algorithm behind the decision tree divides the training set into branches, which further split into new branches until a leaf node is reached (a leaf node cannot be split into further branches). This sequence of splits uses the Classification And Regression Tree (CART) methodology combined with resampling with replacement [25]. The random forest has multiple parameters that can be optimized, but most of them were kept at their default values. Among the tuned parameters, the Gini criterion, a minimum of two samples required to split a node, and one hundred trees were chosen for better results.
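The configuration described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the chapter's actual pipeline: the feature matrix is synthetic stand-in data, and only the parameters named in the text are set explicitly; everything else keeps its scikit-learn default.

```python
# Sketch of the Random Forest set-up described in the text (scikit-learn).
# The data here is a synthetic placeholder for the TF-value feature matrix.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for TF-value vectors and interaction labels (0/1).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Parameters named in the text; all other parameters keep their defaults.
clf = RandomForestClassifier(
    n_estimators=100,      # one hundred trees
    criterion="gini",      # Gini criterion for split quality
    min_samples_split=2,   # minimum samples required to split a node
    random_state=0,
)
clf.fit(X, y)
print(len(clf.estimators_))  # number of fitted trees in the forest
```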

#### **3.1 Classification trees (Forest)**

A decision tree is a way of representing knowledge obtained in the inductive learning process. The feature space is split using a set of conditions, and the resulting structure is the tree. Assuming we have $n$ TF-value vectors $\{X_i\}_{i=1}^{n}$ with outcomes $y_i$, our dataset can be presented as follows:

*Identification of RNA Oligonucleotide and Protein Interactions Using Term Frequency… DOI: http://dx.doi.org/10.5772/intechopen.108819*

$$\text{Dataset} = \left\{ (X_1, y_1), (X_2, y_2), \dots, (X_n, y_n) \right\} \tag{4}$$

Each TF-value vector is $X_k = (X_{k1}, X_{k2}, \dots, X_{kd})$, where $d$ is the number of TF-values from the RNA and the RBP together.

The decision tree is built by a binary process in which, at each node, a decision is made based on whether the TF-value $X_i$ is below a threshold $t$ or not. This threshold depends on the node at which the decision is made. The top node contains all examples $(X_i, y_i)$, and these examples are subdivided into children nodes according to the possible classification at that node. The subdivision of examples continues until every node at the bottom contains examples of one class only.
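The thresholded splitting described above can be illustrated on a toy example. This sketch is an assumed illustration, not the chapter's data: a single TF-value feature and six labeled pairs, where one threshold split is enough to reach pure leaves.

```python
# Illustration (toy data) of binary splitting on "TF-value < threshold":
# the tree keeps splitting until every leaf contains one class only.
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny placeholder dataset: one TF-value feature, two classes.
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Print the learned threshold rules in plain text.
print(export_text(tree, feature_names=["tf_value"]))
```

Because the two classes are perfectly separable here, the tree needs only a single decision node; real TF-value data would of course produce much deeper trees.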

#### **3.2 The Gini criterion**

We used the Random Forest implementation of the Python scikit-learn library, in which the splitting rule is set by the parameter 'criterion'. This parameter is the function used to measure the quality of a split, and it allows users to choose between 'gini' and 'entropy'. We preferred the Gini criterion because entropy is computationally more expensive, since it makes use of logarithms; consequently, the calculation of the Gini Index is faster. The Gini criterion measures the diversity at each tree node when the TF-value and optimal threshold are chosen. Assuming the set of all examples is $S$ and the set of examples at node $j$ is $S_j$, then $S$ is partitioned into the children node sets, i.e.:

$$S = \bigcup_{j=1}^{l} S_j, \quad \text{where } l \text{ is the number of children nodes}$$

Each sample $S_j$ is partitioned into two classes, $C_1$ = interacting pair and $C_2$ = non-interacting pair. The proportion of a sample $S_j$ in the set of all examples, and the proportion of $S_j$ belonging to a class $C_i$, are respectively defined as follows:

$$P(S_j) = \frac{|S_j|}{|S|}$$

$$P(C_i \mid S_j) = \frac{|S_j \cap C_i|}{|S_j|} \tag{5}$$

The Gini criterion is the variation $g(S_j)$ in the set $S_j$, defined as follows:

$$g(S_j) = \sum_{i=1}^{2} P(C_i \mid S_j)\left(1 - P(C_i \mid S_j)\right) \tag{6}$$

The variation $g(S_j)$ reaches its maximum when the set $S_j$ is equally divided between the two classes, and its minimum when the set $S_j$ contains only one of the two classes. The variation of the full subdivision into the $S_j$ (known as the Gini Index) is defined as the sum of the variations $g(S_j)$ weighted by their respective proportions in the set of all examples:

$$\text{Gini Index} = P(S_1)\,g(S_1) + P(S_2)\,g(S_2) + \dots + P(S_l)\,g(S_l) \tag{7}$$
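Eqs. (5)-(7) translate directly into a few lines of Python. The sketch below is an assumed illustration: each node is represented simply as a list of 0/1 labels, and the functions compute the per-node variation $g(S_j)$ and the weighted Gini Index over the children of a split.

```python
# Direct translation (sketch) of Eqs. (5)-(7): per-node Gini variation
# g(S_j) and the weighted Gini Index over the children nodes of a split.
def g(node):
    """Gini variation of one node: sum over the two classes of p_i*(1 - p_i)."""
    n = len(node)
    p1 = sum(node) / n   # proportion of class 1 (interacting pairs)
    p0 = 1 - p1          # proportion of class 2 (non-interacting pairs)
    return p1 * (1 - p1) + p0 * (1 - p0)

def gini_index(children):
    """Weighted sum P(S_1)g(S_1) + ... + P(S_l)g(S_l) over children nodes."""
    total = sum(len(c) for c in children)
    return sum(len(c) / total * g(c) for c in children)

# A pure node gives the minimum variation (0.0); an evenly mixed node
# gives the maximum (0.5); a perfect split gives a Gini Index of 0.0.
print(g([1, 1, 1, 1]))                     # 0.0
print(g([0, 1, 0, 1]))                     # 0.5
print(gini_index([[0, 0, 0], [1, 1, 1]]))  # 0.0
```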

#### **3.3 The random vector**

A random vector is defined as an array $X$ of random variables defined on the same probability space. In this study, the array consists of the TF-value vectors:

$$X = (X_1, X_2, \dots, X_d), \quad \text{where the } X_i \text{ are column vectors} \tag{8}$$

The random vector $y = (y_1, y_2, \dots, y_n)$ with $y_i \in \{0, 1\}$ is the classification of the examples, where 1 represents a protein-RNA interaction (RPI) while 0 represents a non-interaction. The model vector $(X, y)$ is defined on the same probability space as the random vector $X$.

The goal of this predictive model is to build a classifier which predicts the random vector $y$ (classes) from the random vector $X$ (TF-values), based on the examples in the dataset from Section 3.1. This classifier is based on a family of classification trees, and the ensemble of those trees is called a Random Forest.

#### **3.4 Ten-fold cross-validation method**

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter, k, that refers to the number of groups a given dataset is split into, and it is called k-fold cross-validation (k = 10 in this study). Eighty percent of the dataset was used for the 10-fold cross-validation; it was randomly shuffled and split into 10 groups. Among the 10 groups, one group was held out as validation data to test the model, and the remaining 9 groups were used as training data. Importantly, each observation is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is used in the hold-out set exactly once and used to train the model 9 times. The 10 results were then averaged to produce a single estimate by taking the mean of the model scores. The metrics used to evaluate the model performance are Accuracy, Specificity, Sensitivity, and MCC (Matthews Correlation Coefficient):

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

$$\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}$$

$$\text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

$$\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}} \tag{9}$$

where TP, FP, TN, and FN stand for True Positive, False Positive, True Negative, and False Negative, respectively.
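The 10-fold procedure and the four metrics can be sketched as follows. This is a minimal illustration on synthetic stand-in data, not the chapter's pipeline: each fold trains a Random Forest, the confusion matrix of the held-out fold gives TP, FP, TN, and FN, and the per-fold scores are averaged at the end.

```python
# Sketch of 10-fold cross-validation with the four metrics from the text,
# on synthetic placeholder data (not the chapter's TF-value dataset).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

accs, specs, sens, mccs = [], [], [], []
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X, y):
    clf = RandomForestClassifier(n_estimators=100, criterion="gini",
                                 random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[val_idx])
    # Confusion-matrix counts for this fold's held-out group.
    tn, fp, fn, tp = confusion_matrix(y[val_idx], pred).ravel()
    accs.append((tp + tn) / (tp + tn + fp + fn))
    specs.append(tn / (tn + fp))
    sens.append(tp / (tp + fn))
    mccs.append(matthews_corrcoef(y[val_idx], pred))

# Single estimate per metric: the mean over the 10 folds.
print(np.mean(accs), np.mean(specs), np.mean(sens), np.mean(mccs))
```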

#### **3.5 Independent test**

The remaining 20% of the dataset was used to test the classifier's performance on unseen data. This test dataset was completely independent of the data sample used in the 10-fold cross-validation. The goal was to verify that the Random Forest, with the parameters that achieved the best performance, generalizes to new data.
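The 80/20 hold-out described above can be sketched as follows. This is an assumed set-up on synthetic stand-in data: 20% of the examples are set aside before any training and never touched during cross-validation, then scored once at the end.

```python
# Sketch of the 80/20 hold-out for the independent test, on synthetic
# placeholder data (the chapter's real TF-value dataset is not used here).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 20% of examples kept completely independent of training and CV.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, criterion="gini",
                             random_state=0)
clf.fit(X_train, y_train)
print(len(X_test), clf.score(X_test, y_test))  # test-set size and accuracy
```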
