*Identification of RNA Oligonucleotide and Protein Interactions Using Term Frequency… DOI: http://dx.doi.org/10.5772/intechopen.108819*

observed that there was little difference in model performance between sliding window sizes of 3, 4, and 5, and that performance began to decrease at a window size of 6. We therefore chose a window size of 3 because it gives the best results in the least time. After applying TF-IDF to the dataset, we obtained a data frame of 10,715 rows and 7,461 attributes. The Random Forest applied to this dataset performed well, with room for improvement, because the 7,461 features do not all contribute equally to the prediction. We applied the XGBoost algorithm to select the best features. At the best threshold, 232 features each contributed at least 0.2% to the prediction. The performances of the different classifiers are summarized in **Table 2**.
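The two steps above — extracting overlapping windows of size 3 from each sequence and weighting them with TF-IDF — can be sketched in plain Python. This is a minimal illustration of the technique, not the chapter's exact pipeline; the toy residue strings and function names are assumptions, and in the real dataset the distinct k-mers across the corpus are what produce the 7,461 attributes.

```python
import math
from collections import Counter

def kmers(seq, k=3):
    """Slide a window of size k over the sequence with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def tfidf(corpus, k=3):
    """Return one {kmer: tf-idf weight} dict per sequence."""
    docs = [Counter(kmers(seq, k)) for seq in corpus]
    n = len(docs)
    df = Counter()                    # document frequency of each k-mer
    for doc in docs:
        df.update(doc.keys())
    out = []
    for doc in docs:
        total = sum(doc.values())
        # term frequency times inverse document frequency
        out.append({w: (c / total) * math.log(n / df[w])
                    for w, c in doc.items()})
    return out

# toy corpus of residue strings (illustrative only)
vecs = tfidf(["MKVLA", "MKKLA", "AVLMK"], k=3)
```

A k-mer shared by every sequence gets an IDF of log(1) = 0 and is thus down-weighted, which is exactly why TF-IDF highlights subsequences that discriminate between binding and non-binding pairs.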

The receiver operating characteristic (ROC) curves of the three classifiers confirm our preference for the Random Forest over the other classifiers. The area under the curve (AUC) is 0.98, 0.95, and 0.93 for Random Forest, Logistic Regression, and Multinomial Naïve Bayes, respectively (**Figure 3**). Sometimes one algorithm can outperform the others on one metric and lose on others. In this study, however, the


#### **Table 2.**

*Comparative summary of three different predictive models.*

#### **Figure 3.**

*Illustration of classification trees with three nodes. The thresholds t_i depend on each node and are learned during the training process.*

Random Forest outperformed the other two classifiers on all metrics, and most importantly on the Matthews Correlation Coefficient (MCC), because it is an ensemble-based algorithm that uses resampling with replacement to reduce variance. This method makes the Random Forest slow to train, but it is worth it because this tree-based learning algorithm works on large datasets, accepts both quantitative and qualitative input variables, is largely immune to redundant or highly correlated variables (which can lead to overfitting in other learning algorithms), and has few parameters to tune (**Figures 4** and **5**).
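The variance-reduction effect of resampling with replacement (bagging) can be demonstrated with a stripped-down sketch. This is not the Random Forest itself: each "weak learner" here merely predicts the sample mean, which is an assumption made to keep the example short, but the mechanism — averaging many bootstrap estimates tightens the spread of the final estimate — is the same one the Random Forest exploits.

```python
import random
import statistics

random.seed(0)

def bootstrap_sample(data):
    """Resample with replacement, same size as the original sample."""
    return [random.choice(data) for _ in data]

def bagged_estimate(data, n_estimators=25):
    """Average the estimates of n_estimators bootstrap replicates
    (each 'weak learner' just predicts its bootstrap sample's mean)."""
    return statistics.mean(
        statistics.mean(bootstrap_sample(data)) for _ in range(n_estimators)
    )

data = [random.gauss(0.0, 1.0) for _ in range(200)]

# spread of one bootstrap estimate vs. the bagged estimate over 50 trials
singles = [statistics.mean(bootstrap_sample(data)) for _ in range(50)]
bagged = [bagged_estimate(data) for _ in range(50)]
# statistics.stdev(bagged) comes out markedly smaller than
# statistics.stdev(singles): averaging reduces variance
```

The same trade-off appears here in miniature: the bagged estimate is 25 times more work to compute, just as the Random Forest is slower to train than a single tree, but its variance is much lower.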

#### **Figure 4.**

*A systematic review of imbalanced data challenges and dimensionality reduction.*

#### **Figure 5.**

*ROC curves and AUC values for the three classifiers, showing that the Random Forest curve lies above those of the other classifiers.*
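The AUC values reported above (0.98, 0.95, 0.93) have a direct probabilistic reading: the AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal sketch of this rank-based computation (the Mann-Whitney U form, not the chapter's own evaluation code) is:

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic: the fraction of
    positive/negative pairs where the positive outscores the
    negative (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# a perfect ranking gives AUC = 1.0; a random one hovers near 0.5
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # → 1.0
```

This interpretation is why an AUC of 0.98 for the Random Forest is a strong result: for 98% of interacting/non-interacting pairs, the model ranks the interacting one higher.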
