**1. Introduction**

Protein-RNA interactions are involved in many regulatory processes, so finding the binding sites of RNA-binding proteins (RBPs) is an important research goal. Studies have shown that RBPs bind to RNA molecules by recognizing both sequences (sequence motifs) and secondary structure contexts (structure motifs) [1–4]. Several predictive methods for RBPs have been based on sequence-derived features such as amino acid composition, dipeptide composition, the composition-transition-distribution of seven physicochemical properties, evolutionary information in the form of position-specific scoring matrices, and functional domain composition [5–7]. Although progress has been made in predictive methods for RBPs, insufficient attention has been paid to predictive methods for RNA-protein interactions (RPIs). The history of this field is brief, and few computational tools exist because available data are scarce [8].

Machine learning (ML) methods, which have become standard tools in many fields of science and engineering, are computationally efficient methods that draw on computer science, artificial intelligence, computational statistics, and information theory to fit high-dimensional models to large amounts of data. ML methods read in data points generated within some application domain; each data point is characterized by two kinds of properties: features (predictor variables) and labels (predicted variables). An ML algorithm aims either to learn to predict the label of a data point based solely on its features or, if the data points are neither classified nor labeled, to identify patterns among them. Applying ML algorithms to labeled data points is called supervised learning, in contrast to unsupervised learning, which does not require knowing the label of any data point. The dataset used in this research was tagged with known labels (binding pairs are labeled as positive and non-binding pairs as negative). While the principle behind supervised ML sounds simple, the challenge in modern ML applications is the non-linearity and complexity of the data. This research focuses on three supervised learning algorithms: Logistic Regression, Random Forest, and Multinomial Naïve Bayesian.

Logistic regression is a binary classification method that can be applied to data points with feature vector $x \in \mathbb{R}^n$ and binary label $y$. The label takes values from a label space that contains two different values (in most cases $y \in \{0, 1\}$). The linear map $h(x) = w^T x$, with $w \in \mathbb{R}^n$, can take an arbitrary real value, and the label is predicted by comparing $h(x)$ with a given threshold: a data point with features $x$ is classified as $y = 1$ if $h(x) \geq 0$ and as $y = 0$ if $h(x) < 0$. The multinomial naïve Bayesian is a simple but important probabilistic model, defined by a function $h$ from the feature space $X$ to the label space $Y$ ($h : X \to Y$) such that the predicted value $h(x)$, $x \in X$, agrees well enough with the true value $y \in Y$. A random forest is an ensemble of decision trees, each of which is a flowchart-like description of a function from the feature space to the label space that maps features to their respective labels. While a random forest can be applied to an arbitrary feature space, we will discuss it for a specific space later in this paper.
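The thresholding rule for logistic regression can be illustrated with a minimal Python sketch. The weights and feature vectors below are hypothetical values chosen only for illustration, not parameters from this study, and the helper names (`linear_score`, `sigmoid`, `predict_label`) are assumptions introduced here. The sigmoid is included only to show that the threshold $h(x) \geq 0$ corresponds to a predicted probability of at least 0.5.

```python
import math


def linear_score(w, x):
    """Return h(x) = w^T x for a weight vector w and a feature vector x."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    """Map a real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(w, x, threshold=0.0):
    """Classify as y = 1 if h(x) >= threshold, otherwise y = 0."""
    return 1 if linear_score(w, x) >= threshold else 0


# Hypothetical weights and feature vectors (illustration only).
w = [0.5, -1.0, 2.0]
x_pos = [1.0, 0.5, 0.25]  # h(x) = 0.5 - 0.5 + 0.5 = 0.5
x_neg = [1.0, 2.0, 0.5]   # h(x) = 0.5 - 2.0 + 1.0 = -0.5

label_pos = predict_label(w, x_pos)
label_neg = predict_label(w, x_neg)
```

In practice the weight vector is learned from the labeled training pairs rather than fixed by hand as it is here.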

In 2011, Pancaldi and Bähler [8–10] predicted interactions between RNA-binding proteins and messenger RNAs using two conventional machine learning classifiers, support vector machine (SVM) and random forest (RF), while Bellucci et al. developed an algorithm called catRAPID to facilitate the prediction of 592 RPIs from the Protein Data Bank (PDB). They used the physicochemical properties of sequences as features and found the three most predictive features to be secondary structure propensities, hydrogen bonding, and van der Waals interactions [8, 11]. Two benchmark datasets, RPI369 and RPI2241, were constructed from PRIDB (a database of protein-RNA interfaces) [8, 12, 13], and high prediction accuracies were reported on both using the Conjoint Triad Feature (CTF) and normalized 4-gram frequencies. In 2013, catRAPID omics was introduced as an improved version of catRAPID that uses information on the protein and RNA domains involved in macromolecular recognition [8, 14, 15]. Zhao Hui-Zhan et al. [8, 16] proposed a deep learning model to predict RPIs that uses bi-grams derived from the Position Specific Scoring Matrix (PSSM) to extract protein features and a k-mer approach combined with a stacked auto-encoder to extract RNA features.

In 2015, Suresh et al. [8, 17] integrated sequence information and predicted structure to produce accurate predictions of non-coding RNA-protein pairs on a newly constructed dataset called RPI1807. When the method was tested on the RPI369 and RPI2241 datasets mentioned above, prediction accuracy improved somewhat. In 2017, Liu et al. proposed a semi-supervised method called LPI-NRLMF [18, 19] to predict lncRNA-protein interactions by neighborhood regularized logistic matrix factorization. One year later, Zhao et al. introduced the IRWNRLPI method [20], which integrates random walk and neighborhood regularized logistic matrix factorization for lncRNA-protein interaction prediction, and the LPI-BNPRA method, which uses a bipartite network projection recommendation algorithm to identify lncRNA-protein interactions. These semi-supervised methods, together with the BNPMDA method proposed by Chen et al. [21] in late 2018, achieved high predictive accuracy only on interacting pairs and performed weakly on non-interacting pairs. In 2018, Hu et al. proposed the HLPI Ensemble method [22] for identifying lncRNA-protein interactions in human only; it integrates three common machine learning algorithms: SVM, RF, and Extreme Gradient Boosting (XGB).

*Identification of RNA Oligonucleotide and Protein Interactions Using Term Frequency… DOI: http://dx.doi.org/10.5772/intechopen.108819*

All the machine learning methods discussed above use handcrafted features from proteins. In this study, we propose a new method that uses TF-IDF, a technique borrowed from natural language processing, to extract features from RPI pairs. TF-IDF, which stands for Term Frequency–Inverse Document Frequency, takes a sequence of strings as input and transforms it into a vector of numerical values.
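To make the idea concrete, the following is a minimal pure-Python sketch of TF-IDF applied to sequences. It rests on several assumptions that are not taken from this study: each sequence is tokenized into overlapping k-mers with `k=3`, term frequency is the k-mer count divided by the number of k-mers in the sequence, and the inverse document frequency uses one common smoothed variant, $\ln\frac{1+N}{1+\mathrm{df}} + 1$. The toy sequences and the function names `kmers` and `tfidf_vectors` are hypothetical.

```python
import math
from collections import Counter


def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers, which act as the 'terms'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def tfidf_vectors(sequences, k=3):
    """Return (vocabulary, vectors) with one TF-IDF vector per sequence."""
    docs = [Counter(kmers(s, k)) for s in sequences]
    vocabulary = sorted(set().union(*docs))
    n_docs = len(docs)
    # Document frequency: number of sequences that contain each term.
    df = {t: sum(1 for d in docs if t in d) for t in vocabulary}
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocabulary}
    vectors = []
    for d in docs:
        total = sum(d.values())
        # Term frequency times IDF; terms absent from a sequence get 0.0.
        vectors.append(
            [(d[t] / total) * idf[t] if total else 0.0 for t in vocabulary]
        )
    return vocabulary, vectors


# Toy RNA sequences, invented for illustration only.
sequences = ["AUGGCU", "AUGAUG", "GCUGCU"]
vocabulary, vectors = tfidf_vectors(sequences)
```

With this weighting, a k-mer that occurs often within one sequence but in few other sequences receives a larger weight than a k-mer shared by most sequences.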
