## **2.3 Feature selection**

The TF-IDF method vectorises the RNA and RBP sequences and transforms them into a 2D data frame with 10,715 rows and 7,461 columns. In this situation, dimensionality reduction is required. The XGBoost method, an optimized implementation of gradient-boosted decision trees available as a Python library, was used to estimate the importance of the TF-IDF values. That estimation compares all attributes to each other and ranks them by their contribution to the overall classification. Extreme gradient boosting (XGBoost) is a recent method that can combine weak feature classifiers into one strong classifier [39] thanks to its gradient boosting algorithm, efficiency, flexibility, and portability [40, 41]. XGBoost has been used in the literature to discover and retain the features that most strongly impact the prediction [42–46] and was reported to be ten times less computationally expensive than other popular techniques [42].
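As an illustration of this step, the sketch below selects features by gradient-boosting importance scores. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (the same idea applies with the `xgboost` package), and a small synthetic matrix in place of the actual 10,715 × 7,461 TF-IDF data frame; the sizes and thresholds here are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the TF-IDF matrix (much smaller than 10,715 x 7,461 for speed).
X, y = make_classification(
    n_samples=500, n_features=100, n_informative=10, random_state=0
)

# Fit a gradient-boosted tree ensemble and keep only features whose
# importance is at least the mean importance across all features.
gbm = GradientBoostingClassifier(n_estimators=50, random_state=0)
selector = SelectFromModel(gbm, threshold="mean")
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)  # far fewer columns than the original 100
```

With a real TF-IDF matrix, the same `fit_transform` call would shrink the 7,461 columns down to the subset that drives the classifier's decisions.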

## **2.4 Dataset balancing**

Of the 10,715 samples, 6,333 were labeled as positive (interacting pairs) and the remaining 4,382 as negative. The difference of 1,951 samples between the two classes is not enormous. However, most machine learning algorithms do not perform well on such imbalanced datasets [31, 47, 48]. This is why we first trained our model on the unbalanced dataset and balanced it thereafter. Several techniques exist for balancing datasets [32], but we chose two of them: random oversampling, which uses bootstrapping to increase the size of the minority class, and undersampling, which applies a nearest-neighbors algorithm [48] to "edit" the dataset by removing samples that do not agree "enough" with their neighborhood.
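The bootstrapping side of this procedure can be sketched in a few lines of NumPy. The class sizes below are scaled-down stand-ins for the 6,333/4,382 split, and the feature matrix is random; in practice one would typically reach for a dedicated library such as imbalanced-learn (`RandomOverSampler`, `EditedNearestNeighbours`) rather than hand-rolling this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels mimicking the class imbalance (63 positives vs 44 negatives,
# a scaled-down stand-in for the 6,333 vs 4,382 split in the paper).
y = np.array([1] * 63 + [0] * 44)
X = rng.normal(size=(len(y), 5))  # placeholder features

# Random oversampling: bootstrap the minority class (sampling with
# replacement) until it matches the majority class size.
minority_idx = np.flatnonzero(y == 0)
n_extra = int((y == 1).sum() - minority_idx.size)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_balanced = np.vstack([X, X[extra_idx]])
y_balanced = np.concatenate([y, y[extra_idx]])

print(np.bincount(y_balanced))  # prints [63 63]
```

Undersampling via edited nearest neighbours works in the opposite direction: instead of duplicating minority samples, it removes majority samples whose class label disagrees with those of their nearest neighbours.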
