**2. Material and methodology**

According to Hongchu Wang and Pengfei Wu (2017) [8], 1973 RPI complexes are available in the Protein Data Bank (PDB), and these contain over 15,000 protein chains and more than 3000 RNA chains. By contrast, research using high-throughput sequencing techniques (such as RNA-Seq) had identified at least 30,000 lncRNAs by 2013.

In this study we combined three datasets: RPI2241, which contains 2241 RNA–protein pairs extracted from PRIDB [13] and was reconstructed by Wang in 2013; RPI488, a non-redundant lncRPI dataset based on structural complexes, which consists of 488 lncRNA–protein pairs (245 non-interacting and 243 interacting) from Pan et al. [23, 24]; and RPI12737, which contains 12,737 experimentally validated RNA–protein pairs extracted from the NPInter v2.0 database [25]. This dataset contains as many non-interacting RNA–protein pairs (negative examples) as interacting pairs.

After combining the datasets, we cleaned the data by removing every pair in which the protein sequence contained a non-amino-acid character or the RNA sequence contained a non-nucleotide character. Large differences in sequence length could increase the sparsity of the TF-IDF data frame and affect the performance of our predictive model. The exploratory data analysis gave more detail on the dataset (see **Table 1**). The first quartile of the protein lengths was 252 and the third quartile was 614, which means that the protein lengths of the middle 50% of our combined dataset lie between 252 and 614. After all considerations, we decided to use this 50% of the dataset, containing 10,715 clean pairs, to train and test the predictive model.
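The cleaning and length-filtering steps described above can be illustrated with a minimal Python sketch. It is not the code used in this study. The alphabets (the 20 standard amino acids and the bases A, C, G, U), the `(rna, protein, label)` tuple layout, the inclusive quartile bounds, and the function names `is_clean`, `clean_pairs`, and `keep_interquartile` are all assumptions made for illustration.

```python
from statistics import quantiles

# Assumed alphabets: the 20 standard amino acids and the four RNA bases.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
NUCLEOTIDES = set("ACGU")


def is_clean(sequence, alphabet):
    """Return True if the sequence is non-empty and uses only allowed letters."""
    return bool(sequence) and set(sequence.upper()) <= alphabet


def clean_pairs(pairs):
    """Drop every (rna, protein, label) pair containing an invalid character."""
    return [
        (rna, protein, label)
        for rna, protein, label in pairs
        if is_clean(rna, NUCLEOTIDES) and is_clean(protein, AMINO_ACIDS)
    ]


def keep_interquartile(pairs):
    """Keep pairs whose protein length lies between Q1 and Q3 (bounds included)."""
    lengths = [len(protein) for _, protein, _ in pairs]
    # method="inclusive" interpolates linearly between the observed lengths.
    q1, _, q3 = quantiles(lengths, n=4, method="inclusive")
    return [pair for pair in pairs if q1 <= len(pair[1]) <= q3]


# Hypothetical toy data; the sequences are far shorter than real ones.
toy_pairs = [
    ("AUG", "MK", 1),
    ("AUGC", "MKV", 0),
    ("AUGCN", "MKVL", 1),   # dropped: 'N' is not in the assumed RNA alphabet
    ("ACGU", "MKVLA", 1),
    ("GGCU", "MKVLAA", 0),
    ("AUG", "MKXB1", 0),    # dropped: 'X', 'B' and '1' are not amino acids
]
cleaned = clean_pairs(toy_pairs)
selected = keep_interquartile(cleaned)
```

Applied to the combined dataset, a filter of this kind would retain the pairs whose protein length falls between the reported quartiles of 252 and 614.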

#### **2.1 Transformation of the sequence into text format**

Biological sequences are strings of consecutive letters, written without spaces and varying in length, whose composition is relevant to their biochemical structure and biological function. Bioinformaticians use sequence alignment to arrange the primary structures of proteins and identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

