**2.2 TF-IDF for feature engineering**

Natural language processing offers several approaches for transforming a sequence of words into numerical values, such as bag of words, word embeddings, and term frequency-inverse document frequency (TF-IDF). TF-IDF measures the frequency of a term in a sequence, which depends strongly on the length of the sequence; the purpose of the method is to vectorize sequences [26–30]. To handle sequences of different lengths without a complicated alignment, the TF-IDF method builds its vocabulary from all terms appearing in the dataset, so that every sequence is mapped to a vector of the same length, with two extreme cases: the TF value of a term is zero if the term does not appear in the sequence, and 1 if all terms in the sequence are identical. The term frequency (TF) measures how many times a term is present in a sequence, while the inverse document frequency (IDF) assigns lower weight to frequent terms and greater weight to infrequent ones [31–33]. TF-IDF is the most widely used term weighting scheme. Yang and Huang [34] used it to calculate term weights according to the location and length of the keyword, Tian Xia and Yanmei Chai [35] implemented it by calculating distributions based on local and global term weighting to improve the efficiency of information retrieval (IR) and text classification (TC) systems, and many other researchers have used TF-IDF for feature engineering [36–38] to solve classification problems with reasonable time, effort, and resources.

**Figure 1.** *Illustration of a sequence to text format using a sliding window of different sizes.*

*Identification of RNA Oligonucleotide and Protein Interactions Using Term Frequency… DOI: http://dx.doi.org/10.5772/intechopen.108819*
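The sequence-to-text conversion illustrated in Figure 1 can be sketched as follows. This is a minimal, hypothetical example (the function name and the sample RNA string are illustrative, not taken from the chapter): a sliding window of fixed size turns a biological sequence into overlapping substrings that play the role of "words" for TF-IDF.

```python
# Hypothetical sketch of the sliding-window conversion in Figure 1:
# each window of `window_size` characters becomes one "term" (word).
def sliding_window_terms(sequence, window_size):
    """Return the overlapping substrings of length `window_size`."""
    return [sequence[i:i + window_size]
            for i in range(len(sequence) - window_size + 1)]

rna = "AUGGCUA"  # illustrative sequence, not from the chapter
print(sliding_window_terms(rna, 3))
# prints ['AUG', 'UGG', 'GGC', 'GCU', 'CUA']
```

Varying `window_size` (as Figure 1 shows for different sizes) changes the vocabulary and therefore the length of the resulting TF-IDF vectors.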

Assume S is a set of sequences, S = {s : s is a sequence}, and T is a set of terms, T = {t : t is a term}.

TF is then the function defined as follows:

$$\mathrm{TF}: T \times S \to [0, 1]:\ (t, s) \mapsto \mathrm{TF}(t, s) = \frac{\text{number of appearances of } t \text{ in } s}{\text{number of terms in } s} \tag{1}$$
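Eq. (1) can be sketched directly, treating a sequence as a list of terms (a hypothetical helper, not code from the chapter):

```python
# Sketch of Eq. (1): the term frequency of t in sequence s,
# where s is represented as a list of terms.
def tf(t, s):
    """Share of terms in s that are equal to t; 0 <= tf <= 1."""
    return s.count(t) / len(s)

s = ["AUG", "UGG", "AUG", "GCU"]  # illustrative sequence of terms
print(tf("AUG", s))  # 2 occurrences out of 4 terms -> 0.5
```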

where t is a given term in a sequence s. The IDF function, a normalization function that measures how informative a term is across the dataset, is defined as follows:

$$\mathrm{IDF}: \{S\_t : t \in T\} \to \mathbb{R}:\ S\_t \mapsto \mathrm{IDF}(S\_t) = \frac{N}{s\_t} \tag{2}$$

where $S\_t$ is the set of all sequences containing the term t, N is the number of sequences in the dataset, and $s\_t = |S\_t|$. The TF-IDF is then the product of TF(t, s) and IDF($S\_t$):

$$\mathrm{TFIDF}(t, s) = \mathrm{TF}(t, s) \times \mathrm{IDF}(S\_t) = \frac{N\_t^s \times N}{n\_s \times s\_t} \tag{3}$$

where $N\_t^s$ is the number of appearances of term t in sequence s and $n\_s$ is the number of terms in s (**Figure 2**).
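Putting Eqs. (1)–(3) together gives the following sketch, which follows the definitions above exactly (in particular, the IDF of Eq. (2) is a plain ratio with no logarithm); the function name and the toy dataset are illustrative assumptions:

```python
# Sketch of Eqs. (1)-(3) as defined above: TF is the share of terms in s
# equal to t, and IDF is N divided by the number of sequences containing t.
def tfidf(t, s, dataset):
    n_t_s = s.count(t)      # N_t^s: appearances of t in s
    n_s = len(s)            # n_s: number of terms in s
    big_n = len(dataset)    # N: number of sequences in the dataset
    s_t = sum(1 for seq in dataset if t in seq)  # s_t = |S_t|
    if s_t == 0:
        return 0.0          # t appears nowhere in the dataset
    return (n_t_s / n_s) * (big_n / s_t)

# Illustrative dataset of three sequences, each a list of terms
dataset = [["AUG", "GCU", "AUG"], ["GCU", "CUA"], ["AUG", "CUA"]]
print(tfidf("AUG", dataset[0], dataset))
# TF = 2/3, IDF = 3/2 -> prints 1.0
```

Computing this value for every term of the vocabulary yields the fixed-length vector used as the feature representation of a sequence.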

**Figure 2.** *The term frequency-inverse document frequency flowchart.*
