**5. Conclusions**

Term Frequency-Inverse Document Frequency (TF-IDF) weighting, borrowed from natural language processing, was combined with a sliding window to transform the RNA and protein sequences into a data frame of numerical values, and the 232 most contributing TF-IDF values were selected using XGBoost feature importance. Based on these features, we trained a Random Forest classifier on 10,132 samples and tested it on the remaining 2,534 samples. The results in **Table 2** show that the Random Forest outperformed all other predictive models that we trained on this dataset for comparison, such as Logistic Regression and Multinomial Naïve Bayes. The highest AUC for the Random Forest, combined with its high specificity and sensitivity, indicates its ability to correctly predict all classes in large datasets. The Random Forest is computationally expensive, but its significant performance advantage over the other classifiers justifies the training time.
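The sliding-window TF-IDF transformation described above can be sketched as follows. This is a minimal, standard-library illustration, not the authors' implementation: the window width `k = 3`, the example sequences, and the function names are illustrative assumptions, and the paper's exact tokenisation and normalisation choices are not specified here. The subsequent XGBoost feature selection and Random Forest training steps are omitted.

```python
import math
from collections import Counter

def kmers(seq, k=3):
    """Slide a window of width k over the sequence, yielding overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def tfidf_features(sequences, k=3):
    """Turn raw sequences into TF-IDF vectors over their k-mer vocabulary.

    TF is the k-mer count normalised by the number of windows in the
    sequence; IDF is log(N / df), where df is the number of sequences
    containing the k-mer. Parameter choices are illustrative.
    """
    docs = [Counter(kmers(s, k)) for s in sequences]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())           # document frequency of each k-mer
    vocab = sorted(df)
    idf = {t: math.log(n / df[t]) for t in vocab}
    rows = []
    for d in docs:
        total = sum(d.values())
        rows.append([d[t] / total * idf[t] if total else 0.0 for t in vocab])
    return vocab, rows

# Toy RNA sequences (hypothetical data, for illustration only)
vocab, rows = tfidf_features(["ACGUACGU", "ACGGACGG", "UUUUACGU"], k=3)
```

Each row of `rows` is one sequence's numerical feature vector; stacking the rows gives the data frame from which the most informative columns would then be selected.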
