**2. Use of machine learning in biology and health domains**

As pointed out previously, machine learning algorithms have been successfully applied in many areas of biology and health-related research, including the identification of previously unknown processes in the genome, identification and understanding of several differentially expressed genes, binding specificities, and alternative splicing effects on cell processes, gene-gene and gene-environment interactions, disease-causing mutations, genetic determinants of diseases, pathway analysis, network and co-expression analysis, prediction of new drug-targets and response to treatment, etc. Here, we provide some illustrations of the use of supervised classification machine learning algorithms such as regression, SVM, ANN, and RF in some specific genomic applications, including predicting sequence specificities, analyzing gene expression profiles, identifying gene-gene and proteinprotein interactions, and elucidating disease-associated variants.

#### **2.1 Predicting sequence specificities of DNA- and RNA-binding proteins**

Sequence specificities of DNA- and RNA-binding proteins are essential for developing models of regulatory processes in biological systems. Alipanahi et al. [16] present the possibility of predicting sequence specificities from experimental data through

*Artificial Intelligence - Applications in Medicine and Biology*

need of machine learning techniques in genomic medicine.

*supervised*, *unsupervised* and *reinforcement* learning, described below:

the extraction of appropriate features for application in a novel event or situation to support decision-making by mapping a given system to an input-output transformation task as shown in **Figure 1**. Emerging trends in (deep) machine learning algorithms have made possible the identification and discovery of new patterns and hidden processes in genomic sequences that are essential in the functioning of biological systems. The heterogeneity of diseases, such as cancer, requires primarily the analysis of genomic data in order to improve diagnosis and to design an optimal therapy for an efficient clinical management of the disease. There is an increasing

Machine learning algorithms can be classified into three main categories, namely

**Supervised learning algorithms** build a mapping function, f, from the input variable, X, to the output result, Y, expressed by: Y = f(X). There exist two main groups of supervised learning algorithms, namely classification and regression. Classification model is used to predict the outcome of a given sample with categorical output, for instance, case or sick individuals, labeled 0, and control or healthy individuals, labeled 1. On the other hand, a regression model is used to predict the outcome of a given sample with a real-valued output. Examples of supervised learning algorithms include logistic and linear regression models, Naive Bayes, classification and regression trees (CART) [3], K-nearest neighbor (KNN) [4, 5], support vector machine (SVM) [6], random forest (RF) [7], and artificial neural networks (ANNs) [8]. **Unsupervised learning algorithms** retrieve the underlying structure of the dataset based on input X only, using unlabeled data, that is, input data with no corresponding output. In this type of learning algorithm, we have: *clustering*, *dimensionality reduction*, and *association* models. Clustering consists of grouping samples so that items within the same cluster are more similar to each other than to items from another cluster for a given well-defined metric. Dimensionality reduction uses feature extraction and selection methods to reduce the number of input variables, conveying the most important information and minimizing noise in the dataset. Feature selection extracts a subset of useful variables among the original variables and transforms data from a high- to a low-dimensional space. Finally, association model just computes the probability of the co-occurrence of elements in

*Mapping a system to an input-output transformation task through learning algorithms namely supervised,* 

**4**

**Figure 1.**

*unsupervised, and reinforcement learning.*

deep learning. They developed a software tool (DeepBind) based on deep convolutional neural networks that has the ability to discover new patterns in a sequence without knowledge of the particular location of the pattern within the sequence. DeepBind is also said to have the ability to: learn from very large amounts of sequence data through parallel implementation on a graphics processing unit (GPU); use both microarray and sequencing data; automatically train predictive models without requiring hand-tuning; tolerate mislabeled data and some noise; and generalize well across technologies regardless of existing biases across technologies. Furthermore, DeepBind was also used for identifying RNA- and DNA-binding protein sequence specificities, and showed resilience to outliers and array biases. This suggests that the issue of predicting sequence specificities has been efficiently addressed using the deep learning approach.
