**1. Introduction**

Advancements in the human deoxyribonucleic acid (DNA) microarray and genome sequencing technology have resulted in an exponential growth of publicly available and accessible biological datasets [1, 2]. These "big data" are being explored to systematically uncover useful signals and gain more insights to advance current knowledge and answer specific biological and health questions. Considering current data delude and relatively increased computing power, it is becoming possible to accurately infer desirable features from such data. This highlights the need for efficient learning algorithms to process these data for knowledge discovery by identifying pertinent patterns related to the comparison and classification of different features in these datasets. These learning algorithms should enable

the extraction of appropriate features for application in a novel event or situation to support decision-making by mapping a given system to an input-output transformation task as shown in **Figure 1**. Emerging trends in (deep) machine learning algorithms have made possible the identification and discovery of new patterns and hidden processes in genomic sequences that are essential in the functioning of biological systems. The heterogeneity of diseases, such as cancer, requires primarily the analysis of genomic data in order to improve diagnosis and to design an optimal therapy for an efficient clinical management of the disease. There is an increasing need of machine learning techniques in genomic medicine.

Machine learning algorithms can be classified into three main categories, namely *supervised*, *unsupervised* and *reinforcement* learning, described below:

**Supervised learning algorithms** build a mapping function, f, from the input variable, X, to the output result, Y, expressed by: Y = f(X). There exist two main groups of supervised learning algorithms, namely classification and regression. Classification model is used to predict the outcome of a given sample with categorical output, for instance, case or sick individuals, labeled 0, and control or healthy individuals, labeled 1. On the other hand, a regression model is used to predict the outcome of a given sample with a real-valued output. Examples of supervised learning algorithms include logistic and linear regression models, Naive Bayes, classification and regression trees (CART) [3], K-nearest neighbor (KNN) [4, 5], support vector machine (SVM) [6], random forest (RF) [7], and artificial neural networks (ANNs) [8].

**Unsupervised learning algorithms** retrieve the underlying structure of the dataset based on input X only, using unlabeled data, that is, input data with no corresponding output. In this type of learning algorithm, we have: *clustering*, *dimensionality reduction*, and *association* models. Clustering consists of grouping samples so that items within the same cluster are more similar to each other than to items from another cluster for a given well-defined metric. Dimensionality reduction uses feature extraction and selection methods to reduce the number of input variables, conveying the most important information and minimizing noise in the dataset. Feature selection extracts a subset of useful variables among the original variables and transforms data from a high- to a low-dimensional space. Finally, association model just computes the probability of the co-occurrence of elements in

#### **Figure 1.**

*Mapping a system to an input-output transformation task through learning algorithms namely supervised, unsupervised, and reinforcement learning.*

**5**

*Designing Data-Driven Learning Algorithms: A Necessity to Ensure Effective Post-Genomic…*

a collection, thus inferring how likely two different elements are to co-occur in a collection. Unsupervised learning includes hierarchical clustering [9], K-means [10],

**Reinforcement learning algorithms** are a class of learning algorithms allowing an agent to decide the optimal next action based on its current state to control an environment or a system [11], by learning behaviors that will maximize the reward or outcome [12, 13]. These algorithms interact with a system, for example human system under a specific condition which may be disease or treatment, to learn the best setting and optimally perform sequential decisions along a timeline [11], generally under uncertainty, based solely on the present state of the system. It follows that this sequential and dynamic decision-making process is assumed to be a Markov decision process [14], in which the present state of the system fully describes the system and is sufficient to optimally predict the best next state. Reinforcement learning algorithms generally use a dynamic programming method following Bellman-based optimality principle [12], which requires optimal substructure for a given optimal option. In clinical research, these algorithms can be effective for longitudinal analyses, including retrospective and prospective studies,

which consist of following a cohort across a specific-time interval [15].

clinical management, and for predicting effective therapeutic strategies.

As pointed out previously, machine learning algorithms have been successfully applied in many areas of biology and health-related research, including the identification of previously unknown processes in the genome, identification and understanding of several differentially expressed genes, binding specificities, and alternative splicing effects on cell processes, gene-gene and gene-environment interactions, disease-causing mutations, genetic determinants of diseases, pathway analysis, network and co-expression analysis, prediction of new drug-targets and response to treatment, etc. Here, we provide some illustrations of the use of supervised classification machine learning algorithms such as regression, SVM, ANN, and RF in some specific genomic applications, including predicting sequence specificities, analyzing gene expression profiles, identifying gene-gene and protein-

**2. Use of machine learning in biology and health domains**

protein interactions, and elucidating disease-associated variants.

**2.1 Predicting sequence specificities of DNA- and RNA-binding proteins**

Sequence specificities of DNA- and RNA-binding proteins are essential for developing models of regulatory processes in biological systems. Alipanahi et al. [16] present the possibility of predicting sequence specificities from experimental data through

Most of these learning algorithms have been extensively used to overcome several issues in genomic medicine, including identification of genetic markers underlying disease risk, novel mechanisms for disease prevention, control, diagnosis and therapy, building predictive disease models, predicting treatment outcomes, etc. Currently, there exist several platforms producing large-scale datasets, including genomics, transcriptomics, proteomics, metabolomics, and microbial and epidemiological data. This provides a unique opportunity of setting models and learning algorithms to enable the integration of these different heterogeneous datasets for elucidating determinant factors contributing to disease outcome and therapy in order to take full advantage of this data wealth in post-genomic medicine. In the following sections, we review some cases where machine and deep learning techniques have been used in health era and how post-genomic analysis constitutes a necessary route for optimally elucidating mechanisms of disease for an appropriate disease

*DOI: http://dx.doi.org/10.5772/intechopen.84148*

and principal component analysis (PCA).

## *Designing Data-Driven Learning Algorithms: A Necessity to Ensure Effective Post-Genomic… DOI: http://dx.doi.org/10.5772/intechopen.84148*

a collection, thus inferring how likely two different elements are to co-occur in a collection. Unsupervised learning includes hierarchical clustering [9], K-means [10], and principal component analysis (PCA).

**Reinforcement learning algorithms** are a class of learning algorithms allowing an agent to decide the optimal next action based on its current state to control an environment or a system [11], by learning behaviors that will maximize the reward or outcome [12, 13]. These algorithms interact with a system, for example human system under a specific condition which may be disease or treatment, to learn the best setting and optimally perform sequential decisions along a timeline [11], generally under uncertainty, based solely on the present state of the system. It follows that this sequential and dynamic decision-making process is assumed to be a Markov decision process [14], in which the present state of the system fully describes the system and is sufficient to optimally predict the best next state. Reinforcement learning algorithms generally use a dynamic programming method following Bellman-based optimality principle [12], which requires optimal substructure for a given optimal option. In clinical research, these algorithms can be effective for longitudinal analyses, including retrospective and prospective studies, which consist of following a cohort across a specific-time interval [15].

Most of these learning algorithms have been extensively used to overcome several issues in genomic medicine, including identification of genetic markers underlying disease risk, novel mechanisms for disease prevention, control, diagnosis and therapy, building predictive disease models, predicting treatment outcomes, etc. Currently, there exist several platforms producing large-scale datasets, including genomics, transcriptomics, proteomics, metabolomics, and microbial and epidemiological data. This provides a unique opportunity of setting models and learning algorithms to enable the integration of these different heterogeneous datasets for elucidating determinant factors contributing to disease outcome and therapy in order to take full advantage of this data wealth in post-genomic medicine. In the following sections, we review some cases where machine and deep learning techniques have been used in health era and how post-genomic analysis constitutes a necessary route for optimally elucidating mechanisms of disease for an appropriate disease clinical management, and for predicting effective therapeutic strategies.
