*Artificial Intelligence - Applications in Medicine and Biology*

**2.4 Predicting gene-gene and gene-environment interactions**

Generally, disease outcome involves multiple genes contributing at every stage of disease progression [27]. This suggests that gene-gene and gene-environment interactions influence the outcome of a disease. Genes interact in large networks, and some genes in the network are more important or central than others. Understanding these interactions is necessary for setting optimal prevention and control mechanisms to contain the disease. Identifying the distinctive nature of gene-gene and gene-environment interactions, and their impact on disease risk, has been challenging with traditional statistical methods. This is due to the high dimensionality of the data and the presence of epistasis and multiple polymorphisms, which lead to complex datasets for analysis. Machine learning methods such as SVM, ANN, and RF are used in addressing these challenges.

Neural networks use pattern recognition to address challenges in genomics. In the context of predicting gene-gene interaction, the neural network architecture depends on the type of interactions [28], shown in **Figure 3**. Genetic programming has been utilized to optimize the architecture of neural networks and back propagation for modeling gene-gene interactions, as illustrated by Ritchie et al. [29]. Genetic programming neural networks (GPNNs) were found to have more predictive power for models with heritability greater than 0.026, compared with back propagation neural networks (BPNNs), which had only 80% power for models with heritability greater than 0.051. The GPNN also outperformed the BPNN when applied to models containing functional and nonfunctional SNPs. Complex nonlinear interactions with binary endpoints that have previously been analyzed by logistic regression and classification and regression trees (CARTs) can be examined by GPNN. Motsinger et al. [30] demonstrated the use of grammatical evolution neural networks (GENNs) in detecting gene-gene and gene-environment interactions in noisy, high-dimensional data. GENNs were found to be more robust to missing data and genotyping errors.

**Figure 3.**

*Categories of gene-gene interactions retrieved from Koo et al. [23].*

On the other hand, the random forest (RF) algorithm is a flexible supervised machine learning algorithm that can be used for classification and regression. The RF algorithm is often able to produce good results even with missing values in the data and without the need for hyper-parameter tuning. Therefore, the RF algorithm can be well suited for high-dimensional genomic data analysis. This algorithm is also useful in reducing the search space of epistatic interactions, thereby creating a manageable set of possible combinations of genetic variants.

Finally, support vector machine (SVM) is a machine learning algorithm that uses hyper-planes for classification and regression tasks. The SVM approach has been applied to detecting gene-gene interactions through learning from the features of genetically interacting pairs. For training, SVM takes in two sets of feature vectors

**2.5 Elucidating disease-causing genetic variants**

The identification of disease-causing genetic variants is challenging because many of them are found in the non-coding regions of the genome, whose role in the maintenance of genome functions is not well understood. However, some machine learning algorithms have been designed to annotate coding and non-coding genetic variants in order to identify disease-causing mutations. One example is *combined annotation-dependent depletion* (CADD) [37], which trains a linear-kernel support vector machine to separate observed genetic variants from simulated ones. However, because a linear SVM cannot capture nonlinear relationships among features, a deep neural network that uses the same feature set and training data as CADD may be preferred. Deep neural networks are generally better suited than SVMs to problems with large numbers of samples and features.

How genetic variants, especially those outside protein-coding regions, affect RNA splicing is not entirely understood. This type of problem can, however, be addressed by machine learning models designed to predict splicing during gene expression. Regulation of splicing is very important, and faulty regulation can lead to several diseases, such as cancer and neurological disorders. A computational technique that scores the magnitude of the effects of genetic variants on RNA splicing was developed by Xiong et al. [38]. The model can be applied to any sequence containing a triplet of exons and used to determine how splicing is altered by genetic variants. It computes a score that predicts how much a given variant affects splicing.
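One plausible way to read such a score is as the difference between a model's prediction for the variant sequence and its prediction for the reference sequence. The sketch below illustrates only that idea. The function `toy_predict_inclusion` is a hypothetical placeholder (it simply returns GC content and has no biological validity), and none of the names come from Xiong et al. [38].

```python
from typing import Callable


def toy_predict_inclusion(sequence: str) -> float:
    """Hypothetical stand-in for a trained splicing model.

    Returns a value in [0, 1]. Here it is just the GC fraction of the
    sequence, used only so that the example runs end to end.
    """
    if not sequence:
        raise ValueError("sequence must be non-empty")
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return gc / len(sequence)


def apply_snv(sequence: str, position: int, alt: str) -> str:
    """Return a copy of `sequence` with the base at 0-based `position` replaced by `alt`."""
    if not 0 <= position < len(sequence):
        raise IndexError("position outside sequence")
    return sequence[:position] + alt + sequence[position + 1:]


def splicing_effect_score(
    reference: str,
    position: int,
    alt: str,
    predict: Callable[[str], float] = toy_predict_inclusion,
) -> float:
    """Score a single-nucleotide variant as predict(variant) - predict(reference)."""
    variant = apply_snv(reference, position, alt)
    return predict(variant) - predict(reference)


# Illustrative sequence only; a real input would span a triplet of exons.
reference = "ATGCATGCAT"
score = splicing_effect_score(reference, position=0, alt="G")
print(f"effect score: {score:+.2f}")
```

A positive or negative sign indicates the direction of the predicted change, and the absolute value indicates its magnitude. Swapping in a real trained predictor for `predict` would leave the scoring logic unchanged.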

*Linkage* and *association analysis* are two approaches used to identify genes associated with diseases, and neural networks have been applied to both. Linkage analysis is used to detect the connection between a disease locus and a marker; it takes genotypes as inputs, and its outputs are phenotype values such as disease status and quantitative clinical variables. Association analysis, on the other hand, is used to detect linkage disequilibrium between a disease locus and a marker. The data in association analysis are of case-control type, with a sample comprising genotypes for multiple markers. In most cases, it is useful to integrate genotype information into pathway analysis for a more effective biological interpretation of how these genotypes contribute to the trait under consideration. In this case, the *random survival forest pathway hunting* algorithm can be used to identify signaling pathways in a relatively small sample [39].
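To make the case-control setting concrete, the sketch below computes a standard 2x2 allelic chi-square statistic for each marker. It is a minimal illustration of a classical single-marker test, not the neural network approach discussed above. The marker names and genotype values are invented for the example, and genotypes are assumed to be coded as 0, 1, or 2 copies of the minor allele.

```python
def allele_counts(genotypes):
    """Return (minor, major) allele counts from genotypes coded 0/1/2."""
    minor = sum(genotypes)
    major = 2 * len(genotypes) - minor
    return minor, major


def allelic_chi_square(case_genotypes, control_genotypes):
    """Pearson chi-square (1 degree of freedom, no continuity correction)
    for the 2x2 table of allele counts in cases versus controls."""
    a, b = allele_counts(case_genotypes)      # cases: minor, major
    c, d = allele_counts(control_genotypes)   # controls: minor, major
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so the marker is uninformative
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical data: marker name -> (case genotypes, control genotypes).
markers = {
    "marker_1": ([2, 2, 1, 1, 2, 1], [0, 0, 1, 0, 1, 0]),
    "marker_2": ([1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1]),
}

statistics = {
    name: allelic_chi_square(cases, controls)
    for name, (cases, controls) in markers.items()
}

for name, value in statistics.items():
    print(f"{name}: chi-square = {value:.2f}")
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to p < 0.05 for a single test. When many markers are tested, that threshold needs a multiple-testing correction.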

Finally, given these features, the RF algorithm can also be used to identify a set of risk-associated SNPs among a large number of unassociated SNPs in models of complex diseases. Large-scale genetic data contain unknown interactions among true risk-associated SNPs, or between SNPs and the environment, and RF can be used to significantly reduce the number of SNPs in the data, as pointed out previously.
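The benefit of reducing the number of SNPs can be seen from simple counting: the number of candidate pairwise interactions grows quadratically with the number of SNPs. The sketch below assumes that some model, such as a random forest, has already produced an importance score per SNP. The SNP names and scores are hypothetical, and only the filtering and counting steps are shown.

```python
import math


def top_k_snps(importance, k):
    """Return the k SNP names with the highest importance scores."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:k]


def pairwise_search_space(n_snps):
    """Number of unordered SNP pairs that a pairwise interaction scan must test."""
    return math.comb(n_snps, 2)


# Hypothetical importance scores, e.g. as produced by a fitted random forest.
importance = {
    "snp_a": 0.30,
    "snp_b": 0.02,
    "snp_c": 0.25,
    "snp_d": 0.01,
    "snp_e": 0.15,
}

kept = top_k_snps(importance, k=3)
pairs_before = pairwise_search_space(len(importance))
pairs_after = pairwise_search_space(len(kept))

print(kept, pairs_before, pairs_after)
```

At a realistic scale the effect is much larger: 500,000 SNPs give about 1.25 × 10^11 pairs, whereas 1,000 retained SNPs give 499,500.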

**2.6 Applying learning algorithms in clinical decision process**

Setting appropriate diagnostic and effective therapeutic regimens is a critical clinical decision, essential for establishing effective health measures and efficient strategies to control a disease. This process is limited by the lack of advanced diagnostic tools and of approved therapies or vaccines for most existing and emerging diseases [40, 41]. Moreover, despite undeniable advances in the understanding of human biology and of the etiology and pathogenesis of several diseases, and despite the emergence of advanced technologies, the translation of existing biological knowledge into effective new treatments and clinical interventions has not been as fast as anticipated. This highlights the need for powerful and general tools to guide these clinical decision processes. Machine learning algorithms are helping to meet this need, with several advantages in representational power, even though challenges in biological interpretation still hamper clinical applications [15].

As an initial illustration, Adabor and Acquaah-Mensah [42] introduced the median supplement model, which balances a training set that has unequal numbers of instances in each class or group in order to improve the classification decision. They also assessed different machine learning techniques for predicting the receptor expression status of breast cancer patients, namely progesterone receptor (PR) status and HER2 expression status, using gene expression datasets. These receptors are essential in deciding on treatment and predicting the treatment outcome. In this chapter, we use the results of their performance evaluations to highlight two essential features common to most machine learning algorithms, as shown in **Figure 4**: (1) as the size of the training set increases, the performance of the learning algorithm increases (see Sample Data 1 vs. Sample Data 2), and (2) a learning algorithm trained on a balanced training set may perform better than one trained on an unbalanced training set (see NB vs. MNB and RF vs. MRF).
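One reason class balance matters can be shown with the classification rate itself (the proportion of correctly classified instances). The sketch below is not the median supplement model of [42]. It only demonstrates, on invented labels, that a trivial classifier can reach a high classification rate on an unbalanced set while never identifying the minority class.

```python
def classification_rate(y_true, y_pred):
    """Proportion of instances whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its instances predicted correctly."""
    recalls = {}
    for label in set(y_true):
        indices = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(y_pred[i] == label for i in indices)
        recalls[label] = hits / len(indices)
    return recalls


# Hypothetical unbalanced sample: 90 receptor-positive, 10 receptor-negative.
y_true = ["positive"] * 90 + ["negative"] * 10

# A trivial classifier that always predicts the majority class.
y_pred = ["positive"] * len(y_true)

rate = classification_rate(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)

print(f"classification rate: {rate:.2f}")
print(f"recall per class: {recalls}")
```

Here the classification rate is 0.90, yet no receptor-negative patient is recognized. This is the kind of behavior that balancing the training set is meant to counter.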

It is worth mentioning that machine learning algorithms have been used in several contexts with the common goal of improving healthcare measures and the clinical management of patients. For example, deep learning algorithms are used to classify patients based on clinical healthcare records [43], to predict the effectiveness of clinical trials (i.e., the likelihood of success or failure of a clinical trial) [44], and to improve and predict patient treatment response and outcome based on pharmacogenomics data [45]. Moreover, Nemati et al. [14] optimized a treatment dosing policy for intensive care patients using deep reinforcement learning, and Wang et al. [46] predicted drug-target binding site interactions using an ANN with two hidden layers that takes a drug and a target binding site as inputs. Finally, drug repositioning or re-purposing, which examines new therapeutic uses for approved drugs, is known to be an attractive model for suggesting new drugs using drug-target interactions [40, 41]. Wang and Zeng [47] used a learning technique based on restricted Boltzmann machines to predict novel drug-target interactions that point toward drug re-purposing.

**Figure 4.**

*Performance of different machine learning techniques for predicting the progesterone receptor (PR) status phenotype of breast cancer patients, based on classification rate (proportion of correctly classified instances); information extracted from [42]. Sample Data 1 is smaller than Sample Data 2; they contain 162 and 1146 instances of breast cancer patients, respectively. Learning techniques: support vector machine (SVM), logistic regression (logistic), Bayesian network (BN), Naive Bayes (NB), random trees (RT), random forest (RF), median-supplement Naive Bayes (MNB), and median-supplement random forest (MRF).*

*Designing Data-Driven Learning Algorithms: A Necessity to Ensure Effective Post-Genomic…*

**3. Integrative approaches for post-genomic analysis**

Over the years, thousands of genetic associations have been discovered using a genetic approach known as *genome-wide association studies* (GWAS). GWAS approaches are mostly based on a single-marker association test model that leverages thousands of genomes of cases and controls (affected and healthy individuals) in order to identify variants or single-nucleotide polymorphisms (SNPs) whose frequencies differ significantly between the two groups across the genome [48]. In this sense, GWAS approaches resemble machine learning techniques that take the SNP profiles of cases and controls as inputs and predict SNPs carrying disease risk. Note that these approaches have been successful [49], and several GWAS results have helped elucidate genetic determinants of susceptibility to several diseases, including complex diseases, such as cancer, and monogenic diseases, such as sickle cell

*DOI: http://dx.doi.org/10.5772/intechopen.84148*


