**2.4 Predicting gene-gene and gene-environment interactions**

Generally, disease outcome involves multiple genes contributing at every stage of disease progression [27]. This suggests that gene-gene and gene-environment interactions influence the outcome of a disease. Genes interact in large networks, and some genes in a network are more central than others. Understanding these interactions is necessary for designing effective prevention and control measures. Traditional statistical methods have struggled to identify the nature of gene-gene and gene-environment interactions and their impact on disease risk. The difficulty stems from the high dimensionality of the data and from the presence of epistasis and multiple polymorphisms, which together produce complex datasets. Machine learning methods such as support vector machines (SVMs), artificial neural networks (ANNs), and random forests (RFs) are used to address these challenges.

Neural networks use pattern recognition to address challenges in genomics. In predicting gene-gene interactions, the neural network architecture depends on the type of interaction [28], as shown in **Figure 3**. Genetic programming has been used to optimize the architecture of neural networks for modeling gene-gene interactions, as illustrated by Ritchie et al. [29], who compared the result with conventional back propagation. Genetic programming neural networks (GPNNs) showed higher predictive power, for models with heritability greater than 0.026, than back propagation neural networks (BPNNs), which reached only 80% power for models with heritability greater than 0.051. The GPNN also outperformed the BPNN when applied to models containing both functional and nonfunctional SNPs. Complex nonlinear interactions with binary endpoints, previously analyzed by logistic regression and classification and regression trees (CARTs), can be examined with GPNN. Motsinger et al. [30] demonstrated the use of grammatical evolution neural networks (GENNs) in detecting gene-gene and gene-environment interactions in noisy, high-dimensional data. GENNs were found to be more robust to missing data and genotyping errors.
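As a rough illustration of what a heritability figure measures for a two-locus disease model, the sketch below computes the prevalence, the marginal penetrances, and the broad-sense heritability of a purely epistatic model. Everything in it is an assumption made for illustration: the penetrance table, the allele frequency of 0.5 at both loci, and the formula h² = Var(penetrance) / (K(1 − K)), where K is the prevalence. None of these values is taken from Ritchie et al. [29].

```python
# Hypothetical two-locus model: rows are genotypes at SNP A (AA, Aa, aa),
# columns are genotypes at SNP B (BB, Bb, bb). Values are disease penetrances.
penetrance = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]

# Genotype frequencies under Hardy-Weinberg equilibrium with allele frequency 0.5 (assumed).
geno_freq = [0.25, 0.5, 0.25]

# Prevalence K: penetrance averaged over all two-locus genotypes.
prevalence = sum(
    geno_freq[i] * geno_freq[j] * penetrance[i][j]
    for i in range(3)
    for j in range(3)
)

# Marginal penetrance of each SNP A genotype, averaging over SNP B.
marginal_a = [
    sum(geno_freq[j] * penetrance[i][j] for j in range(3))
    for i in range(3)
]

# Variance of penetrance across genotypes, and the resulting heritability.
variance = sum(
    geno_freq[i] * geno_freq[j] * (penetrance[i][j] - prevalence) ** 2
    for i in range(3)
    for j in range(3)
)
h2 = variance / (prevalence * (1 - prevalence))

print("prevalence:", round(prevalence, 4))
print("marginal penetrance by SNP A genotype:", [round(m, 4) for m in marginal_a])
print("heritability:", round(h2, 4))
```

In this toy model every SNP A genotype carries the same marginal risk, so a method that tests one SNP at a time sees no signal even though the pair of SNPs jointly determines risk. This is the kind of nonlinear interaction the neural network approaches are meant to capture.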

Random forest (RF), by contrast, is a flexible supervised machine learning algorithm that can be used for both classification and regression. It often produces good results even with missing values in the data and with little or no hyper-parameter tuning, which makes it well suited to the analysis of high-dimensional genomic data. RF is also useful for reducing the search space of epistatic interactions, yielding a manageable set of possible combinations of genetic variants.
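To see why shrinking the candidate set matters, it helps to count the combinations involved. The short sketch below is plain arithmetic rather than an RF implementation. The SNP counts in it (500,000 genotyped SNPs, 1,000 retained after an importance-based filter such as RF) are hypothetical figures chosen for illustration, not values from any cited study.

```python
from math import comb


def n_interactions(n_snps: int, order: int = 2) -> int:
    """Number of distinct SNP combinations of the given interaction order."""
    return comb(n_snps, order)


# Hypothetical sizes, chosen only for illustration.
all_snps = 500_000   # SNPs before filtering (assumed)
kept_snps = 1_000    # SNPs retained after a filter such as RF importance ranking (assumed)

pairs_before = n_interactions(all_snps)
pairs_after = n_interactions(kept_snps)
triples_after = n_interactions(kept_snps, order=3)

print(f"pairwise combinations before filtering: {pairs_before:,}")
print(f"pairwise combinations after filtering:  {pairs_after:,}")
print(f"three-way combinations after filtering: {triples_after:,}")
```

Under these assumed sizes, filtering takes the pairwise search from roughly 125 billion combinations to under half a million, which is what makes an exhaustive interaction test on the retained set feasible.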

Finally, support vector machine (SVM) is a machine learning algorithm that uses hyperplanes for classification and regression tasks. The SVM approach has been applied to detecting gene-gene interactions by learning from the features of genetically interacting pairs. For training, the SVM takes in two sets of feature vectors labeled as positive and negative, indicating the presence and absence of genetic interaction, respectively. It then finds the maximum-margin hyperplane that separates genetically interacting pairs from non-interacting pairs. Matchenko-Shimko and Dube [31] used SVM and neural network modeling to investigate gene-gene interactions, pre-selecting SNP-SNP combinations to determine the effects of interactions between genes. However, the pre-selection strategy did not work well for combinations with low disease allele frequencies and low marginal effects. Larger sample sizes were found to be required for detecting gene-gene interactions among SNPs with low marginal effect sizes than for interactions with moderate marginal effect sizes. Both SVM and ANN models performed well as allele frequency increased, even with low marginal gene effects [31]. SVM has also been used to identify the most promising SNPs and interactions. Shen et al. applied it in a two-stage approach to determining gene-gene interactions, in which the second stage applies logistic regression analysis. SVM was also shown to be useful in methods for case-control studies, in which multiple logistic regression performs better than traditional logistic regression applied to each interaction. Additionally, extending pedigree-based generalized multifactor dimensionality reduction with the SVM, which has also been applied to improve the accuracy of cancer classification, has been effective in detecting gene-gene and gene-covariate interactions in limited family samples [32]. Moreover, automatic literature mining based on dependency parsing and SVM can be used to extract known gene-disease associations and to infer genes for future experimental analysis [33].
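The training setup described for the SVM, with positive and negative feature vectors separated by a maximum-margin hyperplane, can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the method of any cited study: it fits a standard soft-margin linear SVM by subgradient descent, and the two pair-level features (a co-expression score and an annotation-similarity score) and all of their values are invented. A library implementation would normally be used in practice.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, lam=0.001, lr=0.1, epochs=200):
    """Fit weights w and bias b by subgradient descent on the regularised hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # A point violates the margin if it is misclassified or too close to the hyperplane.
            violated = yi * (dot(w, xi) + b) < 1
            # Regularisation step: shrinking w favours a wider margin.
            w = [wj * (1 - lr * lam) for wj in w]
            if violated:
                # Hinge-loss step: move the hyperplane away from the violating point.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 (interacting pair) or -1 (non-interacting pair)."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical feature vectors for gene pairs: [co-expression score, annotation similarity].
interacting = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.85], [0.85, 0.7]]
non_interacting = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.3, 0.15]]

X = interacting + non_interacting
y = [1] * len(interacting) + [-1] * len(non_interacting)

w, b = train_linear_svm(X, y)
print("prediction for an unseen high-scoring pair:", predict(w, b, [0.9, 0.9]))
print("prediction for an unseen low-scoring pair: ", predict(w, b, [0.1, 0.1]))
```

Because the toy classes are well separated, a linear boundary suffices here. Real pair-level features are rarely this clean, which is one reason the sample-size and marginal-effect caveats above matter.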

In addition, the application of SVM in SUPPORTMIX [34], a local ancestry inference method, facilitates the study of gene-gene and gene-environment interactions. For instance, Aschard et al. [35] highlighted that local ancestry estimates might provide insights into detecting gene-gene interactions, while Florez et al. [36] showed that non-European ancestry in Latino populations is associated with type 2 diabetes and lower economic status, illustrating gene-environment interaction. Local ancestry inference estimates the proportion of alleles that originate from a particular population at every chromosomal site of an admixed individual. SUPPORTMIX integrates SVM with hidden Markov models (HMMs). Using SVM in SUPPORTMIX improves multi-way local ancestry inference overall, since it addresses the challenge of having few genotyped or existing reference panels [1]. Furthermore, it facilitates the study of both gene-gene and gene-environment interactions through improved computational time, a result of its flexibility and ability to handle "big data."
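The text does not describe how SUPPORTMIX couples its two components, so the following is only a generic sketch of one way a per-window classifier and an HMM can be combined for local ancestry calls. The window probabilities stand in for classifier output and are made up. The two-ancestry setup, the uniform starting distribution, and the stay probability of 0.9 are likewise assumptions. The HMM step is ordinary Viterbi decoding.

```python
import math


def viterbi_smooth(window_probs, stay=0.9):
    """Most likely ancestry path given per-window class probabilities.

    window_probs[t][k] is the (assumed) classifier probability that window t
    has ancestry k; it is treated here as the HMM emission score.
    """
    n_states = len(window_probs[0])
    switch = (1.0 - stay) / (n_states - 1)
    log_trans = [
        [math.log(stay if i == j else switch) for j in range(n_states)]
        for i in range(n_states)
    ]

    # Initialise with a uniform prior over ancestries.
    score = [math.log(1.0 / n_states) + math.log(p) for p in window_probs[0]]
    back = []

    for probs in window_probs[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            # Best previous state from which to enter state j.
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + math.log(probs[j]))
            pointers.append(best_i)
        back.append(pointers)
        score = new_score

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical per-window probabilities for ancestries 0 and 1 along one chromosome.
window_probs = [
    [0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.85, 0.15], [0.2, 0.8], [0.1, 0.9],
]

raw_calls = [max(range(2), key=lambda k: p[k]) for p in window_probs]
smoothed_calls = viterbi_smooth(window_probs)
print("per-window calls:", raw_calls)
print("after smoothing: ", smoothed_calls)
```

In this toy input the third window leans weakly toward ancestry 1 while its neighbours point strongly to ancestry 0. Because ancestry switches are penalised, the smoothed path treats that window as noise and keeps a single switch later along the chromosome.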

**2.5 Elucidating disease-causing genetic variants**

The identification of disease-causing genetic variants is challenging because several of them are found in the non-coding regions of the genome, whose role in the maintenance of genome functions is not well understood. However, some machine learning algorithms have been designed to annotate coding and non-coding genetic variants in order to identify disease-causing mutations. *Combined annotation-dependent depletion* (CADD) is one such algorithm [37]. CADD trains a linear-kernel support vector machine to separate observed genetic variants from simulated ones. However, because a linear SVM cannot capture nonlinear relationships among features, a deep neural network that uses the same feature set and training data as CADD is preferred. Deep neural networks are better suited than SVMs for problems with large numbers of samples and features.

How genetic variants, especially those which are not within protein coding regions, affect RNA splicing is not entirely understood. This type of problem can

**Figure 3.** *Categories of gene-gene interactions retrieved from Koo et al. [23].*

## *Designing Data-Driven Learning Algorithms: A Necessity to Ensure Effective Post-Genomic… DOI: http://dx.doi.org/10.5772/intechopen.84148*

