### 1. Introduction

Advances in high-throughput technologies have generated huge amounts of human genomic data in the public domain. These data are useful in medical and population genetics for understanding population history, human evolution and demographics, susceptibility to disease, and response to drugs. Over time, humanity has experienced the exchange of genetic material across populations, mainly due to population migrations [1], which has led to wide human genetic variation as a result of interbreeding between previously isolated populations. The genetic variations observed in human deoxyribonucleic acid (DNA) sequences are caused by inheritance processes such as mutation and recombination. Generally, the mating process yields genetic recombination breakpoints, introduces some variations, and creates mixed DNA segments. As a consequence, current human populations are admixed [2, 3], with individual genomes displaying a mosaic of segments originating from different ancestral populations [1, 2, 4], and with wide phenotypic variation, divergent genetic ancestry, and different traits observed among individuals in worldwide population groups. Thus, it is critical to understand the dynamics related to the origin of these variations, the evolutionary process, and its consequences for human heredity and health.

Studying admixture patterns in human populations consists of characterizing admixture features, including admixture mapping and the dating of admixture events. Admixture mapping combines the identification of genetic variants underlying ethnic differences in disease risk with the inference of ancestry estimates associated with these variants. Estimation of ancestry is commonly known as genetic ancestry inference, which is either global or local. Global ancestry inference estimates the overall proportion contributed by each ancestral population to the admixed genome, whereas local ancestry deconvolution (local ancestry inference) estimates the number of copies from a particular population at a given site [5]. Together, admixture mapping and admixture dating provide a better understanding of genetic variation throughout modern human evolution, and of the demographics and adaptive processes of human populations. Analyzing admixture patterns has become central to genomics research, contributing to a wide range of biomedical applications. Current advances in technology are facilitating the movement of people worldwide, thus increasing the complexity of population admixture dynamics and leading to multifaceted admixture events. At the same time, the determination of local ancestry from genotyping and microarray datasets has empowered approaches for dating mutation, selection, and admixture events [6, 7].


Orienting Future Trends in Local Ancestry Deconvolution Models to Optimally Decipher…

DOI: http://dx.doi.org/10.5772/intechopen.82764


Bioinformatics Tools for Detection and Clinical Interpretation of Genomic Variations

The significance of local ancestry inference is reflected in the research interest it has attracted over the last two decades. Several models exist for local ancestry deconvolution, including ANCESTRYMAP [8], ADMIXMAP [9], SABER [10], LAMP [11], LAMPLD/LAMPHAP [12], SUPPORTMIX [13], EILA [14], and LOTER [15]. Figure 1 shows the timeline of these models, indicating when each was introduced. Local ancestry inference is relevant to personalizing medicine, understanding complex diseases, localizing missing sequences in reference genomes, and understanding population history and demographics. Several studies have also focused on dating past admixture events, which is relevant to population migrations, heritable genes associated with some diseases, and responses to treatment [16]. The date of admixture in a given population can be predicted by analyzing ancestral tracts, breakpoints, and linkage disequilibrium (LD) [17]. LD and ancestral tracts in admixed genomes are also used to distinguish between the dates of separate admixture events [17]. There are now several models for predicting the age of an admixture event, which fall into two main groups: LD-based approaches and haplotype-based approaches [17, 18]. These models use genomes from several population groups around the world as proxies for the ancient populations known to have influenced the migration and/or admixture processes that yielded the admixed population patterns observed worldwide (Figure 2).

#### Figure 1.

The evolution of local ancestry deconvolution models from 2003 to 2017.

#### Figure 2.

A partial worldwide admixture painting map. The figure shows several worldwide admixed populations with patterns identified in published papers on population structure from 2008 to 2018. Population migrations within and between continents have resulted in admixed populations ranging from one- to five-way admixtures.

In this chapter, we survey current models for deconvoluting local ancestry and dating admixture events, and explore the computational techniques used in these models. We highlight the advances made so far in this genomic era, the opportunities these models offer, and the challenges or gaps that still need to be addressed. This informs users and researchers about the current state of research and orients future trends in designing more effective models that account for current challenges and produce more accurate and biologically relevant estimates. In the subsequent sections, we provide an overview of existing methods for inferring local ancestry and dating admixture events.

### 2. Overview of admixture feature inference models

In this section, we survey current models used to elucidate admixture patterns, including local ancestry estimation (deconvolution) and the dating of admixture events. These models assume that the T genotyped sites are biallelic and that the genotypes of the K candidate ancestral reference populations and of the admixed population are known. Ancestry at different sites or windows follows a Markov chain. Recombination is assumed to occur in every generation, resulting in Poisson-distributed recombination points with a rate that depends on both the recombination rate, r, and the number of generations since admixture, g. Individuals are assumed to be independent of each other.
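The Poisson assumption on recombination points can be illustrated with a short simulation. This is only a sketch under a stated assumption: the combined rate per unit genetic distance is taken to be the product r * g, which is one simple way of letting the rate depend on both quantities and is not a formula given in this chapter. The function names are hypothetical.

```python
import math
import random


def simulate_breakpoints(length, r, g, rng):
    """Sample recombination points on a segment of the given genetic length.

    Assumption for illustration only: points follow a Poisson process whose
    rate per unit genetic distance is r * g.
    """
    rate = r * g
    breakpoints = []
    # Gaps between consecutive points of a Poisson process are exponential.
    position = rng.expovariate(rate)
    while position < length:
        breakpoints.append(position)
        position += rng.expovariate(rate)
    return breakpoints


def prob_no_breakpoint(d, r, g):
    """Probability that no recombination point falls within distance d."""
    return math.exp(-r * g * d)


rng = random.Random(42)
points = simulate_breakpoints(length=1.0, r=1.0, g=10, rng=rng)
print(len(points), "breakpoints on one simulated segment")
print(prob_no_breakpoint(0.01, r=1.0, g=10))
```

Under this assumption, more generations since admixture give more breakpoints and therefore shorter ancestry segments, which is the signal that dating methods exploit.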

#### 2.1 Local ancestry inference models

As pointed out previously, existing local ancestry inference models can be categorized into two main groups based on whether or not the model makes use of admixture/background linkage disequilibrium (LD).

#### 2.1.1 LD-based models for local ancestry inference

LD-based models account for LD in local ancestry deconvolution and, because of the importance of LD in disease mapping, the first local ancestry methods fall into this category. They assume that ancestry along an admixed individual's genome follows a first-order Markov chain, meaning that the immediately preceding state captures all the information on past states [19]. LD-based models further assume that, at every site, the observed admixed genotypes are generated by the unobserved ancestry; hence, the hidden Markov model (HMM) and its extensions are used to infer the unobserved (hidden) states. To deconvolute ancestry along the admixed genome, these models have three components: the initial, transition, and observation (or emission) probability models. Because of uncertainty and the number of parameters involved, LD-based methods use Markov chain Monte Carlo (MCMC), forward-backward, or Viterbi algorithms to determine the hidden ancestry sequence for a given individual. Falush et al. and Patterson et al. modeled the ancestry switch between ancestral populations at a given site, $X_t \in \{1, \dots, K\}$, by

$$P(X_1 = k \mid q, r) = q_k, \tag{1}$$


$$P(X_t = k \mid X_{t-1} = k', q, r) = \delta(k' = k)\, e^{-d_t r} + \left(1 - e^{-d_t r}\right) q_k \quad \text{for } 1 < t \le T. \tag{2}$$

Equations (1) and (2) give the ancestry probability at the first marker and the transition probability between consecutive markers, respectively. Here $\delta(k' = k)$ is the indicator function, $d_t$ is the genetic distance between sites $t-1$ and $t$, and $q_k$ is the proportion of ancestry contributed by candidate ancestral population $k$, such that $q = (q_1, \dots, q_K)$ is the vector of ancestry proportions inherited from each ancestral population. On haploid data, the probability of a recombination event is $1 - e^{-d_t r}$, meaning that the probability of no recombination is $e^{-d_t r}$ [8, 20]. LD-based methods can be subdivided into admixture LD-based methods and admixture and background LD methods. Admixture LD occurs when ancestry at nearby markers is inherited together, whereas background LD is the LD within ancestral populations, which depends strongly on population history (i.e., it is generated by genetic drift and population bottlenecks).
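As an illustration of Eqs. (1) and (2), the initial distribution and the transition matrix can be computed directly from the ancestry proportions q, the genetic distance d_t, and the rate r. The following is a minimal sketch; the function names and the numerical values are hypothetical and chosen only for the example.

```python
import math


def initial_probs(q):
    """Eq. (1): P(X_1 = k | q, r) = q_k."""
    return list(q)


def transition_matrix(q, d_t, r):
    """Eq. (2): entry [k_prev][k] is P(X_t = k | X_{t-1} = k_prev, q, r)."""
    stay = math.exp(-d_t * r)  # probability of no recombination
    num_pops = len(q)
    return [
        [(stay if k == k_prev else 0.0) + (1.0 - stay) * q[k]
         for k in range(num_pops)]
        for k_prev in range(num_pops)
    ]


q = [0.7, 0.3]  # hypothetical ancestry proportions for K = 2
A = transition_matrix(q, d_t=0.01, r=10.0)
for row in A:
    print(row, "row sum:", sum(row))
```

Each row sums to one: with probability $e^{-d_t r}$ the ancestry is copied from the previous site, and otherwise it is redrawn from q, which may select the same population again.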

#### 2.1.1.1 Admixture LD-based models

Admixture LD-based methods account for the LD that results from the admixture process; they do not model background LD. They include the early methods, for example STRUCTURE V2 [20], ANCESTRYMAP [8], and ADMIXMAP [9], which are based on the Bayesian framework. Early methods rely on ancestry informative markers (AIMs), that is, markers that show a significant difference in frequency between ancestral populations. Admixture LD-based models assume that markers are independent and that the global and ancestral allele frequencies are known. They integrate HMM with MCMC, and their initial and transition models are as in Eqs. (1) and (2), respectively. Since admixture LD-based methods do not model background LD, their observation model depends only on the allele frequency of the ancestry at that site. For instance, assuming K = 2, Patterson et al. defined the emission probability by

$$P(Y_t = y \mid X_t = n_a) = \begin{cases} \binom{2}{y}\, p_k^{y} (1 - p_k)^{2 - y} & \text{if } n_a = 0 \text{ or } 2, \\ (1 - p_1)(1 - p_2) & \text{if } n_a = 1,\ y = 0, \\ p_1 (1 - p_2) + p_2 (1 - p_1) & \text{if } n_a = 1,\ y = 1, \\ p_1 p_2 & \text{if } n_a = 1,\ y = 2, \end{cases} \tag{3}$$

where $y \in \{0, 1, 2\}$ is the number of reference alleles carried by the admixed individual at site $t$, $n_a \in \{0, 1, 2\}$ is the number of alleles inherited from population 1, and $p_k$ is the frequency of the reference allele in population $k \in \{1, 2\}$ at site $t$, with $p_k = p_2$ when $n_a = 0$ and $p_k = p_1$ when $n_a = 2$.

Technological, statistical, and computational advances now make enormous amounts of high-density SNP data available. Although high-density SNPs violate the independence assumption because of background LD [21], they contain more information than AIMs [22]. To loosen the independence assumption and minimize noise and systematic biases from unmodeled LD, more advanced local ancestry inference methods emerged [22]. These methods include SEQMIX [23], PCADMIX [24], and SUPPORTMIX [13].
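The two-population emission model of Eq. (3) can be sketched in a few lines. The sketch assumes that n_a counts the alleles inherited from population 1, that p1 and p2 are the reference allele frequencies in populations 1 and 2, and that the two alleles of a genotype are drawn independently. The function name is hypothetical.

```python
from math import comb


def emission_prob(y, n_a, p1, p2):
    """P(Y_t = y | X_t = n_a) for K = 2 ancestral populations.

    y   : number of reference alleles in the genotype (0, 1 or 2)
    n_a : number of alleles inherited from population 1 (0, 1 or 2)
    """
    if n_a == 1:
        # One allele from each population, drawn independently.
        return {
            0: (1 - p1) * (1 - p2),
            1: p1 * (1 - p2) + p2 * (1 - p1),
            2: p1 * p2,
        }[y]
    # Both alleles come from the same population: binomial with its frequency.
    p_k = p1 if n_a == 2 else p2
    return comb(2, y) * p_k ** y * (1 - p_k) ** (2 - y)


for n_a in (0, 1, 2):
    probs = [emission_prob(y, n_a, p1=0.9, p2=0.1) for y in (0, 1, 2)]
    print(n_a, probs, "sum:", sum(probs))
```

For every value of n_a the three genotype probabilities sum to one, which is a quick sanity check on any implementation of the observation model.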

SUPPORTMIX [13] models only admixture LD by combining support vector machines (SVMs) with an HMM. It was proposed in 2012 to reduce computation time and to address the challenge of sparsely typed or nonexistent reference panels, which overall improves multi-way local ancestry deconvolution. SUPPORTMIX was the first model to allow the learning of ancestral surrogates from a pool of reference panels. As a result, it can train on ancestral populations that are larger than the admixed ones. Since SVMs can handle huge datasets, SUPPORTMIX is faster than early methods, and it uses rich haplotype information. Also proposed in 2012, PCADMIX [24] divides the genome into contiguous windows of SNPs, as in SUPPORTMIX, and applies principal component analysis to proxy ancestral haplotypes to model admixture LD under a standard HMM. Like SUPPORTMIX, PCADMIX is fast and requires phased data. However, SUPPORTMIX and PCADMIX do not model phase switch errors, and SEQMIX [23] was therefore proposed in 2013. Unlike all other admixture LD-based methods, SEQMIX works on exome sequencing read data; it uses an HMM, models only admixture LD, and prunes SNPs in background LD. To reduce the noise and systematic biases that arise from using all SNPs [10] without fully modeling background LD, methods that model both admixture and background LD emerged [22].
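The window-based methods described above share a common pattern: each window receives a per-population score, and an HMM then smooths the window labels along the genome. The sketch below illustrates only the smoothing step with a generic Viterbi decoder. It is not the implementation of SUPPORTMIX or PCADMIX, and the window scores and the stay probability of 0.95 are hypothetical placeholders for whatever a window classifier would produce.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the most probable state path.

    log_emit[w][k] is the log-score of population k in window w.
    """
    num_states = len(log_init)
    score = [log_init[k] + log_emit[0][k] for k in range(num_states)]
    back = []
    for w in range(1, len(log_emit)):
        prev = score
        pointers, score = [], []
        for k in range(num_states):
            best = max(range(num_states), key=lambda j: prev[j] + log_trans[j][k])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][k] + log_emit[w][k])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(num_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical per-window probabilities that a window comes from population 0.
p0 = [0.9, 0.8, 0.4, 0.9, 0.1, 0.1, 0.2]
log_emit = [[math.log(p), math.log(1 - p)] for p in p0]
log_init = [math.log(0.5), math.log(0.5)]
stay = 0.95
log_trans = [[math.log(stay), math.log(1 - stay)],
             [math.log(1 - stay), math.log(stay)]]
print(viterbi(log_init, log_trans, log_emit))
```

In this toy input the third window weakly favors population 1, but the cost of two ancestry switches outweighs that evidence, so the decoder keeps it with its neighbors. This is the practical effect of modeling admixture LD across windows.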

#### 2.1.1.2 Admixture and background LD models

Since biological data often contain dependencies that violate the independence assumption of the standard HMM, admixture LD-based methods are often not realistic. To relax the independence assumption, the HMM is extended to a Markov HMM (MHMM), factorial HMM, hierarchical HMM, or two-layer HMM, or replaced by other multivariate statistical models such as the multivariate normal distribution (MVN), and, unlike in early methods, rich ancestral haplotype data are used. This is the case for SABER [10], SWITCH [25], HAPAA [26], HAPMIX [4], MULTIMIX [27], ALLOY [28], and ELAI [29]. MHMMs were the first HMM extension used in local ancestry inference; they were first implemented in SABER and later in SWITCH. SABER was the first method to model background LD in genetic ancestry inference. The MHMM assumes that the current observed haplotype depends on both the current ancestry and the immediately preceding observation. The difference between the MHMM and the admixture LD-based HMM is that, when ancestry does not switch between sites $t-1$ and $t$, the MHMM observation model depends on the joint distribution of alleles at the two sites [6, 30], defined as follows [10]:

$$P(Y_t = c \mid Y_{t-1} = d, X_t = k, X_{t-1} = k') = B_t(c, d, k', k),$$

$$B_t(c, d, k', k) = \begin{cases} \tilde{B}_{k,t}(c, d) & \text{for } k' = k, \\ B_{k,t}(c) & \text{otherwise,} \end{cases} \tag{4}$$

two-way admixtures. It uses the naive Bayes classifier and a clustering algorithm known as the iterative conditional modes. LAMP estimates the most probable ancestry at a site by applying the majority vote for each SNP [11]. Although accuracy is comprised, LAMP does not suffer from challenges of HMM and extension. As a result, LAMP underperforms in closely related populations, and hence it was extended to WINPOP [31], a dynamic programming algorithm. Unlike LAMP, WINPOP assumes at least one recombination event within each window and varies the window length depending on the genetic distance between populations. Hence, WINPOP and LAMP outperform other methods in closely and distantly related populations, respectively. Both LAMP and WINPOP assume unlinked markers and

Orienting Future Trends in Local Ancestry Deconvolution Models to Optimally Decipher…

As the admixed sequence data availability increases, Maples et al. proposed a discriminative approach to estimate local ancestry, RFMIX [32]. A discriminative approach estimates the posterior probability directly and not via the joint probability distribution. In contrast to generative ancestry inference models, RFMIX uses the information contained in admixed individuals. This is advantageous in cases of genotyped few reference panels. This is the case for Native Americans [32]. RFMIX uses conditional random fields (CRFs) parametrized on random forests. It outperforms in multi-way admixtures maybe due to modeling phase switch errors. In 2013, EILA [14], a multivariate statistic based method, was proposed particularly to increase inference power through addressing three common challenges in local ancestry. Addressed challenges are the independence of SNP assumption, difficulties in identifying break points, and the use of three genotype values. Instead of raw genotypes, EILA uses a numerical value between 0 and 1. The score determines how close SNPs are to the ancestral populations. Breakpoints are a challenge to identify, but EILA identifies them by fused quantile regression facilitating the use of estimates in admixture dating. Finally, k-means classifiers are used to infer ancestry

Recently, a software package that deconvolves local ancestry in multi-way admixtures for a wide range of species, LOTER [15], was proposed. LOTER can account for phase errors in two-way admixture only. It facilitates the local ancestry inference process and its application in non-model species [15]. Unlike other methods, LOTER needs no biological such as admixture time and recombination rate or statistical parameters such as, number of hidden states and misfit probabilities to deconvolve ancestry [15]. Although it uses the Li and Stephen's copying model [33] as in LAMPLD/LAMPHAP, LOTER is a nonprobabilistic approach formulated from an optimization problem. Its solution is obtained through dynamic

Finally, different existing LD and non-LD-based local ancestry inference models

Several models are now available to determine the date of admixture events in a given admixed genome. Breakpoints of haplotypes are used by some models while others focus on the ancestry blocks. Models based on ancestry blocks for dating admixture are formulated using either an empirical criteria or variants associated with a specific population. In order to determine the average length of the admixture block, these methods then assign ancestry on predefined windows using either wavelet transformation or conditional random fields [35]. On the other hand, there are models requiring rapid decrease in haplotype block sizes to estimate the date of the admixture event [36]. This suggests that, in general,

are summarized in Table 1 extracted from Geza et al. [34].

2.2 Models for dating admixture events in a genome

discards SNPs in LD.

DOI: http://dx.doi.org/10.5772/intechopen.82764

using all genotyped SNPs [14].

programming.

41

where <sup>B</sup>~k,t ð Þ c; d is the probability of having alleles at marker t provided there was allele d at t � 1 and Bk,tð Þc the allele frequency of alleles at marker t have for origins the population k. However, if the ancestry does not switch, then the observation model is like that of models in Section 2.1.1.1. The transition model of the SABER model accounts for the differences in admixture times that are in the real case of continuous gene flow where populations contribute their genetic material to the admixture in different generations [10]. Tang et al. defined the probability of switching from ancestry k at t to k at t as

$$A\_{\boldsymbol{\dot{y}}} = \begin{cases} q\_i \frac{\mathcal{G}\_i^2}{\sum\_{k=1}^K q\_k \mathcal{g}\_k} - \mathcal{g}\_i, & \text{for } \mathbf{i} = \mathbf{j}, \\ q\_j \frac{\mathcal{G}\_i \mathcal{G}\_j}{\sum\_{k=1}^K q\_k \mathcal{g}\_k}, & \text{otherwise} \end{cases} \tag{5}$$

where gk is the admixture time when population k started to contribute to the admixture.

However, SABER has a large parameter set, and does not explicitly model background LD as it models background LD using first order Markov chain [22]; other methods such as SWITCH were proposed. SWITCH takes into recombination even if it does not result in an ancestry switch, emerged. In contrast to SABER, SWITCH conditions the MHMM on recombination. Similar to early methods, probability of recombination depends on the admixture generations, genetic distance between consecutive SNPs, and the recombination rate. Thus, if the transition probability model in SWITCH is marginalized over recombination, then it is similar to Eq. (2) for two-way and Eq. (5) for multi-way. Although SWITCH models background LD and estimates recombination rates, the authors recommended richer MHMM or other different models that would outperform the SWITCH and SABER pairwise models [25]. As a result, methods that use both large- and small-scale HMM, referred to as the HHMM, were introduced.

#### 2.1.2 Non-LD-based local ancestry inference models

Non-LD methods model neither background nor admixture LD. They either remove SNPs in LD, as LAMP [11] and WINPOP [31] do, or use all SNPs (linked and unlinked) without modeling LD, as EILA [14], RFMIX [32], and LOTER [15] do. Since MHMMs have a large number of parameters and do not model LD explicitly, LAMP [11], an algorithmic approach that divides the genome into windows of SNPs, emerged in 2008. LAMP is fast and robust, and can infer local ancestry even without proxy ancestral genotypes; this is the case for two-way admixtures. It uses a naive Bayes classifier and a clustering algorithm known as iterative conditional modes. LAMP estimates the most probable ancestry at a site by applying a majority vote for each SNP [11]. Although accuracy is compromised, LAMP does not suffer from the challenges of HMMs and their extensions. Because of this compromise, LAMP underperforms in closely related populations, and hence it was extended to WINPOP [31], a dynamic programming algorithm. Unlike LAMP, WINPOP assumes at least one recombination event within each window and varies the window length depending on the genetic distance between populations. Hence, WINPOP and LAMP outperform other methods in closely and distantly related populations, respectively. Both LAMP and WINPOP assume unlinked markers and discard SNPs in LD.
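The per-SNP majority vote can be illustrated with a short sketch. It is not LAMP's implementation: the window boundaries and the per-window ancestry labels are assumed to come from an upstream classifier, and the function name and data layout are hypothetical.

```python
from collections import Counter


def majority_vote(n_snps, windows):
    """Assign each SNP the ancestry label that most of its covering windows carry.

    windows: list of (start, end, label) tuples, end exclusive; windows may overlap.
    Returns one label per SNP, or None where no window covers the SNP.
    Ties are broken in favour of the label seen first.
    """
    votes = [Counter() for _ in range(n_snps)]
    for start, end, label in windows:
        for s in range(max(start, 0), min(end, n_snps)):
            votes[s][label] += 1
    return [c.most_common(1)[0][0] if c else None for c in votes]


# Five overlapping windows over six SNPs, with hypothetical labels "A" and "B".
windows = [(0, 3, "A"), (1, 4, "A"), (2, 5, "B"), (3, 6, "B"), (4, 6, "B")]
calls = majority_vote(6, windows)
```

Overlapping windows smooth the call at each SNP, at the cost of blurring the exact position of an ancestry switch.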

As admixed sequence data became increasingly available, Maples et al. proposed RFMIX [32], a discriminative approach to estimating local ancestry. A discriminative approach estimates the posterior probability directly rather than via the joint probability distribution. In contrast to generative ancestry inference models, RFMIX uses the information contained in admixed individuals. This is advantageous when few reference panels have been genotyped, as is the case for Native Americans [32]. RFMIX uses conditional random fields (CRFs) parametrized by random forests. It outperforms other methods in multi-way admixtures, possibly because it models phase switch errors. In 2013, EILA [14], a method based on multivariate statistics, was proposed to increase inference power by addressing three common challenges in local ancestry inference: the assumption of independent SNPs, the difficulty of identifying breakpoints, and the use of three genotype values. Instead of raw genotypes, EILA uses a numerical score between 0 and 1 that determines how close SNPs are to the ancestral populations. Breakpoints are challenging to identify, but EILA identifies them by fused quantile regression, which facilitates the use of its estimates in admixture dating. Finally, k-means classifiers are used to infer ancestry using all genotyped SNPs [14].

Recently, LOTER [15], a software package that deconvolves local ancestry in multi-way admixtures for a wide range of species, was proposed. LOTER can account for phase errors in two-way admixture only. It facilitates local ancestry inference and its application in non-model species [15]. Unlike other methods, LOTER needs neither biological parameters, such as admixture time and recombination rate, nor statistical parameters, such as the number of hidden states and misfit probabilities, to deconvolve ancestry [15]. Although it uses the Li and Stephens copying model [33], as LAMPLD/LAMPHAP do, LOTER is a nonprobabilistic approach formulated as an optimization problem whose solution is obtained through dynamic programming.
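To illustrate how a copying model can be cast as an optimization problem solved by dynamic programming, the following sketch finds the path through a set of reference haplotypes that minimizes the number of allele mismatches plus a penalty for every switch between reference haplotypes. It is a generic sketch in that spirit and not LOTER's code: the function name, the 0/1 allele coding, and the penalty `lam` are assumptions made for this example.

```python
def copying_path(h, refs, labels, lam=1.0):
    """Return one ancestry label per SNP for the admixed haplotype h.

    h: list of 0/1 alleles; refs: list of reference haplotypes (same length as h);
    labels[r]: ancestry label of reference haplotype r;
    lam: assumed cost charged each time the copied reference haplotype changes.
    """
    n_ref, n_snp = len(refs), len(h)
    cost = [float(refs[r][0] != h[0]) for r in range(n_ref)]
    back = []  # back[j - 1][r] = best predecessor of reference r at SNP j
    for j in range(1, n_snp):
        best_prev = min(range(n_ref), key=cost.__getitem__)
        new_cost, ptr = [], []
        for r in range(n_ref):
            stay, switch = cost[r], cost[best_prev] + lam
            prev, base = (r, stay) if stay <= switch else (best_prev, switch)
            new_cost.append(base + (refs[r][j] != h[j]))
            ptr.append(prev)
        cost = new_cost
        back.append(ptr)
    # Trace the cheapest path back from the last SNP.
    r = min(range(n_ref), key=cost.__getitem__)
    path = [r]
    for ptr in reversed(back):
        r = ptr[r]
        path.append(r)
    path.reverse()
    return [labels[r] for r in path]


# Toy example: the admixed haplotype matches one reference, then the other.
calls = copying_path(
    h=[0, 0, 0, 1, 1, 1],
    refs=[[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]],
    labels=["POP1", "POP2"],
    lam=1.0,
)
```

Because switching to any other reference haplotype costs the same, only the single cheapest predecessor has to be considered at each SNP, so the recursion runs in time proportional to the number of SNPs times the number of reference haplotypes.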

Finally, the existing LD-based and non-LD-based local ancestry inference models are summarized in Table 1, extracted from Geza et al. [34].

#### 2.2 Models for dating admixture events in a genome

Several models are now available to determine the date of admixture events in a given admixed genome. Some models use haplotype breakpoints, while others focus on ancestry blocks. Models based on ancestry blocks are formulated using either empirical criteria or variants associated with a specific population. To determine the average length of the admixture blocks, these methods then assign ancestry on predefined windows using either wavelet transformation or conditional random fields [35]. On the other hand, some models rely on the rapid decrease in haplotype block sizes to estimate the date of the admixture event [36]. This suggests that, in general, models used for dating admixture events can be subdivided into two main classes [17, 18], namely those based on LD and those based on the haplotype distribution, as mentioned earlier.

#### Table 1.

Existing 20 ancestry deconvolution tools: ✓ indicates the ability of the software to perform a specified task, ✗ indicates the inapplicability of the task by a particular tool. Unless explicitly specified, LD refers to background LD.

Columns: Software; Multiway; Account LD; LD model; Biological/statistical parameters; Reference populations; Admixed populations; Year of publication.

Software: STRUCTURE V2\*, ANCESTRYMAP\*, ADMIXMAP\*, SABER, LAMP, HAPAA, SWITCH, GEDI-ADMX, WINPOP, HAPMIX, CHROMOPAINTER, LAMPLD, SUPPORTMIX\*, PCADMIX\*, mSPECTRUM, MULTIMIX, ALLOY, RFMIX, EILA, SEQMIX, ELAI, LOTER.


#### 2.2.1 LD-based models for dating admixture events

An admixture event is mainly characterized by the transfer of genes from the ancestral populations to the admixed ones. This leads to the appearance of linkage disequilibrium with regard to the ancestral populations. However, this LD usually decays over time, and the rate of decay is a function of recombination and the admixture proportion [35]. Conversely, many methods use this decay rate to calculate the time since the admixture event occurred.

In 2011, Moorjani et al. introduced a method to determine the weighted correlation for a pair of SNPs [36]. This correlation coefficient is further used to measure the LD with ancestral populations [37]. The time of admixture is then determined by analyzing the correlation with respect to the genetic distance and fitting the decay of the correlation using a least squares method [35]. This method was improved in 2011 by Loh et al. [18], mainly in terms of computation: Loh et al. instead employed a fast Fourier transform and other faster techniques to determine the optimal distance to the fitting curve. Another advantage of this method is that it considerably reduces biases in the estimation of the time of admixture [18, 36]. Later, Loh et al.'s method was improved by Pickrell et al. [38], who introduced the notion of a mixture of exponential decays in order to take into account the admixture events in the history of the given admixed population. It mainly focuses on the decay of the LD.

#### 2.2.1.1 Multiple weighted correlation coefficient

Let us consider three ancestral populations k<sub>1</sub>, k<sub>2</sub>, and k<sub>3</sub>, and Q the admixed population. Let us denote by ω<sub>1−2</sub>, ω<sub>1−3</sub>, and ω<sub>2−3</sub> three weighted linkage disequilibrium scores computed in the admixed population Q from all possible pairs of SNPs for the three pairs of ancestral populations k<sub>1</sub> − k<sub>2</sub>, k<sub>1</sub> − k<sub>3</sub>, and k<sub>2</sub> − k<sub>3</sub>, respectively, using the method proposed by Loh et al. According to Pickrell et al., the multiple weighted correlation coefficient is [38]

$$C\_{k\_1-k\_2,k\_1-k\_2,k\_2-k\_3} = \sqrt{\frac{\omega\_{2-3}^2 + \omega\_1^2 - 2\omega\_{2-3}\omega\_{1-2}\omega\_{1-3}}{1 - \omega\_{2-3}^2}}.\tag{6}$$

The date of admixture between populations k<sub>1</sub> and k<sub>3</sub> is

$$D\_{k\_1,k\_2,k\_3k\_1-k\_2k\_3} = \begin{cases} w\_0 + w\_1 e^{-n\_1\frac{\delta\_n}{100}}, & \text{for one admixture event} - D\_{(1)}, \\ w\_0 + w\_1 e^{-n\_1\frac{\delta\_n}{100}} + w\_2 e^{-n\_2\frac{\delta\_n}{100}}, & \text{in the case of two admixture events} - D\_{(2)}, \end{cases} \tag{7}$$

with n<sub>1</sub> and n<sub>2</sub> the numbers of generations; δ<sub>n</sub> the genetic distance; w<sub>1</sub> and w<sub>2</sub> the values of the multiple weighted LD; and w<sub>0</sub> the affine term. D<sub>(1)</sub> is the date of admixture of population Q in the case of admixture either between k<sub>1</sub> − k<sub>2</sub> or k<sub>2</sub> − k<sub>3</sub>. On the other hand, if it is assumed that two admixture events took place between k<sub>1</sub> − k<sub>3</sub> and either k<sub>1</sub> − k<sub>2</sub> or k<sub>2</sub> − k<sub>3</sub>, the date of admixture of the admixed population is given by D<sub>(2)</sub>.
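The decay curves of Eq. (7) can be evaluated and, for a single event, inverted with a few lines of code. The sketch below is a simplified illustration and not the fitting procedure of any published tool: it assumes that the genetic distance is expressed in centimorgans (hence the division by 100), that the affine term w<sub>0</sub> is known, and that the data are noise-free. The function names are hypothetical.

```python
import math


def weighted_ld_curve(d_cm, w0, w1, n1, w2=0.0, n2=0.0):
    """Evaluate Eq. (7): the one-event curve when w2 == 0, the two-event curve otherwise.

    d_cm is the genetic distance, assumed here to be in centimorgans.
    """
    return w0 + w1 * math.exp(-n1 * d_cm / 100.0) + w2 * math.exp(-n2 * d_cm / 100.0)


def fit_single_event(distances_cm, values, w0=0.0):
    """Log-linear least squares estimate of (n1, w1), with w0 assumed known.

    Uses log(value - w0) = log(w1) - n1 * d / 100, so every value must exceed w0.
    """
    xs = [d / 100.0 for d in distances_cm]
    ys = [math.log(v - w0) for v in values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope, math.exp(my - slope * mx)


# Simulate a noise-free one-event curve with illustrative parameters, then recover them.
distances = [0.5 * i for i in range(1, 41)]  # 0.5 to 20 cM
curve = [weighted_ld_curve(d, w0=0.01, w1=0.2, n1=12.0) for d in distances]
n1_hat, w1_hat = fit_single_event(distances, curve, w0=0.01)
```

With real data, w<sub>0</sub> is unknown and the scores are noisy, so a nonlinear least squares fit of all parameters would be needed; the log-linear shortcut only shows how the decay rate maps to the number of generations.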


#### 2.2.2 Haplotype distribution-based models for dating admixture events

Among the haplotype-based approaches is the likelihood method introduced in 2009 by Price et al. [4]. It determines the number of breakpoints using a hidden Markov model, and it can also determine the number of alleles at a particular site inherited from a given ancestor in a population. This is done in two steps. First, the method identifies haplotypes from the proxy ancestral populations; second, the origin of each haplotype block is identified by comparing its likelihood under one ancestral population versus the others. Considering an admixed genome, the likelihood of an observed allele is given by

$$H\_{uvw}(h) = \begin{cases} \theta\_u P(t\_{vw} = 0) + (1 - \theta\_u) P(t\_{vw} = 1), & \text{if } u = v, \\ \theta\_3 P(t\_{vw} = 0) + (1 - \theta\_3) P(t\_{vw} = 1), & \text{otherwise} \end{cases} \tag{8}$$

where θ<sub>u</sub>, u ∈ {1, 2, 3}, is the mutation parameter; h represents the haplotype site in the chromosomal offspring; t<sub>vw</sub> is an indicator function that takes the value 1 if individual w coming from offspring x has the same haplotype as the ancestral population v and 0 otherwise; and P is the probability of inheriting a pair of haplotypes [4]. The number of generations since admixture is given by

$$G = \frac{C}{4\gamma(1-\gamma)\zeta} \tag{9}$$

where ζ is the total Morgan length, γ the proportion of admixture, and C the observed number of breakpoints [4].
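Eq. (9) is a direct computation once the three quantities are available. The sketch below simply evaluates it; the function name and the input values are hypothetical and serve only as a worked example.

```python
def generations_since_admixture(n_breakpoints, admixture_proportion, morgan_length):
    """Evaluate Eq. (9): G = C / (4 * gamma * (1 - gamma) * zeta)."""
    gamma = admixture_proportion
    if not 0.0 < gamma < 1.0:
        raise ValueError("admixture proportion must lie strictly between 0 and 1")
    if morgan_length <= 0:
        raise ValueError("total Morgan length must be positive")
    return n_breakpoints / (4.0 * gamma * (1.0 - gamma) * morgan_length)


# Illustrative numbers: C = 210 breakpoints, gamma = 0.2, zeta = 35 Morgans.
G = generations_since_admixture(210, 0.2, 35.0)  # 210 / 22.4 = 9.375 generations
```

For a fixed number of breakpoints, the estimate is smallest at γ = 0.5, where the factor γ(1 − γ) is largest, and it grows as the admixture proportion becomes more unbalanced.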

On the other hand, Pugach et al. [17] employed the wavelet transform to design a haplotype block approach. The aim of this method is to derive the time of admixture of a given population using the simple hybrid isolation model. It proceeds in two main steps. First, it obtains a signal of admixture from the admixed data using the principal component technique. The second step consists in deriving the date of admixture using the signal obtained in the first step [17].

Pool and Nielsen also built a haplotype-based approach. It used precautionary ancestral populations to infer the date of admixture from the genome of an admixed population [39]. It assumed that, after a number of generations g, the lengths of the ancestral haplotypes follow an exponential distribution given by

$$f(\lambda, g) = g e^{-\lambda g} \tag{10}$$

where λ is the length of haplotypes. The mean of this distribution is known and is equal to 1/g.
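Since the mean of the distribution in Eq. (10) is 1/g, the number of generations can be estimated as the reciprocal of the mean haplotype length, which is also the maximum likelihood estimate of the rate of an exponential distribution. The sketch below assumes that lengths are measured in Morgans and ignores truncation of haplotypes at chromosome ends; the function names and example lengths are hypothetical.

```python
import math


def tract_length_density(length, g):
    """Density of Eq. (10): f(lambda, g) = g * exp(-lambda * g)."""
    return g * math.exp(-length * g)


def estimate_generations(tract_lengths_morgans):
    """Estimate g as the reciprocal of the mean haplotype length (in Morgans)."""
    if not tract_lengths_morgans:
        raise ValueError("at least one haplotype length is required")
    mean_length = sum(tract_lengths_morgans) / len(tract_lengths_morgans)
    return 1.0 / mean_length


# Four illustrative haplotype lengths with a mean of 0.10 Morgans.
g_hat = estimate_generations([0.05, 0.10, 0.15, 0.10])  # approximately 10 generations
```

Longer haplotypes therefore point to more recent admixture, because fewer generations of recombination have broken them down.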

Further methods include that of Gravel, developed in 2012 for the identification of multiple ancestral populations in a given admixture dataset [40]. Jin et al. [41] came up with a similar method to explain admixture dynamics. The method incorporates several models, including the gradual admixture (GA), hybrid isolation (HI), and continuous gene flow (CGF) models [41], which can be extended to GA-Isolation (GA-I) and CGF-Isolation (CGF-I) by considering isolation after admixture [42]. Hellenthal et al. [43], on the other hand, built on the work of Lawson et al. [44] to date admixture. This method considers the genome of an admixed individual to be a set of DNA chunks coming from other individuals. It proceeds in two stages. The first stage consists in dividing the genome into chunks and matching each of them to the proper ancestral individual; this stage is achieved with the help of a hidden Markov model. The second stage consists in identifying haplotypes and determining their respective ancestral populations [43, 44]. The admixture event and its date are then derived by fitting the decay of the ancestral haplotypes with an exponential distribution curve. Moreover, Ni et al. developed a method based on the observation that the date of admixture events is related to the model used. Their method uses the likelihood ratio test to identify the best model for inferring the date of admixture. Furthermore, they are able to estimate several admixture events with the given optimal model [35].



Finally, different existing models and tools for dating admixture events are summarized in Table 2 extracted from Chimusa et al. [35].

#### Table 2.

Existing dating admixture genomic tools.

| Tool | Category | Admixture model | Priori proxy ancestral raw data | Multiway events | Online link |
| --- | --- | --- | --- | --- | --- |
| ROLLOFF | LD-based model | HI | Yes | No | https://github.com/DReichLab/AdmixTools/ |
| ALDER | LD-based model | HI | Yes | No | http://cb.csail.mit.edu/cb/alder/ |
| MALDER | LD-based model | HI | Yes | Yes | https://github.com/joepickrell/malder/ |
| CAMer | LD-based model | HI, GA, CGF, GA-I, CGF-I | Yes | Yes | https://github.com/david940408/CAMer |
| iMAAPs | LD-based model | HI, GA, CGF, GA-I, CGF-I | Yes | Yes | http://www.picb.ac.cn/PGG/resource.php |
| StepPCO | Haplotype/ancestry block size distribution-based model | HI | Yes | Yes | https://bioinf.eva.mpg.de/download/StepPCO/ |
| adwave | Haplotype/ancestry block size distribution-based model | HI, Dual-admixture model | Yes | Yes | https://cran.r-project.org/web/packages/adwave/index.html |
| HAPMIX | Haplotype/ancestry block size distribution-based model | HI | Yes | Yes | http://genetics.med.harvard.edu/reichlab/Reich_Lab/Software.html/ |
| MultiWaveInfer | Haplotype/ancestry block size distribution-based model | HI | Yes | Yes | https://github.com/xyang619/MultiWaveInfer/ or http://www.picb.ac.cn/PGG/resource.php |
| GLOBETROTTER | Haplotype/ancestry block size distribution-based model | HI, GA, CGF | No | Yes | https://github.com/maarjalepamets/human-admixture/ |
| Tracts | Haplotype/ancestry block size distribution-based model | HI, GA, CGF | No | Yes | https://github.com/sgravel/tracts/ |
| Ancestry_HMM | Haplotype/ancestry block size distribution-based model | HI | No | No | https://github.com/russcd/ |

### 3. Challenges and perspectives

#### 3.1 Case of local ancestry inference models

Although several models exist to deconvolve local ancestry, most studies that evaluate such models show that deviations in local ancestry estimates persist in multi-way admixtures. Deviations in local ancestry also result from genetic drift, miscalled ancestry, and genotyping errors. However, the signals from these factors affect the whole genome, whereas the signal of unmodelled natural selection affects particular regions. For example, Chen et al. used four local ancestry inference models to scan for disease-related loci through admixture mapping and showed that, although all four are LD based and divide the genome into windows of contiguous SNPs, the MULTIMIX and LAMPLD estimates differed in almost 20% of the analyzed SNPs. This results from differences in the biological and statistical parameters the models require and in the mathematical approaches they use. Another association study, by Chimusa et al. [45], evaluated two local ancestry models, WINPOP and LAMPLD, one of which is LD based and the other not, and also pointed out that admixture mapping is still limited by inaccuracies in multi-way local ancestry deconvolution.
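To make the kind of disagreement reported between MULTIMIX and LAMPLD concrete, the following is a minimal Python sketch of how a per-SNP discordance rate between two tools could be computed. The function name `discordance_rate`, the 0/1/2 encoding of ancestry copies, and the toy calls are assumptions made for illustration; they are not taken from Chen et al. or from the output format of either tool.

```python
from typing import Sequence


def discordance_rate(calls_a: Sequence[int], calls_b: Sequence[int]) -> float:
    """Fraction of SNPs at which two tools assign different local ancestry.

    Each element is the (hypothetical) number of allele copies, 0, 1 or 2,
    assigned to one reference ancestry at that SNP.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("both tools must be run on the same SNP set")
    if not calls_a:
        raise ValueError("no SNPs to compare")
    mismatches = sum(a != b for a, b in zip(calls_a, calls_b))
    return mismatches / len(calls_a)


# Toy, made-up calls for ten SNPs from two tools (not real data).
tool_a = [2, 2, 1, 1, 1, 0, 0, 0, 1, 2]
tool_b = [2, 2, 1, 1, 0, 0, 0, 0, 1, 1]

rate = discordance_rate(tool_a, tool_b)
print(f"discordant SNPs: {rate:.0%}")
```

In practice, such a rate would be computed per individual and per chromosome, and regions of systematic disagreement would deserve closer inspection before admixture mapping.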

Inaccuracies in local ancestry estimates may result from the statistical or biological parameters used in the estimation process, which are not always accurate when provided. They could also be due to the dependence of models on reference panels, which are small for some populations and not sampled at all for others, as is the case for Native Americans. Moreover, the history of many other admixed populations is not well known. When applied to ancient admixtures, existing methods may yield spurious estimates because they were designed for recent admixtures. Existing methods do not account for natural selection; hence, some deviations exist in regions that are under selection [45]. Also, most of them are benchmarked for three-way admixtures.

Since each model was introduced to address a particular challenge faced by earlier models, no model or tool can be expected to achieve the best performance in all admixture scenarios without trading estimate accuracy for computational speed. According to the studies reviewed by Geza et al. [34], more than 50% of the studies that either introduced a model or evaluated methods for association mapping showed that LAMPLD/LAMPHAP outperforms most LD-based methods. The only LD-based method that outperformed LAMPLD is ELAI; however, this is the only study that assessed ELAI against other models. In cases where LD-based models were compared to non-LD-based models, RFMIX outperformed LAMPLD in three cases highlighted in [34], and RFMIX also performed best in a separate study aiming to determine the place of admixture of an admixed population. This could be because RFMIX can deconvolve ancestry in closely related populations [46]. However, in a recent comparison of RFMIX and LOTER, LOTER performed better in ancient admixtures [15].

Generally, each model is implemented as a separate local ancestry deconvolution tool, existing as individual scripts that require unique inputs and produce unique outputs. This is a challenge for researchers with a limited computational background. There is thus a lack of a unified framework that takes standard, easy-to-manipulate input files and returns results in a form that is easy to process for further applications. In conclusion, to support informed decisions on models and algorithms, existing models and tools should be assessed within a unified framework. This would allow them to be tested on different admixture scenarios and would incorporate most state-of-the-art LD-based and non-LD-based models.
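As an illustration of what such a unified framework might look like, the sketch below defines one shared input type, one shared output type, and a common interface that each tool wrapper would implement. All class and function names (`AncestryInput`, `AncestryOutput`, `LocalAncestryTool`, `NearestPanelTool`, `run_all`) are hypothetical, and `NearestPanelTool` is a toy stand-in rather than any published method.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AncestryInput:
    """Hypothetical standard input shared by every wrapped tool."""
    admixed_haplotypes: List[List[int]]           # haplotypes x SNPs, alleles 0/1
    reference_panels: Dict[str, List[List[int]]]  # ancestry name -> haplotypes
    positions: List[int]                          # SNP positions in base pairs


@dataclass
class AncestryOutput:
    """Hypothetical standard output: one ancestry label per haplotype and SNP."""
    labels: List[List[str]]


class LocalAncestryTool(ABC):
    """Common interface; a real wrapper would convert to and from a tool's own files."""
    name: str

    @abstractmethod
    def infer(self, data: AncestryInput) -> AncestryOutput:
        ...


class NearestPanelTool(LocalAncestryTool):
    """Toy stand-in (not a published method): at each SNP, pick the panel
    whose allele frequency is closest to the observed allele."""
    name = "nearest-panel (toy)"

    def infer(self, data: AncestryInput) -> AncestryOutput:
        # Per-SNP allele frequency in each reference panel.
        freqs = {
            ancestry: [sum(column) / len(column) for column in zip(*panel)]
            for ancestry, panel in data.reference_panels.items()
        }
        labels = [
            [min(freqs, key=lambda anc: abs(freqs[anc][j] - allele))
             for j, allele in enumerate(haplotype)]
            for haplotype in data.admixed_haplotypes
        ]
        return AncestryOutput(labels)


def run_all(tools: List[LocalAncestryTool], data: AncestryInput) -> Dict[str, AncestryOutput]:
    """Run every tool on the same input so that outputs are directly comparable."""
    return {tool.name: tool.infer(data) for tool in tools}


# Made-up example data: two reference panels and two admixed haplotypes.
data = AncestryInput(
    admixed_haplotypes=[[0, 0, 0], [1, 1, 1]],
    reference_panels={"A": [[0, 0, 1], [0, 0, 1]], "B": [[1, 1, 0], [1, 1, 0]]},
    positions=[100, 200, 300],
)
results = run_all([NearestPanelTool()], data)
for tool_name, output in results.items():
    print(tool_name, output.labels)
```

The point of the design is that benchmarking code only depends on `AncestryInput` and `AncestryOutput`, so adding another model would mean writing one more wrapper rather than another bespoke pipeline.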

#### 3.2 Case of the dating admixture models

The evolution of human populations and the history of their mixture have been deciphered using statistical and computational methods. These methods have been found to perform well when dealing with a single-point admixture event in two-way admixed populations [35]. However, like any method, they have pitfalls as well as advantages when estimating admixture dates in some cases. Fitting the existing models for dating admixture events to real admixed populations (in contexts of more than three-way admixture) is challenging for several reasons, including their reliance on optimal local ancestry estimates and accurate ancestry breakpoints. This suggests that there is still a need to design an integrative or new model for dating admixture events in current multi-way admixed populations, in order to further advance our understanding of human demographics and movement and to facilitate admixture mapping and estimation of the age of a disease locus contributing to disease risk.
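The single-pulse, two-way setting in which these methods perform well can be illustrated with a small Python sketch of an exponential decay fit. It assumes a signal of the form A·exp(−T·d), where d is the genetic distance in Morgans and T is the number of generations since admixture, and recovers T by least squares on the logarithm of the signal. The function name `fit_exponential_decay` and the synthetic, noise-free curve are assumptions for illustration; this is not the procedure of any specific tool in Table 2, and it does not handle multiple admixture pulses.

```python
import math
from typing import List, Tuple


def fit_exponential_decay(distances: List[float], signal: List[float]) -> Tuple[float, float]:
    """Fit signal ~ A * exp(-T * d) by least squares on log(signal).

    Returns (A, T). Assumes every signal value is positive and that
    distances are genetic distances in Morgans, so T is in generations.
    """
    logs = [math.log(y) for y in signal]
    n = len(distances)
    mean_d = sum(distances) / n
    mean_l = sum(logs) / n
    sxx = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((d - mean_d) * (l - mean_l) for d, l in zip(distances, logs))
    slope = sxy / sxx                      # equals -T under the model
    intercept = mean_l - slope * mean_d    # equals log(A) under the model
    return math.exp(intercept), -slope


# Synthetic, noise-free curve for a hypothetical single pulse 10 generations ago.
true_T, true_A = 10.0, 0.05
dist = [0.005 * k for k in range(1, 41)]   # 0.5 cM to 20 cM, expressed in Morgans
curve = [true_A * math.exp(-true_T * d) for d in dist]

A_hat, T_hat = fit_exponential_decay(dist, curve)
print(f"estimated admixture date: {T_hat:.2f} generations")
```

With real data the signal is noisy and, under several admixture events, becomes a mixture of exponentials, which is where a simple single-rate fit of this kind would be expected to break down.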

In addition, it has been found that the mixture exponential decay model overestimates the date of older admixture events [35] and is suggested to detect at most three admixture events. As mentioned earlier, Ni et al. [47] dealt with the optimization of the method used for dating admixture. They took several models into account, but their technique is not effective for estimating ancient and multiple admixture events [35, 47]. Several practical considerations can further limit these approaches. One is the use of proxy ancestral populations in the estimation, which can bias the results, for instance when dealing with a low sample size or inappropriate proxy ancestral populations [35]. Another is the requirement for accurate LD patterns, ancestry haplotype distributions, and a large sample of the admixed population. Thus, there is a need for an adequate model for inferring different dates of admixture events and matching real admixture history using proxy ancestry-based methods [35].

#### 4. Conclusions

Currently, more than 20 models are implemented as software to deconvolve local ancestry, and 12 tools exist for dating admixture events. In this chapter, we discussed in detail and summarized the most commonly used models, their assumptions, the statistical and biological parameters they require, and existing challenges. This discussion highlights the need for designing more effective models, which account for current challenges and produce more accurate and biologically relevant estimates. Furthermore, it provides useful information for the implementation of practical tools, which consider current medical and population genetic demands. More importantly, this may guide users in the choice of appropriate tools for specific applications and can assist software developers in designing more advanced tools for local ancestry deconvolution and dating admixture events.

#### Acknowledgements

Some of the authors are supported in part by the National Institutes of Health (NIH) Common Fund [grant numbers U24HG006941 (H3ABioNet) and 1U01HG007459–01 (SADaCC)]. One of the authors is fully funded by the Organization for Women in Science for the Developing World (OWSD) and Swedish International Development Cooperation Agency (Sida). The content of this publication is solely the responsibility of the authors and does not necessarily represent the official views of the funders.

#### Conflict of interest

The authors declare that they have no competing interest.

#### Author details

Gaston K. Mazandu<sup>1,2,3</sup>\*, Ephifania Geza<sup>1,3</sup>, Milaine Seuneu<sup>1,2</sup> and Emile R. Chimusa<sup>2</sup>

1 African Institute for Mathematical Sciences (AIMS), Cape Town, South Africa

2 Division of Human Genetics, Department of Pathology, Faculty of Health Sciences, Institute of Infectious Disease and Molecular Medicine, University of Cape Town (UCT), Cape Town, South Africa

3 Computational Biology Division, Department of Integrative Biomedical Sciences, University of Cape Town, South Africa

\*Address all correspondence to: kuzamunu@aims.ac.za

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
