184 Next Generation Sequencing - Advances, Applications and Challenges

**vi.** Reconstruction of quasispecies is carried out using the sliding window approach by calculating maximal coverage and read diversity, which reduces the false positives, *i.e.*, *in-silico* recombinants.

#### *4.1.3. QuasiRecomb*

Principle: It employs jumping Hidden Markov Model (HMM)-based probabilistic statistics for inference of viral quasispecies, especially for estimating the intra-patient viral haplotype distribution [75]. This method assumes that the true genetic diversity is generated by a few sequences (called generators) through mutation and recombination, and that the observed diversity results from additional sequencing errors.

Algorithm steps:

**i.** Distribution of haplotypes in a given population is modelled to account for either point mutation or recombination in the form of probability tables and jumping HMM states respectively.

**ii.** Expectation maximization algorithm is used to estimate posterior probabilities associated with rare events of mutation and recombination.

**4.2. Methods to study viral population genetics**

The genetic structure of a population refers to the number of distinct subpopulations it contains, each identified by a characteristic set of allele frequencies [76]. A model-based population analysis can be performed on genomic data using the STRUCTURE program [77]. The program can infer the genetic structure in haploid, diploid and polyploid species [78].

#### *4.2.1. STRUCTURE program*

Principle: This method is based on a Bayesian clustering approach and employs a Markov Chain Monte Carlo (MCMC) algorithm to identify genetically distinct subpopulations based on allele frequencies. It assigns individuals to subpopulations based on likelihood estimates. In the case of haploids, the program assumes that the loci are in linkage equilibrium or only weakly linked [78]. The program accounts for recombination by incorporating ancestry models such as the admixture and linkage models. An admixed strain is assigned membership scores for two or more subpopulations, indicating its mixed ancestry. The linkage model is an extension of the admixture model that accounts for the weak linkage arising from admixture linkage disequilibrium (LD). Therefore, the extent of linkage equilibrium among the markers needs to be tested before using the STRUCTURE program. The relevant linkage analysis (LIAN) programs and measures are discussed in Section 4.3.

Input genotype data: A wide range of markers, such as multi-locus genotype data, microsatellites and SNPs, can be used as input. In the case of viruses, the polymorphic sites, or more specifically the parsimony-informative (PI) sites, obtained from a genome-based alignment are suitable markers for population genetic analyses. A PI site contains at least two types of nucleotide bases, at least two of which occur with a minimum frequency of two. The position of each PI site corresponds to a locus. At every locus, each of the four bases (A, T, G and C) and the gap is considered an allele.

Algorithm steps:
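Before any clustering step, the input described under "Input genotype data" has to be prepared. The following is a minimal sketch of that preparation only, not of the STRUCTURE algorithm itself: it picks parsimony-informative (PI) columns from an alignment and encodes each strain as a row of integer alleles. The function names, the integer codes, the toy alignment and the use of -9 for unrecognised characters are all assumptions made for illustration; gaps are not counted when deciding whether a column is PI, but are encoded as a fifth allele at the selected sites.

```python
from collections import Counter

# Hypothetical integer codes for the five alleles (four bases plus the gap).
ALLELE_CODES = {"A": 1, "C": 2, "G": 3, "T": 4, "-": 5}


def parsimony_informative_sites(alignment):
    """Return 0-based column indices of parsimony-informative (PI) sites.

    A column qualifies when at least two different bases each occur in
    at least two sequences. Gaps and ambiguity codes are ignored here.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    sites = []
    for col in range(length):
        counts = Counter(seq[col].upper() for seq in alignment)
        frequent_bases = [base for base in "ACGT" if counts[base] >= 2]
        if len(frequent_bases) >= 2:
            sites.append(col)
    return sites


def encode_genotypes(alignment, sites, missing=-9):
    """Encode each strain as one row of integer alleles, one column per PI site.

    Characters without a code (e.g. ambiguity symbols) become `missing`.
    """
    return [
        [ALLELE_CODES.get(seq[col].upper(), missing) for col in sites]
        for seq in alignment
    ]


# Toy alignment of four hypothetical strains.
toy_alignment = [
    "ACGTA-GT",
    "ACGTA-GA",
    "ATGCACGA",
    "ATGCACGT",
]
pi_sites = parsimony_informative_sites(toy_alignment)
genotypes = encode_genotypes(toy_alignment, pi_sites)
print(pi_sites)
print(genotypes)
```

Each selected column becomes one locus, so the number of columns in `genotypes` equals the number of PI sites.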


Salient features of the STRUCTURE program:


Limitations:


Case studies:

The ability of the admixture model to account for recombination has been used to analyse the extent of recombination and its role in determining the population structure of viruses such as *Hepatitis B virus* [81] and *Rhinoviruses* [82].

A population genomic study of *Hepatitis B virus* (HBV) was carried out using both the admixture and linkage models (with burn-in of 20,000 and burn-length of 40,000). HBV is an enveloped DNA virus that belongs to the genus *Orthohepadnavirus* of the family *Hepadnaviridae*. It is known to consist of eight genotypes, designated A–H, each of which has a characteristic geographic distribution. The analysis resolved the hierarchical nature of population subdivision, revealing four major clusters (*FST* = 0.497, *p* < 0.0001) and eight sub-clusters. The extent of recombination was observed to be low [81].
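The fixation index *FST* quoted above measures how much of the total genetic diversity is explained by differences between subpopulations. As a rough illustration, the sketch below computes a simple Nei-style value, (H_T - H_S) / H_T, from haploid genotypes. The choice of estimator, the function names and the toy data are assumptions; the cited study [81] may have used a different estimator and significance test.

```python
from collections import Counter


def gene_diversity(alleles):
    """Gene diversity 1 - sum(p_i^2) for the alleles observed at one locus."""
    n = len(alleles)
    return 1.0 - sum((count / n) ** 2 for count in Counter(alleles).values())


def fst(subpopulations):
    """Nei-style F_ST = (H_T - H_S) / H_T, with H_T and H_S summed over loci.

    `subpopulations` is a list of groups; each group is a list of haploid
    genotypes, and every genotype is an equal-length sequence of alleles.
    """
    n_loci = len(subpopulations[0][0])
    total = sum(len(pop) for pop in subpopulations)
    h_s_sum = 0.0
    h_t_sum = 0.0
    for locus in range(n_loci):
        per_pop = [[genotype[locus] for genotype in pop] for pop in subpopulations]
        pooled = [allele for alleles in per_pop for allele in alleles]
        # Within-subpopulation diversity, weighted by subpopulation size.
        h_s_sum += sum(len(a) / total * gene_diversity(a) for a in per_pop)
        # Diversity of the pooled population.
        h_t_sum += gene_diversity(pooled)
    if h_t_sum == 0:
        raise ValueError("no polymorphic loci: F_ST is undefined")
    return (h_t_sum - h_s_sum) / h_t_sum


# Two toy subpopulations fixed for different alleles at the first locus.
differentiated = [["AT", "AT", "AT", "AT"], ["GT", "GT", "GT", "GT"]]
# Two toy subpopulations with identical allele frequencies.
undifferentiated = [["A", "G"], ["A", "G"]]
print(fst(differentiated))
print(fst(undifferentiated))
```

In this toy setting, complete differentiation gives a value of 1 and identical allele frequencies give 0; values such as 0.497 lie between these extremes.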

*Rhinoviruses* are highly diverse members of the genus *Enterovirus* and family *Picornaviridae*. They are ss (+) RNA viruses with a genome of ~7,200 bases. There are three species, *viz.* *Rhinovirus A*, *-B* and *-C*, each of which is further subdivided into distinct serotypes. The STRUCTURE-based analysis revealed strong evidence for the existence of seven genetically distinct subpopulations (with *FST* = 0.45, *p* = 0). *Rhinovirus A* and *Rhinovirus C* were subdivided into four and two subpopulations respectively, whereas the *Rhinovirus B* species remained undivided. Furthermore, use of both the admixture and the linkage models (with burn-in of 20,000 and burn-length of 40,000) helped to resolve the role of recombination in the diversification of subpopulations. In the case of *Rhinovirus A*, intra-species recombination was common, whereas in the case of *Rhinovirus C*, both intra- and inter-species recombination were observed to contribute to diversity [82].
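Under the admixture model, recombination shows up as strains with appreciable membership in more than one subpopulation. The sketch below shows one way such strains might be flagged from a table of membership scores. The strain names, the scores and the 0.2 cut-off are hypothetical values chosen for illustration; they are not taken from the cited studies or from the STRUCTURE program.

```python
def classify_strains(membership, threshold=0.2):
    """Flag strains whose membership score reaches `threshold` in 2+ clusters.

    `membership` maps a strain name to its list of membership scores,
    one per subpopulation; each list is expected to sum to 1.
    The default threshold is a hypothetical cut-off, not a STRUCTURE setting.
    """
    result = {}
    for strain, scores in membership.items():
        if abs(sum(scores) - 1.0) > 1e-6:
            raise ValueError(f"membership scores for {strain} do not sum to 1")
        # Cluster numbers are reported starting from 1.
        clusters = [k for k, q in enumerate(scores, start=1) if q >= threshold]
        result[strain] = {"clusters": clusters, "admixed": len(clusters) >= 2}
    return result


# Hypothetical membership scores for two strains and three subpopulations.
toy_membership = {
    "strain_1": [0.98, 0.01, 0.01],
    "strain_2": [0.55, 0.40, 0.05],
}
labels = classify_strains(toy_membership)
for strain, info in labels.items():
    print(strain, info)
```

Here `strain_1` belongs to a single subpopulation, while `strain_2` would be treated as admixed between the first two.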
