**5. How can population genetics inform malaria vaccine development?**

Population genetics is the study of allele frequency distributions and the changes that occur in them in response to the four major forces of evolution: natural selection, genetic drift, mutation and gene flow [216]. All of the current malaria vaccine candidates are potential targets of positive balancing selection due to pressure from human immune responses. Balancing selection maintains alleles at low to medium frequencies, and therefore no single allele is likely to provide broad protection [217]. Population genetic analyses can provide insight into the extent and distribution of alleles and have been important in highlighting antigens as targets of natural immunity [218, 219].

Scores of population genetic surveys have been conducted on malaria vaccine candidates, including isolates from countries in every major malaria-endemic region of the world. However, the data have generally not been used in the formulation of candidate malaria vaccines. Vaccine developers have instead included alleles isolated from well-characterised reference strains: 3D7 (and its parent NF54, origin unknown), FVO (Vietnam) or FC27 (Papua New Guinea). Yet in the substantial sequence data available from many countries for several leading malaria vaccine antigens, these alleles are either completely absent or found at low frequencies among naturally circulating parasites (Figure 1, [8]).

A recent metapopulation genetic analysis has summarised the known diversity of ten leading malaria vaccine candidates [8]. After compiling all available published population data on malaria vaccine candidates, either currently in vaccine trials or in preclinical development, a database of almost 5000 sequences was used to investigate the range and distribution of diversity. Only non-synonymous polymorphisms were investigated, because synonymous polymorphisms do not change the amino acid sequence of the protein and are therefore unlikely to be antigenically relevant. Table 1 summarises the results observed for the ten antigens analysed in this study, as well as other *P. falciparum* and *P. vivax* antigens that are leading malaria vaccine candidates.

The data presented in Table 1 demonstrate that the majority of current malaria vaccine candidate antigens have many distinct haplotypes, which emphasises the problem that diversity poses for developing a broadly effective malaria vaccine [35]. Moreover, without knowledge of this natural diversity it will be difficult to assess vaccine trials to the full extent. In addition, for genes encoding merozoite antigens, including AMA1, EBA175 and MSPs 1-4, the full breadth of diversity was present in all populations with no evidence of geographic population structure, suggesting that they are under strong immune selection. In contrast, genes encoding the non-merozoite antigens, including CSP, TRAP, LSA1 and Pfs48/45, showed variable levels of diversity, which were related to transmission levels, and there was evidence of geographic population structure [8]. The consequence of this contrasting distribution of diversity for malaria vaccine design is that a diversity-covering vaccine may be possible for merozoite antigens, but for the non-merozoite antigens identifying common alleles across all populations will be difficult. Nevertheless, diversity in the non-merozoite antigens does not appear to be primarily structured by immune selection, and therefore may not be as immunologically relevant as that for the merozoite antigens.



Table 1. Summary of population genetic data for leading malaria vaccine candidates

a. Total includes both natural populations and other isolates, range includes only natural populations; b. Total (range) includes only natural populations; c. Individual domains also analysed, data not shown; n.r., result not available.


Using Population Genetics to Guide Malaria Vaccine Design 247


### **5.1 Sampling diversity**

An important consideration when investigating the genetic diversity of a vaccine candidate is the origin and number of samples required to obtain reliable allele frequency estimates. In early studies, only a handful of parasite isolates from diverse geographic origins were used to investigate diversity; more recently, a number of investigations of larger numbers of locally circulating field isolates, which can represent natural parasite populations, have been completed (reviewed in [8], Table 1). While a geographically disparate sampling approach can provide insights into levels of polymorphism and immune selection, and can allow the extent of diversity to be predicted, it cannot provide reliable information on allele frequencies. Sampling of local field isolates is therefore more appropriate if the data will be used for prioritising common alleles for vaccine development. However, it is critical that sample sizes of at least 30-50 isolates be used to obtain a reliable estimate of diversity and values approaching natural allele frequencies. Once defined, natural allele frequencies can provide an indication of the minimum proportion of the parasite population that would be covered by a particular vaccine formulation. Further analysis, discussed below, can identify relationships amongst alleles and therefore the potential for cross-reactivity between distinct alleles.
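The link between sample size, allele frequency estimates and vaccine coverage can be illustrated with a minimal Python sketch. The allele labels and counts are invented for illustration, and the Wilson score interval is one common choice for expressing uncertainty in a proportion; it is an assumption of this example rather than a method taken from the studies cited above.

```python
import math
from collections import Counter

def allele_frequencies(alleles):
    """Return {allele: frequency} for a list of allele labels, one per isolate."""
    counts = Counter(alleles)
    n = len(alleles)
    return {allele: count / n for allele, count in counts.items()}

def wilson_interval(count, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion count/n."""
    p = count / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def coverage(freqs, vaccine_alleles):
    """Summed frequency of the alleles included in a vaccine formulation."""
    return sum(freqs.get(allele, 0.0) for allele in vaccine_alleles)

# Hypothetical sample of 40 isolates typed at one locus (labels are invented).
sample = ["A"] * 18 + ["B"] * 12 + ["C"] * 6 + ["D"] * 4
freqs = allele_frequencies(sample)
low, high = wilson_interval(18, len(sample))

print(f"Allele A: {freqs['A']:.2f} (interval {low:.2f} to {high:.2f})")
print(f"Coverage of a two-allele formulation (A + B): {coverage(freqs, ['A', 'B']):.2f}")
```

With only 40 isolates the interval around the most common allele is wide, which is consistent with the recommendation to treat 30-50 isolates as a minimum rather than a target.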

### **5.2 Defining the extent of diversity**

In population genetics, the extent of diversity at a defined locus is measured using a number of different statistics. These range from statistics that are simple to estimate, such as the number of alleles or haplotypes, to more complex statistics such as allelic richness, which is normalised for sample size [228] and is therefore useful for comparisons among populations when sample sizes vary considerably. Other statistics include the nucleotide diversity (π), which is the average number of nucleotide differences per site between pairs of sequences in a sample; the average number of pairwise differences; and the expected heterozygosity, all of which can be easily calculated with any of the many population genetic software packages available (reviewed in [229]). The most informative statistics for vaccine design include the numbers of alleles or haplotypes that need to be considered in developing a broadly efficacious malaria vaccine. For example, in the data in Table 1 some antigens have larger numbers of haplotypes than others: CSP has 71 haplotypes while MSP3 has only 21, even though the two datasets were similar in size, showing that CSP is the more diverse antigen [8].
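As a rough illustration of how these statistics are obtained, the following Python sketch computes the number of haplotypes, the nucleotide diversity and the expected heterozygosity for a set of aligned sequences. The sequences are invented, π is taken here as the mean number of pairwise differences per site, and the sample-size correction n/(n-1) for heterozygosity is an assumption of this example; dedicated population genetic software should be preferred for real analyses.

```python
from collections import Counter
from itertools import combinations

def haplotype_counts(seqs):
    """Count how many times each distinct haplotype occurs."""
    return Counter(seqs)

def nucleotide_diversity(seqs):
    """Mean number of pairwise differences per site for aligned sequences."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / (len(pairs) * length)

def expected_heterozygosity(seqs):
    """Gene diversity: n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(seqs)
    freqs = [count / n for count in Counter(seqs).values()]
    return n / (n - 1) * (1 - sum(f * f for f in freqs))

# Four invented, aligned sequences of six sites each.
seqs = ["ACGTAC", "ACGTAC", "ACCTAC", "ATGTAA"]

n_haplotypes = len(haplotype_counts(seqs))
pi = nucleotide_diversity(seqs)
he = expected_heterozygosity(seqs)
print(n_haplotypes, round(pi, 3), round(he, 3))
```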

### **5.3 Defining the distribution of diversity**

#### **5.3.1 Allele and haplotype frequencies**

Knowledge of the distribution of alleles and haplotypes (variants) is critical both for vaccine design and for monitoring the effects of vaccine trials. In trials for vaccine candidate antigens with allele-specific immunity, a high frequency variant has a higher likelihood of resulting in a protective effect than a low frequency variant (if the vaccine construct covers polymorphic regions). Furthermore, the identification of geographically distinct population structures is an indication that variants may be present at different frequencies. In this situation, it is possible that a vaccine based on an allele that is common in one population but rare in another may have differential effects across the two populations. Similarly, if a vaccine is based on a variant that is found at low frequencies at a testing site, the positive effects of variant-specific immunity may be dampened by infection with parasites carrying non-vaccine alleles. The vaccine may then be interpreted as being non-protective, unless variant-specific end-points are included by genotyping post-vaccine infections. Another important consideration in this scenario is that the statistical power to measure variant-specific efficacy will be limited with only a small number of infections carrying the vaccine allele.

The importance of allele/haplotype frequencies in vaccine design has been demonstrated by a study that measured MSP1₁₉ diversity at a vaccine-testing site in Mali. High-throughput genotyping of six common MSP1₁₉ polymorphisms in more than 2000 isolates showed that there were two highly prevalent haplotypes (FVO, 46% and FUP, 36%), whereas the majority of the haplotypes were relatively rare (<10%). The common haplotypes remained common over long periods of time (>44% and 34%, respectively), with fluctuations that could be explained by frequency-dependent selection [113]. The vaccine haplotype, 3D7, was the third most common, being found at a frequency of 16% (14-18%) throughout the study period. The authors concluded that the previous MSP1₁₉ vaccine trial [114] probably failed because few of the circulating parasites harboured 3D7 MSP1₁₉ haplotypes. Studies have investigated changes in vaccine antigen haplotype frequencies over time under natural conditions [93, 230, 231], but relatively few have done so during vaccine trials [98, 115]. More studies are needed to monitor fluctuations that occur in parasite populations over time, especially under the influence of natural immune or vaccine-mediated selection.
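The arithmetic behind this argument can be sketched in a few lines of Python. The example assumes, as a deliberate simplification, that protection is strictly variant-specific with no cross-protection, so that the efficacy measured against all infections is the variant-specific efficacy scaled by the frequency of the vaccine haplotype. The 16% frequency is the value reported above for the 3D7 haplotype in Mali; the variant-specific efficacy of 50% and the 200 genotyped infections are hypothetical values chosen only for illustration.

```python
def diluted_efficacy(allele_freq, variant_specific_efficacy):
    """Efficacy against all infections if protection is strictly variant-specific."""
    return allele_freq * variant_specific_efficacy

def expected_vaccine_type_infections(n_infections, allele_freq):
    """Expected number of infections carrying the vaccine haplotype."""
    return n_infections * allele_freq

freq_3d7 = 0.16             # 3D7 haplotype frequency reported for the Mali site
assumed_specific_ve = 0.5   # hypothetical efficacy against vaccine-type parasites
n_genotyped = 200           # hypothetical number of genotyped infections

overall = diluted_efficacy(freq_3d7, assumed_specific_ve)
n_vaccine_type = expected_vaccine_type_infections(n_genotyped, freq_3d7)
print(f"Overall efficacy under these assumptions: {overall:.0%}")
print(f"Expected vaccine-type infections: {n_vaccine_type:.0f} of {n_genotyped}")
```

Under these assumptions a vaccine with substantial variant-specific efficacy would show only a small overall effect, and only a small subset of infections would be available for estimating variant-specific efficacy, which is the loss of statistical power described in this section.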

#### **5.3.2 Clustering patterns**

Cluster analyses have been used to identify population substructure within a given sample to understand the underlying population biology of *P. falciparum* [232]. This analysis has also been adapted to understand relationships among haplotypes of several malaria vaccine candidates [8, 91, 92, 226]. A study of 150 AMA1 sequences by Xin Zhuan Su and colleagues demonstrated that they clustered into six distinct subgroups. Some evidence was also presented that sera from rabbits immunized with AMA1 variants from one cluster tended to inhibit invasion of parasite isolates carrying sequences from the same cluster, but were less active against those from other clusters, suggesting that different clusters contain immunologically distinct sequences [91]. Later analyses, including one of 506 AMA1 domain I sequences, have suggested up to 16 clusters [92]. Although the full utility of clustering patterns is yet to be confirmed, they may be used as a guide to select representative variants to cover diversity as well as to predict the effects of a vaccine trial within a defined geographic area. This is particularly important because different malaria antigens show different clustering patterns ([8], Figure 2). If clusters represent immunologically distinct subgroups, the patterns observed in Figure 2 suggest that vaccination with CSP would be significantly more effective in Africa and parts of the Americas, whereas for MSP1 the effects would be similar among many populations. Different polymorphisms have varying immunological significance [85, 97], and a better understanding of the relationship between polymorphisms and antigenic diversity will help advance the development of clustering algorithms.


### **5.5 Credentialing polymorphisms**

It is clear that some antigens such as AMA1 and CSP have many polymorphic sites, while others such as EBA175 and MSP4 remain relatively conserved [8]. However, this probably reflects functional constraints or the degree of immune exposure rather than their capacity to be effective vaccine candidates. The challenge is how to determine which polymorphisms will be critical to vaccine design. Investigators have used different approaches to identify antigenically relevant polymorphisms, including three-dimensional structural modelling, immunological assays and mutational analysis [91, 97]. However, population genetic data, which may be easier to collect, can be highly informative.

When choosing which polymorphisms to consider for vaccine design it is important to define the allele frequencies for each polymorphic site. If the minor allele frequency (MAF)

Fig. 2. Cluster analysis of sequences for (A) the pre-erythrocytic antigen, CSP and (B) the merozoite antigen, MSP1₁₉. Non-synonymous haplotypes were submitted to cluster analysis using STRUCTURE software [233]. Each bar represents the mean membership coefficient for the parasite population of each country and colours represent the mean membership to each of the clusters. Vaccine alleles include CSP: 3D7; MSP1₁₉: 3D7, FVO or FC27. Figure adapted from [8].

#### **5.3.3 Networks**

Network analysis originated as a mathematical tool to understand social relationships and has been used to study the transmission of infectious diseases [234]. In population genetics, it has been adapted to explore relationships among sequences by linking haplotypes that are identical at a predefined proportion of polymorphic sites [8, 235]. As each haplotype (node) may have multiple connections (edges), this analysis has the potential to define not only distinct clusters or subgroups of highly related sequences but also the relationships among them. Furthermore, it can identify the location of less frequently observed admixed haplotypes in the network, which may represent novel recombinants. Barry *et al*. [8] have explored the distribution of haplotypes using network analysis and found that the network was concordant with clustering patterns for ten leading malaria vaccine antigens (Figure 3 shows the results for AMA1 as an example). In addition, they demonstrated that haplotypes that were identified as "admixed" in the cluster analysis (i.e. <75% of sequences assigned into any one cluster) often formed connections between one or more lobes of the network, suggesting that these represent novel recombinants resulting from exchange between sequences from the linked clusters. Such recombinants might allow evasion of naturally acquired or vaccine-mediated immune responses if vaccine formulations were composed only of haplotypes from distinct subgroups or clusters. Network analysis will allow shifts in the proportion of haplotypes within each cluster, and of admixed/recombinant haplotypes, to be monitored during vaccine trials.

Fig. 3. Network of AMA1 (domain I) sequences. Each node (circle) represents a haplotype, shaded in colour to highlight cluster-membership or white for admixed haplotypes (as defined by the STRUCTURE analysis discussed above [233]). Nodes are tied by edges (black lines) demonstrating that they share a predefined threshold of 48 nsSNPs. Admixed haplotypes originating from isolates with unknown origin are shaded in white (unless they were vaccine haplotypes). Adapted from [8].
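The threshold-linking idea behind such networks can be sketched in plain Python. This is a minimal illustration, not the procedure used in [8]: the haplotypes are invented strings over six polymorphic positions, the threshold of four shared positions is arbitrary, and the function names are hypothetical. The haplotype labelled `R` is constructed to mimic a recombinant that shares half of its positions with each of two otherwise unconnected groups.

```python
from itertools import combinations

def shared_sites(h1, h2):
    """Number of aligned positions at which two haplotypes are identical."""
    return sum(a == b for a, b in zip(h1, h2))

def build_network(haplotypes, min_shared):
    """Link every pair of haplotypes identical at min_shared or more positions."""
    edges = {name: set() for name in haplotypes}
    for (n1, h1), (n2, h2) in combinations(haplotypes.items(), 2):
        if shared_sites(h1, h2) >= min_shared:
            edges[n1].add(n2)
            edges[n2].add(n1)
    return edges

def connected_components(edges):
    """Group nodes that are reachable from one another through edges."""
    seen, components = set(), []
    for start in edges:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(edges[node] - group)
        seen |= group
        components.append(group)
    return components

# Invented haplotypes over six polymorphic amino acid positions.
haps = {
    "H1": "KDENRQ", "H2": "KDENRH", "H3": "KDENGH",  # first group
    "H4": "TGQSGH", "H5": "TGQSLP",                  # second group
    "R": "KDQSGH",                                   # recombinant-like bridge
    "H6": "MMMMMM",                                  # unrelated haplotype
}
net = build_network(haps, min_shared=4)
print(sorted(net["R"]))
print(len(connected_components(net)))
```

In this toy network the two groups are joined only through `R`, which is the pattern described above for admixed haplotypes that connect separate lobes.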

### **5.4 Immune selection**

248 Malaria Parasites


Population genetic analyses can identify signatures of balancing selection in loci that are targeted by natural immune responses and therefore allow vaccine candidates to be ranked [66, 218, 219, 236]. While data from geographically diverse isolates can be useful, the ability to identify balancing selection is strengthened by allele frequency data from natural parasite populations [237]. Comparative studies investigating polymorphism and allele frequencies by deep population sampling of several novel vaccine candidates have been carried out [218, 219, 224, 237, 238]. These have demonstrated the relative levels of balancing selection, and therefore whether particular candidates are stronger immune targets than others, but the results may also partially reflect the tolerance of a particular antigen to high levels of mutation.

Balancing selection can be measured using a variety of different statistics. The *Hudson Kreitman Aguade ratio* (HKAr), which compares the level of diversity within a species with the level of divergence between species [239], has been shown to be most informative for assessing selection in any dataset, including small numbers of isolates from diverse geographic locations. However, it relies on the availability of sequences from the most closely related species for which genome data are available, namely *Plasmodium reichenowi*. Additional *P. reichenowi* isolates have been collected by researchers and may also be used to obtain further sequence data to increase the reliability of this statistic [240]. Another statistic that has proven to be reliable for predicting balancing selection is Tajima's D, which identifies departures from neutrality by measuring the number of polymorphic sites (S) in relation to the nucleotide diversity (π) [241]. Tajima's D requires allele frequency data and therefore can only be effectively used on population samples, with large sample sizes (>50 isolates) being the most informative [218].
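To make the relationship between S and π concrete, the following Python sketch computes Tajima's D using the commonly used formulation of the statistic, in which the mean number of pairwise differences is compared with S scaled by a sample-size constant. The two sets of sequences are invented and far smaller than the sample sizes recommended above: the first has alleles at intermediate frequencies, as expected under balancing selection, and the second has only rare variants. The function name and the decision to return `None` when there are no polymorphic sites are choices made for this example.

```python
import math
from itertools import combinations

def tajimas_d(seqs):
    """Tajima's D for aligned sequences; None if there are no polymorphic sites."""
    n = len(seqs)
    # S: number of polymorphic (segregating) sites in the alignment
    s = sum(len(set(column)) > 1 for column in zip(*seqs))
    if s == 0:
        return None
    # k: mean number of pairwise differences (not divided by sequence length)
    pairs = list(combinations(seqs, 2))
    k = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Constants that depend only on the sample size n
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (k - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

# Two alleles at equal frequency: an excess of intermediate-frequency variants.
balanced = ["AAAA"] * 4 + ["TTTT"] * 4
# Each variant seen in a single sequence: an excess of rare variants.
rare = ["AAAA"] * 4 + ["TAAA", "ATAA", "AATA", "AAAT"]

d_balanced = tajimas_d(balanced)
d_rare = tajimas_d(rare)
print(round(d_balanced, 2), round(d_rare, 2))
```

The first sample gives a positive value and the second a negative value, matching the usual interpretation that positive Tajima's D is consistent with balancing selection, although demographic processes can produce similar departures from neutrality.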
