**3. SNPs in** *M. tuberculosis*

Genetic diversity within bacterial species is usually generated by mutations and by the exchange of genetic material. Horizontal gene transfer (HGT) is thought to be an important driver of bacterial evolution in both pathogenic and non-pathogenic bacteria (Becq *et al.*, 2007). Horizontally transferred genes can be acquired in clusters known as genomic islands or pathogenicity islands, which can be identified by characteristics that distinguish them from the host genome, such as GC content, flanking nucleotide repeats and insertion elements. In the case of *M. tuberculosis*, there is evidence of ancient gene transfer events that could have taken place in a pool of progenitor tubercle bacilli before the clonal expansion that gave rise to the MTBC (Gutierrez *et al.*, 2005). One of these events involved the Rv0986-8 virulence operon (Rosas-Magallanes *et al.*, 2006), which could have originated from genetic exchange between an environmental bacillus ancestor and other bacterial species (Nicol & Wilkinson, 2008). In the absence of recent HGT events, modern *M. tuberculosis* lineages evolve essentially by mutations that alter their genomes, resulting in SNPs and in LSPs such as deletions and insertions, the latter mainly mediated by transposition of the IS*6110* insertion element.
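The compositional signal mentioned above (a GC content that departs from that of the host genome) can be illustrated with a minimal sliding-window sketch. This is only an illustration under stated assumptions: the function names, window size, step and deviation threshold are hypothetical choices made here, not values taken from the cited studies. Genomic island detection in practice combines several lines of evidence, such as flanking repeats and insertion elements, rather than GC content alone.

```python
from typing import List, Tuple


def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_gc_windows(
    genome: str,
    window: int = 1000,          # assumed window length in bp (illustrative)
    step: int = 500,             # assumed step between window starts (illustrative)
    max_deviation: float = 0.10  # assumed threshold on |window GC - genome GC|
) -> List[Tuple[int, float]]:
    """Return (start, gc) for each window whose GC fraction differs from
    the genome-wide GC fraction by more than max_deviation."""
    genome_gc = gc_fraction(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_fraction(genome[start:start + window])
        if abs(gc - genome_gc) > max_deviation:
            flagged.append((start, gc))
    return flagged
```

Called on a toy sequence in which a low-GC segment is embedded in a high-GC background, `atypical_gc_windows` returns the windows that overlap the embedded segment. Windows flagged this way are candidates for closer inspection only, since native genes can also show an atypical composition.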

Although allelic variation in MTBC organisms is quite restricted compared with other pathogenic bacteria (Sreevatsan *et al.*, 1997), there is growing recognition of substantial genetic diversity among isolates. At the SNP level, changes can be either synonymous (sSNP) or non-synonymous (nsSNP), and this diversity has proved useful for typing and for defining evolutionary relationships among strains. SNPs provide many advantages for the analysis of phylogenetic relationships among microorganisms, especially among closely related clonal organisms such as the MTBC. Initial descriptions of the *M. tuberculosis* population structure involved analysis of SNPs in the *katG* and *gyrA* genes and defined three major genetic groups (Sreevatsan *et al.*, 1997). Later surveys extended this strategy to include more than 100 sSNPs identified in 112 *M. tuberculosis* isolates (Gutacker *et al.*, 2002). In more recent work using 159 sSNPs identified by whole-genome comparison of sequenced strains, it was possible to classify 212 isolates into 56 haplotypes that grouped strains into six *M. tuberculosis* SNP Cluster Groups (SCG) and one SCG that grouped all the *M. bovis* strains (Filliol *et al.*, 2006). A re-evaluation of the SNP phylogeny was obtained by *de novo* sequencing of 89 randomly distributed genes in 108 global strains (Comas *et al.*, 2009). This study suggested that initial classification could be done using a subset of discriminatory SNPs and that, if further molecular characterization were needed, MIRU-VNTR typing could then be applied to differentiate individual strains. However, the choice of discriminatory SNPs is not an easy task. For example, SNP comparison in 32 fully sequenced strains that caused an outbreak in a community in Canada allowed the identification of two co-circulating "lineages" with the same MIRU-VNTR profile (Gardy *et al.*, 2011), which would not have been evident if only the previously used discriminatory SNPs had been included. The study allowed transmission to be tracked and demonstrated the power of coupling comparative genomics with social epidemiological studies.

Genetic variation at the SNP level can also have profound implications for strain fitness and disease outcome. One such case is the Esx protein family, which has been implicated in host-pathogen interactions. To survey genetic diversity in the Esx family, and its potential for antigenic variation, all *esx* genes were sequenced from 108 clinical isolates of *M. tuberculosis* belonging to different clades. The SNP distribution affecting Esx proteins indicated high genetic variability, with a total of 109 unique SNPs, 59 of which were non-synonymous. Some of the resulting amino acid substitutions affected known Esx epitopes and are likely to result in immune variation, thus revealing a dynamic *esx* gene family (Vasilyeva *et al.*, 2009).

Another important area of research focuses on variability associated with specific phenotypes of clinical importance, such as antibiotic resistance. In *M. tuberculosis*, resistance to antibiotics results essentially from mutations, such as SNPs, that can be acquired during treatment and can spread within the population. The mutations conferring antibiotic resistance can have a variable effect on strain fitness, and bacteria can develop compensatory mechanisms to recover fitness (Borrell & Gagneux, 2011). Isoniazid (INH) resistance in *M. tuberculosis* is associated with mutations in the genes *katG*, *inhA* and *ahpC*. Most identified mutations map to *katG*, which encodes the catalase-peroxidase required to activate INH (Ramaswamy & Musser, 1998) and to protect *M. tuberculosis* from oxidative free radicals in the macrophage. Thus, INH-resistant *M. tuberculosis* strains that lose KatG activity are less virulent (Pym *et al.*, 2002). However, the *katG* S315T mutation, the most common mutation for INH resistance (Sandgren *et al.*, 2009), results in reduced INH activation while maintaining KatG activity and virulence in mice (Pym *et al.*, 2002), suggesting compensatory evolution, as has been proposed in other bacteria (Maisnier-Patin & Andersson, 2004). If compensatory evolution occurs in MDR and XDR strains, it will have a deep impact on the control of tuberculosis (Borrell & Gagneux, 2011), an area that must be investigated further.

The identification of SNPs associated with resistance has also indicated the existence of multiple genetic determinants of resistance, not all of which have been fully identified. Streptomycin (Sm) resistance, for example, is associated in the majority of cases with mutations in *rpsL* and *rrs* (Sreevatsan *et al.*, 1997). However, 27% of Sm-resistant strains lack mutations in these genes. There is evidence that in some cases mutations in *gidB*, a gene coding for a 7-methylguanosine (m7G) methyltransferase specific for 16S rRNA, are associated with low-level Sm resistance (Donoghue, 2011). However, some susceptible strains also contain such mutations, so sequence analysis of more *M. tuberculosis* clinical isolates is required to better understand the role of *gidB* mutations in Sm resistance.

A longstanding question in tuberculosis has been the precise mechanisms by which mycobacteria acquire resistance mutations, especially during latent infection. The mutation rates that confer antibiotic resistance have been determined *in vitro*, yet the slow growth and different metabolic states of *M. tuberculosis* during infection make it difficult to assess the *in vivo* rates. This was achieved in a recent report, however, using whole-genome sequencing and identification of SNPs generated during different disease states in macaque monkeys (Ford *et al.*, 2011). Similar mutation rates were observed during latency and during active disease, and these were also consistent with *in vitro* rates. Based on these results and on the types of SNPs observed, it was suggested that *M. tuberculosis* can acquire mutations during latency and that these mutations are the result of oxidative DNA damage rather than errors in replication. This could be explained by increased oxidative damage during latency, as a result of the immune response, or by diminished DNA repair in metabolically quiescent bacilli (Ford *et al.*, 2011).

The identification of SNPs in *M. tuberculosis* has provided important insight into the genetic variability and evolution of this pathogen. SNPs can also affect strain fitness, as is evident from the acquisition of antibiotic resistance markers. It remains to be seen whether many of the identified SNPs have an effect on the biology of *M. tuberculosis* and on the host-pathogen interaction. Whole-genome sequencing will undoubtedly allow more extensive SNP identification and analysis on a genome-wide scale. As more sequence data become available, comparative genomics studies may help to identify markers that contribute to our understanding of the molecular mechanisms underlying phenotypes such as drug resistance and persistence.

**4. Large Sequence Polymorphisms**

LSPs can include both insertions and deletions (indels) and have been identified as one of the main sources of genomic variability in *M. tuberculosis*. The effect of LSPs can vary and may provide insights into the biology of *M. tuberculosis* strains. Large deletions have been shown to group closely related strains and have been associated with phylogeographical lineages, suggesting that a deletion event is specific to a particular lineage (Tsolaki *et al.*, 2004). Some LSPs occur rarely in the population and could have arisen from random genomic events and then become associated with a particular phylogenetic lineage (Alland *et al.*, 2007). In contrast, other LSPs are present in multiple strains from different lineages, as a result of selective pressure, and are not necessarily associated with particular groups (Alland *et al.*, 2007).

Soon after the genome sequence of the laboratory strain H37Rv was completed (Cole *et al.*, 1998), the clinical isolate CDC1551, which had caused an outbreak in the United States, was sequenced (Fleischmann *et al.*, 2002). A whole-genome comparative study carried out using these two genomic sequences identified 1,075 SNPs and 86 LSPs larger than 10 bp. Analysis of these LSPs in a panel of 169 clinical isolates showed that clinical strains were genetically more variable than expected for a clonal bacterial population (Fleischmann *et al.*, 2002).

Continued advances in methods for high-throughput nucleic acid sequencing now allow more rapid generation of sequence data and thus access to information from a growing number of sequenced clinical *M. tuberculosis* genomes. To date, there are more than 200 ongoing sequencing projects of *M. tuberculosis* strains with different characteristics, such as strains with epidemic potential and strains characterized by multidrug resistance, as well as isolates obtained before and after passage through an immunocompetent animal model, among others (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). This information,
