**4. Large Sequence Polymorphisms**

Large sequence polymorphisms (LSPs) include both insertions and deletions (indels) and have been identified as one of the main sources of genomic variability in *M. tuberculosis*. The effects of LSPs vary and may provide insights into the biology of *M. tuberculosis* strains. Large deletions have been shown to group closely related strains and have been associated with phylogeographical lineages, suggesting that a given deletion event is specific to a particular lineage (Tsolaki *et al.*, 2004). Some LSPs occur rarely in the population and could have arisen from random genomic events and then become associated with a particular phylogenetic lineage (Alland *et al.*, 2007). In contrast, other LSPs are present in multiple strains from different lineages as a result of selective pressure and are not necessarily associated with particular groups (Alland *et al.*, 2007).

Soon after the genome sequence of the laboratory strain H37Rv was completed (Cole *et al.*, 1998), the clinical isolate CDC1551, which had caused an outbreak in the United States, was sequenced (Fleischmann *et al.*, 2002). A whole genome comparative study of these two genomic sequences identified 1,075 SNPs and 86 LSPs larger than 10 bp. The analysis of these LSPs using a panel of 169 clinical isolates showed that clinical strains were genetically more variable than expected from a clonal bacterial population (Fleischmann *et al.*, 2002).

The continued advances in methods for high-throughput nucleic acid sequencing now allow more rapid generation of sequence data and thus access to information from a growing number of sequenced clinical *M. tuberculosis* genomes. To date, there are more than 200 ongoing sequencing projects of *M. tuberculosis* strains with different characteristics, such as strains with epidemic potential and strains characterized by multidrug resistance, as well as isolates obtained before and after passage through an immunocompetent animal model, among others (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). This information, together with new bioinformatics algorithms, will be an invaluable resource for probing these genomes in an effort to further understand the evolution, epidemiology, emergence of drug resistance and phenotypic variability associated with tuberculosis disease.

#### **4.1 Comparative genomics to assess variability**

With the growing number of sequenced strains becoming available, the comparison of complete genomes from different clinical isolates of *M. tuberculosis* becomes an attractive and powerful tool to explore genotypic similarities and differences. This approach can also provide important insights regarding the genotype-phenotype relationship in *M. tuberculosis*, and can therefore contribute to the development of control measures for tuberculosis. In this respect, we carried out whole genome comparisons of six fully sequenced *M. tuberculosis* strains (four clinical isolates and two laboratory strains), which showed high synteny, as expected, and no large rearrangements (Cubillos-Ruiz *et al.*, 2008), except for a large inversion seen in the KZN strain that could be due to sequencing errors and incomplete data (Figure 2). Most of the 1,428 LSPs identified were indels involving 120 genes that affected primarily 1) mobile genetic elements such as insertion sequences and prophages, 2) non-coding regions, and 3) the PE/PPE family of genes. The LSPs identified in this work differed among strains, were distributed along the entire genome and were used to identify strain-specific insertions and deletions. When fitted to an exponential decay function, these data indicated a tendency towards accumulation of more deletions than insertions, consistent with the notion of genome decay in *M. tuberculosis* (Cubillos-Ruiz *et al.*, 2008). One other remarkable finding was that laboratory strains contained fewer strain-specific polymorphisms than the clinical isolates, suggesting that the selective pressure imposed by the human immune system could be driving variability. The existence of strain-specific polymorphisms also opened the possibility that specific indels could be associated with particular lineages and thus could also be used as markers for strain typing and surveillance.

Taking into account the growing evidence of the phylogeographical origin of *M. tuberculosis* (Gagneux *et al.*, 2006, Wirth *et al.*, 2008), we speculated that the strain-specific polymorphisms could be common to strains of a particular lineage rather than being an exclusive property of one particular isolate. To test this hypothesis, we evaluated strain-specific indels and previously identified SNPs associated with strains of the Haarlem lineage using a large panel of well-characterized *M. tuberculosis* strains (Olano *et al.*, 2008, Cubillos-Ruiz *et al.*, 2008). Six large deletions, two specific IS*6110* insertions and two SNPs were significantly associated with the Haarlem family and thus proposed as genomic signatures of this lineage (Cubillos-Ruiz *et al.*, 2010). These results were completely congruent with spoligotyping and with RFLP data, as well as with the new assignment of a URAL family instead of the Haarlem 4 sublineage (Abadia *et al.*, 2010). One particularly interesting result was the identification of deletions that affected previously proposed drug targets. These include the gene Rv1354c, which encodes a diguanylate cyclase (DGC) enzyme involved in regulating the levels of c-di-GMP, a bacterial second messenger implicated in survival and adaptation to different environmental conditions (Gupta *et al.*, 2010), and the gene Rv2275, which codes for a cytochrome P450, Cyp121 (McLean & Munro, 2008). Both of these genes were deleted in the Haarlem strains analyzed, indicating that they would not be adequate targets for antimicrobials. Although this particular study was limited to Haarlem strains, it raises the possibility that other lineage-specific genomic differences might impact treatment and control. More studies will be needed to address this issue and to verify the presence of specific gene targets.

Genomic Variability of *Mycobacterium tuberculosis* 45

Fig. 2. Whole genome alignment generated with the MAUVE software, showing synteny and, in the case of strain KZN, the presence of a genomic inversion.

**5. Insertions**

The presence of insertion elements in different bacteria has been well appreciated for some time, especially because of the impact they can have on the host genome (Siguier *et al.*, 2006). Insertion elements can not only reshape the genome but can also cause mutations and alter gene expression. In the case of pathogens such as *M. tuberculosis*, the presence of insertion elements can generate genotypic variation and mediate changes that can affect gene function. This variability can therefore alter properties such as strain fitness and transmissibility and even play a role in the evolution of *M. tuberculosis*.

The insertion elements present in the *M. tuberculosis* genome were described in detail upon completion of the *M. tuberculosis* H37Rv whole genome sequence (Gordon *et al.*, 1999). *M. tuberculosis* harbors four main insertion elements, IS*6110*, IS*1081*, IS*1547* and the IS-like element, all of them present in multiple copies. The best studied of these is the 1.36 kb IS*6110*, originally described by Thierry *et al.* (1990), which belongs to the group of IS*3* elements and is characterized by having two partially overlapping open reading frames that allow production of a transposase by translational frameshifting (McEvoy *et al.*, 2007). It also has 28 bp imperfect terminal inverted repeats and generates 3 to 4 bp direct repeats upon insertion (McAdam *et al.*, 1990, Thierry *et al.*, 1990, Mendiola *et al.*, 1992). The IS*6110* element is present exclusively in strains of the MTBC that can harbor
