**5. Insertions**

The presence of insertion elements in different bacteria has been well appreciated for some time, especially because of the impact they can have on the host genome (Siguier *et al.*, 2006). Insertion elements can not only reshape the genome but also cause mutations and alter gene expression. In pathogens such as *M. tuberculosis*, insertion elements can generate genotypic variation and mediate changes that affect gene function. This variability can therefore alter properties such as strain fitness and transmissibility, and may even play a role in the evolution of *M. tuberculosis*.

The insertion elements present in the *M. tuberculosis* genome were described in detail upon completion of the *M. tuberculosis* H37Rv whole genome sequence (Gordon *et al.*, 1999). *M. tuberculosis* harbors four main insertion elements, IS*6110*, IS*1081*, IS*1547* and the IS-like element, all of them present in multiple copies. The best studied of these is the 1.36 kb IS*6110* (Thierry *et al.*, 1990), which belongs to the IS*3* family and is characterized by two partially overlapping open reading frames that allow production of a transposase by translational frameshifting (McEvoy *et al.*, 2007). It also has 28 bp imperfect terminal inverted repeats and generates 3 to 4 bp direct repeats upon insertion (McAdam *et al.*, 1990, Thierry *et al.*, 1990, Mendiola *et al.*, 1992). The IS*6110* element is present exclusively in strains of the MTBC, which can harbor

Genomic Variability of *Mycobacterium tuberculosis* 47

from zero to 25 copies per genome (Brosch *et al.*, 2000). For this reason, and because of its high degree of copy number and insertion site variation, IS*6110*-RFLP has been widely used for epidemiological purposes and is considered the "gold standard" for studying the transmission dynamics of *M. tuberculosis* (van Embden *et al.*, 1993, Small *et al.*, 1994, Safi *et al.*, 1997). The discriminatory power of the IS*6110*-RFLP method depends on there being sufficient variation and copy number to differentiate between unlinked isolates while still allowing related specimens to be identified (Wall *et al.*, 1999). Thus IS*6110*-RFLP is used to distinguish between epidemiological events, but its use as a marker for strain evolution is still under debate (McEvoy *et al.*, 2007).

The consequences of IS*6110* transposition differ depending on the position of integration, with phenotypic outcomes ranging from lethality to the bacterial host, due to gene inactivation, to possible benefits. Four types of mutational event can be generated by IS*6110* transposition: 1) integration in intragenic regions; 2) alteration of IS*6110* flanking regions; 3) recombination/gene deletion; and 4) alteration of IS*6110* promoter activity (McEvoy *et al.*, 2007). Intragenic insertions interrupt open reading frames and can inactivate genes; this is the most frequently described event in certain clinical isolates. For these interruptions to be observed, they must occur in genes that are dispensable for survival of the bacterium or redundant in function. These insertion events can also alter immune recognition or virulence properties, as has been suggested for insertions in members of the PPE gene family or in the phospholipase C gene region (McEvoy *et al.*, 2009a, Vera-Cabrera *et al.*, 2001). It has also been observed that the regions flanking an IS*6110* insertion contain additional mutations, suggesting that this element can have a disruptive, mutagenic effect on the DNA around the site of insertion (Warren *et al.*, 2000).

Insertion elements can mediate deletions, as has also been shown for *M. tuberculosis*, where gene deletion can occur by homologous recombination between two flanking copies of IS*6110*. For example, clinical *M. tuberculosis* strains with a deletion of the *plcA* gene display a decreased capacity to cause pulmonary cavitation, showing the phenotypic effects of transposition in a clinical setting (Kato-Maeda *et al.*, 2001a). Not all the insertions described in clinical isolates have deleterious or silent effects on the mycobacterial cell; some studies have reported that IS*6110* can up-regulate expression of downstream genes from an outward-directed promoter at its 3' end, conferring selective advantages. In particular, an insertion found within the *phoP* promoter region in an MDR *M. bovis* strain, which had produced outbreaks in the United States and Spain (Rivero *et al.*, 2001), was shown to increase *phoP* expression 10-fold in *M. smegmatis* and was proposed to be responsible for the high transmissibility of the original *M. bovis* isolate (Soto *et al.*, 2004).

Given the high variability of IS*6110* elements in the genomes of MTBC strains and the possible consequences of insertion for strain phenotype, there has been interest in identifying the precise insertion locations in *M. tuberculosis* clinical isolates. Different methodologies, based on PCR, sequencing and cloning, have suggested that the IS*6110* element inserts preferentially into non-coding regions (Otal *et al.*, 2008, Warren *et al.*, 2000, Thorne *et al.*, 2011, Kim *et al.*, 2010, McEvoy *et al.*, 2009b, Wall *et al.*, 1999). This can be explained by the fact that insertions in genes essential for strain growth, maintenance and pathogen integrity would be harmful to the cell and thus not maintained in the population. Preferential insertion loci, or hotspots, have also been identified, some of which include the phospholipase C region (Vera-Cabrera *et al.*, 2001), members of the PPE gene family (McEvoy *et al.*, 2009a), the *dnaA*-*dnaN* intergenic region (Turcios *et al.*, 2009), the RD724 gene (Kim *et al.*, 2010) and the IS*1547* element (Fang *et al.*, 1999). PPE genes are considered to encode important antigens during the host-pathogen interaction and have been proposed to play a role in evasion of the immune response (Sampson, 2011). Thus variability in the PPE genes generated through IS*6110* transposition could help strains evade the immune system during infection and confer an advantage. In contrast to these hotspot regions, some loci where integration is rare or not observed have also been identified; these represent sites where *in vivo* transposition events can be harmful to strain fitness and growth (Yesilkaya *et al.*, 2005).

We recently developed a novel high-throughput method using next-generation sequencing to identify the flanking regions of the IS*6110* insertion element in over 500 *M. tuberculosis* isolates, mainly from Latin America and Europe. In this study we identified previously reported insertion hotspots as well as novel sites (Table 1) (Reyes *et al.*, submitted).

| Locus | Gene ID | Description | # strains | # independent sites | Hotspot (H) / Ancestral (A) |
|---|---|---|---|---|---|
| Rv0403c | mmpS1 | Probable conserved membrane protein | 225 | 6 | A\*, H |
| Rv0835:Rv0836c | lpqQ:Rv0836c | | 227 | 5 | A |
| Rv1754c | - | Conserved hypothetical protein | 394 | 4 | A\* |
| Rv2336 | - | Hypothetical protein | 188 | 2 | A\* |
| Rv2814c-Rv2815c | - | IS6110, transposase | 485 | 2 | A\* |
| Rv3113 | - | Possible phosphatase | 218 | 2 | A |
| MT3426:MT3427 (RvD5) | MT3426:moaA-3 | | 195 | 1 | A |

Table 1. Hotspot insertion sites in *M. tuberculosis* identified by high-throughput sequencing (Reyes *et al.*, submitted). Hotspot (H) or Ancestral (A) insertions for a given lineage; A\* indicates an ancestral insertion at that locus in more than one lineage.
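
The "# strains" and "# independent sites" columns of Table 1 can be read as simple tallies over mapped insertion flanks. The following Python sketch shows one way such a summary could be computed. It is an illustration only and not the authors' pipeline: the record layout `(strain_id, locus, position)`, the function name `summarize_insertions`, the example coordinates, and the assumption that an "independent site" is a distinct insertion coordinate within a locus are all hypothetical.

```python
from collections import defaultdict


def summarize_insertions(records):
    """Tally IS6110 insertions per locus.

    records: iterable of (strain_id, locus, position) tuples, one per
    mapped insertion flank (hypothetical layout, assumed for this sketch).
    Returns {locus: (number of strains, number of distinct positions)}.
    """
    strains_by_locus = defaultdict(set)
    sites_by_locus = defaultdict(set)
    for strain_id, locus, position in records:
        strains_by_locus[locus].add(strain_id)
        # Assumption: each distinct coordinate counts as one independent site.
        sites_by_locus[locus].add(position)
    return {
        locus: (len(strains_by_locus[locus]), len(sites_by_locus[locus]))
        for locus in strains_by_locus
    }


# Made-up example records; the coordinates are not real data.
example = [
    ("s1", "Rv0403c", 100),
    ("s2", "Rv0403c", 100),
    ("s3", "Rv0403c", 250),
    ("s1", "Rv2336", 40),
    ("s4", "Rv2336", 40),
]
summary = summarize_insertions(example)
```

Under these assumptions, a locus carrying the same insertion coordinate in many strains would be consistent with a single shared (possibly ancestral) event, whereas several distinct coordinates within one locus would point towards repeated, independent insertions, as expected for a hotspot.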

The copy number of IS*6110* elements in the genomes of circulating *M. tuberculosis* strains can be highly variable and is ultimately limited by the deleterious effects of IS*6110* transposition (McEvoy *et al.*, 2007). Although most *M. tuberculosis* isolates have multiple copies of the IS*6110* element, the presence of a copy in the DR region of the MTBC strains suggests that this could be an ancestral insertion site. It has also been observed that some successful *M. tuberculosis* strains tend to have a high copy number of the IS*6110* element and that this might correlate with phenotypic properties (Alonso *et al.*, 2011). In a recent report, a Beijing family strain considered to have a high transmissibility rate was found to have 19 copies of the IS*6110* element, four of which were shown to up-regulate downstream gene expression. One of these was in the gene Rv2179c, which is normally expressed inside macrophages, suggesting that this gene could influence the infectious process and that the strain's high degree of transmissibility could be due to the up-regulation caused by the IS*6110* insertion (Alonso *et al.*, 2011). However, some clinical strains and MTBC members with a low number of copies of IS*6110* are also epidemiologically successful. In general, though, there is still insufficient information regarding the factors that influence the frequency of transposition, such as the genomic context of the insertion element within a particular strain background. The variation in the number of IS*6110* elements among *M. tuberculosis* isolates also raises the possibility that copy number is the result of the evolution of particular lineages as strains cope with IS*6110* transposition and its resulting genetic variability, in some cases even selecting for phenotypically favorable events, while maintaining genome integrity and avoiding deleterious effects.

in some cases has been associated with multidrug resistance (Hanekom *et al.*, 2011). Its capacity to spread within a population is evident from epidemiological studies and emphasizes the possibility that certain strain properties could contribute to this lineage's expansion in the population (Nicol & Wilkinson, 2008). The increased virulence of these isolates was associated with the production of a phenolic glycolipid (PGL) that affects the host immune response and is absent in many other *M. tuberculosis* families (Ordway *et al.*, 2007, Hanekom *et al.*, 2011). More recent work suggests that although PGL can contribute to *M. tuberculosis* virulence, it probably requires additional bacterial factors (Sinsimer *et al.*, 2008). Other examples stem from studies of strains that have caused outbreaks, such as strains CDC1551 and HN878, the latter also a member of the Beijing family. In these and other studied cases, it appears that some of the effects observed have to do with the capacity of these strains to induce variable inflammatory responses (Coscolla & Gagneux, 2010). Despite these studies, many of the clinical outcomes associated with strain variability still need to be further examined, particularly in other *M. tuberculosis* lineages, before precise genotypic variability can be associated with phenotypic differences.

The emergence and spread of drug-resistant strains is particularly disturbing and provides additional examples of how strain variability can have a profound effect on disease outcome. One particularly alarming case was the epidemic caused by an XDR strain in the KwaZulu-Natal region of South Africa, which resulted in high mortality, causing the death of 52 of the 53 patients co-infected with HIV in the course of 16 days (Gandhi *et al.*, 2006). To understand more about the dynamics of the appearance and dispersion of this highly virulent KZN strain, whole genome sequence analysis was carried out for XDR, MDR and drug-sensitive KZN strains. The results indicated that the outbreak was most probably due to clonal expansion of a single strain and that a particular strain genetic background did not necessarily contribute to acquisition of antibiotic resistance (Ioerger *et al.*, 2009). Further work will be needed to better understand this strain's virulence and transmissibility in the community. Part of the success of *M. tuberculosis* as a human pathogen is due to its capacity to be efficiently transmitted between hosts and to persist for long periods of time despite the host's immune response. A recent study involving whole genome sequencing of 21 strains from the six main *M. tuberculosis* lineages indicated that human T cell epitopes had very little sequence variation and were highly conserved relative to the rest of the genome. It was suggested that these antigens, contrary to expectations, might be under purifying selection and that *M. tuberculosis* might benefit from host immune recognition (Comas *et al.*, 2010). This differs from the classical view of immune evasion driven by the selective pressure of the immune response and may indicate that new approaches should be considered for vaccine development and control of *M. tuberculosis*.

The genetic variability evident in strains of the MTBC is relevant to the control of tuberculosis, since treatment must work against all circulating strains. Rapid and accessible diagnostics for both *M. tuberculosis* and drug-resistant isolates are still required, as is a vaccine that can be universally effective, given the variable efficacy of the currently used BCG vaccine. There are now more than 10 vaccines in phase I trials, and the hope is that in the near future at least one of these will prove to be safe and protective by containing *M. tuberculosis* and preventing reactivation. However, future strategies will need to address the need to prevent or eradicate latent infections, especially in view of additional factors affecting disease and the host immune response, such as co-infection with HIV
