#### **4.1.2 Where IS***6110* **can be inserted**

Identifying the sites of insertion, and relating them to the phenotype of the corresponding strain, could provide insight into the biological role of the targeted genes. The mapping of those sites showed that this element can interrupt coding regions as well as sit in non-coding sequences (Fang et al., 1999a). The interruption of a coding region can be seen as a kind of natural knock-out mutation of the target gene. Insertion of the element in a non-coding region, on the other hand, would have secondary consequences, such as increasing or decreasing the expression of the neighbouring genes (McEvoy et al., 2007).

The high variation detected when comparing the RFLP patterns of multiple *M. tuberculosis* isolates suggested an apparent lack of preferential location of the IS in the bacterial genome. However, one of the first conclusions to emerge was that insertion is not fully random.

Hermans and co-workers described the first integration *hot-spot* in the genome for this element, known as the *Direct Repeat* (DR) locus (Hermans et al., 1991). With minor exceptions, all members of the MTBC carry a copy of IS*6110* integrated in that locus, a characteristic that has been exploited in the development of a widely applied typing procedure called spoligotyping (see part 3.2). Later on, another integration *hot-spot* was described, namely the *insertional preferential locus* (*ipl*) (Fang et al., 1997). It was shown to correspond to ORF Rv0797 of the virulent reference strain, which encodes another insertion sequence, IS*1547* (see part 1.2). These preferential integration sites are characterized by the occurrence of insertions in more than one site close to each other (Sampson et al., 2001). The list of preferential sites identified so far for the insertion of this element has risen to about half a dozen and will most probably increase (McEvoy et al., 2009).
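The working definition above (insertions at more than one site close to each other) can be illustrated with a short grouping routine. This is only a sketch under stated assumptions: the function names, the 500 bp gap and the coordinates in the usage note are all hypothetical and are not taken from the cited studies.

```python
def group_nearby_sites(positions, max_gap_bp=500):
    """Group insertion coordinates whose neighbours lie within max_gap_bp.

    max_gap_bp is an illustrative threshold, not a published value.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap_bp:
            clusters[-1].append(pos)   # close to the previous site: same cluster
        else:
            clusters.append([pos])     # far from the previous site: new cluster
    return clusters


def candidate_hot_spots(positions, max_gap_bp=500):
    """Keep only clusters holding more than one distinct insertion site."""
    return [c for c in group_nearby_sites(positions, max_gap_bp) if len(c) > 1]
```

With made-up coordinates such as `[1000, 1200, 1350, 90000, 250000, 250100]`, the isolated site at 90000 is discarded and two candidate clusters remain. A real analysis would pool sites from many isolates and would need a biologically justified window.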

Apart from the identification of preferential loci for IS*6110* insertion, the insertions were found not to be evenly distributed along the genome. Once the complete genome sequence of the virulent reference strain, H37Rv, became available, IS*6110* was found to be inserted more often in some genome regions, whereas other regions lacked this IS (Cole et al., 1998). In strain H37Rv, nearly the first 800 kbp from the origin of replication carry no copies of IS*6110*, while the element is located more or less randomly along the rest of the genome. The conclusion was that this IS-free part of the genome could be richer in essential genes. The same result was seen when other strains were studied.

Comparison of the IS*6110*-RFLP pattern with the corresponding list of insertion loci showed that RFLP has limited discriminatory power. Thus, finding more insertion loci than RFLP bands is not a rare event (Beggs et al., 2000; Warren et al., 2000; Alonso et al., 2011). This is more evident in strains carrying a high copy number of the IS.

[...] applicability. Whole-genome sequences of tens of MTBC strains are currently finished or at various stages of completion; however, that number cannot compete with the thousands of IS*6110*-RFLP patterns already registered in the available databases.

New technologies are currently under development that aim to determine the IS*6110* insertion sites of a large number of *M. tuberculosis* isolates using high-throughput methodologies, such as Massive-Insertion Site sequencing (IS-seq) (Sandoval et al., ESM-2010). Such procedures will surely help to unravel the sequences of the IS flanking regions in a more feasible manner.

It was considered that the influence of insertions on the set of active and inactive genes could give insight into the number of genes required for infection, and thus be a source of information on which genes, or which gene content, are essential for virulence (McEvoy et al., 2007).

Some studies compared the insertion loci of virulent strains with those of avirulent strains. The attenuated vaccine strain *M. bovis* BCG differs markedly in IS*6110* content from the virulent strain *M. tuberculosis* H37Rv: one and 16 copies, respectively. However, the IS*6110* copy number per genome does not appear to be related to the attenuation of the bacilli (see part 5). In fact, the avirulent strain H37Ra has one additional copy compared with its parental strain, the virulent H37Rv. Comparison of the H37Rv and H37Ra genomes showed two main differences between them mediated by the insertion of IS*6110*. However, these changes have no clear role in the attenuation of the avirulent strain (Brosh et al., 1999).

Comparison of several BCG strains showed differences among them in relation to IS*6110*. The "ancestral" BCG (for example, BCG Tokio) carries two copies of the IS, sited in the DR region and upstream of the two-component system *pho*P-*pho*R (see part 4.2). The latter copy was lost in the "modern" BCG (for example, BCG Pasteur), which has a single copy inserted in the preferential locus already mentioned, namely the DR region (Brosh et al., 2007).

Essential genes could also be identified by detecting those that never carry an inserted IS, on the assumption that such mutations would be deleterious for the bacteria. An *in silico* study, based on previous experimental data, estimated that 35% of the genes in the *M. tuberculosis* genome are essential (Lamichhane et al., 2003). The data on insertion loci reveal transposition/recombination events in both coding and non-coding regions and, generally speaking, a higher number of insertion loci has been detected inside coding regions. However, non-coding sequences represent only 10% of the genome available to host the IS. Therefore, relative to the size of each fraction, the proportion of insertions inside non-coding regions is actually higher than the proportion inside coding regions (Table 3). This could represent a sort of "ORF-preserving" behaviour of the genome variability mediated by IS*6110* transposition. It is consistent with the suggestion of a greater selection against intragenic insertion in *M. tuberculosis* during infection *in vivo* than during growth *in vitro* (Yesilkaya et al., 2005).
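The size-adjusted comparison in the paragraph above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes a genome of roughly 4,400 kb split 90/10 into coding and non-coding sequence, a denominator the chapter does not state, and it uses the raw coding and non-coding counts reported in Table 3 for four of the studies. The names `GENOME_KB`, `per_100_kb` and `normalised_rows` are hypothetical.

```python
# Assumption: genome of about 4,400 kb, 90% coding and 10% non-coding.
GENOME_KB = 4400
CODING_KB = GENOME_KB * 9 // 10      # 3960 kb
NONCODING_KB = GENOME_KB // 10       # 440 kb

# (reference, insertions in coding regions, insertions in non-coding regions)
TABLE_3_COUNTS = [
    ("Kivi et al., 2002", 28, 13),
    ("Otal et al., 2008", 6, 6),
    ("Alonso et al., 2011", 13, 19),
    ("Sampson et al., 2001", 57, 33),
]


def per_100_kb(count, region_kb):
    """Insertion sites per 100 kb of the given genome fraction."""
    return 100.0 * count / region_kb


def normalised_rows(rows=TABLE_3_COUNTS):
    """Return (reference, coding density, non-coding density), rounded to 2 dp."""
    return [
        (ref,
         round(per_100_kb(coding, CODING_KB), 2),
         round(per_100_kb(noncoding, NONCODING_KB), 2))
        for ref, coding, noncoding in rows
    ]


if __name__ == "__main__":
    for ref, coding, noncoding in normalised_rows():
        print(f"{ref}: coding {coding}, non-coding {noncoding}")
```

Under these assumptions the values come out close to the bracketed figures in Table 3, and in every row the non-coding density exceeds the coding density even where the raw coding count is larger.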

In a study of 161 clinical isolates of *M. tuberculosis*, the insertion sites of IS*6110* were determined (Yesilkaya et al., 2005). Only 100 ORFs were affected by insertion, which the authors considered to represent a low overall number of non-essential genes. In conclusion, most of the genes in *M. tuberculosis* might play an important role in infection and transmission.

From the data obtained thus far, a high proportion of the genes whose coding sequence is targeted by IS*6110* belong to the functional category containing the PE-PPE group of genes (see references in Table 3). These genes are very characteristic of the MTBC members and are considered to be related to the antigenic variability of the bacilli (McEvoy et al., 2009).

IS*6110* the Double-Edged Passenger 73



| Reference | Isolates studied | Sites identified | Coding (%) | Non-coding (%) | Comments |
|---|---|---|---|---|---|
| Kivi et al., 2002 | 8 | 41 | 28 (0.7) | 13 (2.9) | (a) |
| Yesilkaya et al., 2005 | 161 | 818 | 491 (0.12) | 327 (0.74) | |
| Otal et al., 2008 | 7 | 12 | 6 (0.15) | 6 (1.36) | *M. bovis* |
| Alonso et al., 2011 | 3 | 32 | 13 (0.33) | 19 (4.3) | *M. tuberculosis* Beijing family |
| Sampson et al., 2001 | 34 | 97 | 57 (1.44) | 33 (7.5) | Identified 13 preferential loci |
| Beggs et al., 2000 | | | | | |
| Warren et al., 2000 | | | | | |

Table 3. Number of IS*6110* insertion sites recorded from the literature. Percentages are approximate, considering that 90% and 10% of the genome correspond to coding and non-coding sequences, respectively.

(a) With the exception of the direct repeat locus, all low copy number strains analysed in this study have IS*6110* inserted exclusively inside coding regions.

As previously mentioned, the hallmark of IS*6110* transposition is the presence of 3-4 bp direct repeats immediately flanking the IS sequence. The current availability of annotated whole-genome sequences of MTBC members makes it possible to determine, for each IS copy, whether the insertion was due to a transposition or a recombination mechanism. According to the data derived from 81 insertions in 10 of those members, transposition is the more frequent mechanism by which this IS is inserted into the MTBC genome, regardless of the number of copies carried by the genome or of the target sequence (coding or non-coding regions) (Table 4). In all cases, the insertion in the direct repeat locus was the consequence of a transposition event.

| MTBC | Copies of IS*6110* | 3-4 bp repeat | No repeat |
|---|---|---|---|
| *M. tuberculosis* H37Rv | 16 | 12 | 4 |
| *M. tuberculosis* H37Ra | 17 | 13 | 4 |
| *M. tuberculosis* CDC1551 | 4 | 4 | 0 |
| *M. tuberculosis* KZN | 14 | 9 | 5 |
| *M. tuberculosis* F11 | 17 | 15 | 2 |
| *M. africanum* | 7 | 6 | 1 |
| *"M. canettii"* | 2 | 1 | 1 |
| *M. bovis* | 1 | 1 | 0 |
| *M. bovis* BCG Pasteur | 1 | 1 | 0 |
| *M. bovis* BCG Tokio | 2 | 2 | 0 |
| TOTAL | 81 | 64 (79%) | 17 (21%) |

Table 4. Data collected from whole-genome sequences of the corresponding strains (http://www.ncbi.nlm.nih.gov/). For each genome, the number of copies of IS*6110* is indicated, as well as how many of them carry or lack the 3-4 bp direct repeat sequence.

#### **4.1.3 IS***6110* **in the genome of the Beijing family**

Efforts were directed to the study of clinical isolates that are particularly relevant from a microbiological, clinical or epidemiological point of view. This was the case for members of the Beijing family.

Since its first description in 1995, the *M. tuberculosis* Beijing strain has become a major health problem worldwide. It was responsible for one of the most important outbreaks caused by a multidrug-resistant strain in the USA during the early nineties (Moss et al., 1997). *M. tuberculosis* Beijing identifies a family that includes highly transmissible drug-resistant and drug-susceptible strains and is currently responsible for about one third of global TB cases (Alonso et al., 2011).

Members of the Beijing family are usually high copy number strains (HCS) of IS*6110* (between 15 and 25 copies per genome), suggesting the relevance of this element to the variability of their genomes. Supporting this possibility, sublineages of this family were found to carry a large genome duplication that involves up to 8% of the genome (corresponding to more than 300 genes). Copies of IS*6110* were identified flanking that duplication, suggesting a homologous recombination event mediated by this IS (Domenech et al., 2010).

The IS*6110* insertion sites of two drug-resistant Beijing strains (W and 210) were determined (Beggs et al., 2000). These strains shared up to 17 insertion sites. Several features related to IS*6110* characterize this family, such as the presence of one copy in the *ori*C region, the deletion of the right-site DR spacers (spacers 1 to 34) and a similar multiband RFLP pattern (Hanekom et al., 2011).

Recently, in a study undertaken in the laboratory of one of the authors (Alonso et al., 2011), the insertion sites of another Beijing strain were determined and compared with those of strains W and 210. A higher proportion of insertions in non-coding regions was found, including one locus with putative promoter-influence activity (see part 4.2). Nine loci common to all three Beijing strains, including *ori*C, were also identified.

The presence of the IS in *ori*C, the region that controls the replication of the genome, is expected to have some influence on the synchronization of bacterial cell division (Casart et al., 2008). This site is currently considered a preferential locus, and multiple transposition events have been described in several clinical isolates from patients in Caracas (Venezuela) (Turcios et al., 2009). Both infection in the animal model and the *in vitro* growth rate were further analysed for those clinical strains and compared with strains lacking the IS in the *ori*C region (Casart et al., 2008). The presence of IS*6110* in the origin of replication enlarges the bacilli and causes a slow growth rate *in vitro*; in addition, the IS apparently causes attenuation in the animal model.

#### **4.2 Switching** *on* **and** *off* **genes**

To date, the data collected on *M. tuberculosis* confirm that its genome is highly conserved. This raises the possibility that differences among isolates are more likely to be found through the study of regulatory and/or metabolic pathways. With that in mind, we should not forget that a large proportion of the ORFs identified in the tubercle bacilli are of unknown function. Nevertheless, the insertion of IS*6110* outside ORFs, even though it spares the bacilli the direct knockout of one or several genes, could have important consequences for gene expression and thereby influence the metabolic activity of the bacilli.

IS insertion could interfere with both the initiation and the termination of gene expression, provided that it is inserted upstream or downstream of the gene coding sequence. It is considered that the influence of the IS on downstream genes is related to the distance between the gene and the 3'-end of the IS. Thus, a promoter influence is possible when that distance is within the range of 31 to 300 bp.

This could be due to a polar effect of the IS and also to the presence of an outward promoter that was identified close to the 3'-end of IS*6110* (Safi et al., 2004; Soto et al., 2004).

The promoter carried by IS*6110* is notable for being activated inside monocytes (Safi et al., 2004), and its activity was demonstrated on several genes, not only in strain H37Rv but also in other clinical strains, including Beijing strains (Safi et al., 2004). Remarkably, that promoter activity has been demonstrated by the upregulation of the main two-component system of this bacterium, namely *pho*P/*pho*R (Soto et al., 2004).

In this context, the presence of the IS inserted in the *dna*A-*dna*N region, which encodes proteins that control genome replication, is noteworthy. This insertion was identified in many strains, including several belonging to the Beijing family (see part 4.1.3). Moreover, the IS can be inserted in either orientation in this region (Turcios et al., 2009; Casart et al., 2008), and thus could have a variable influence on bacterial cell division.

\* CHP: Conserved Hypothetical Protein.

Table 5. Identified genes located under the putative influence of the IS*6110* promoter activity. Changes in gene expression were demonstrated in some of the cases.

Much effort should be devoted to completing the record of the loci in which IS*6110* is inserted. That knowledge will greatly help our understanding of the mechanisms used by the tubercle bacilli to cause tuberculosis so successfully.

#### **5.1** *M. tuberculosis* **low copy number strains (LCS) & high copy number strains (HCS)**

*M. tuberculosis* strains with fewer than six copies of IS*6110* are usually referred to as low copy number strains (LCS) in the literature. A few clinical investigations reported the presence of LCS in regions such as India, Vietnam or Tanzania (Barlow et al., 2001; Sankar et al., 2011a). In Tiruvallur, South India, 66% of the *M. tuberculosis* strains isolated presented a single copy of IS*6110* or were LCS (Shanmugam et al., 2011). In Kanpur district, North India, 17% of the *M. tuberculosis* isolates were LCS (Purwar et al., 2011). High copy number *M. tuberculosis* strains (HCS), with six or more copies of IS*6110*, were reported by a greater number of papers. One study from Brazil reported that 93.6% of *M. tuberculosis* strains had at least six copies, with copy numbers in the study ranging from 1 to 18 (Suffys et al., 2000). In San Francisco, of 1,326 isolates investigated, 90% had six or more copies and only two isolates had no copies of IS*6110* (Yang et al., 1998).

The majority (96.2%) of the 183 strains fingerprinted from Kampala were HCS. These strains were isolated from patients of known HIV serostatus. The number of IS*6110* copies ranged from 1 to 20, and the frequency of occurrence of IS*6110* bands was similar in the two serogroups. The most prevalent pattern had 14 copies of IS*6110*, with the same distribution among HIV-seropositive and HIV-seronegative patients (Asiimwe et al., 2009).

Chauhan et al. analysed 308 isolates of *M. tuberculosis* from different parts of India, and 56% of the isolates were HCS for IS*6110*. At the regional level, there was not much difference in the IS*6110* copy numbers of isolates from different parts of that country (Chauhan et al., 2007).

A long-term population-based study analysing 1,759 clinical strains from the state of Alabama showed that 65% were HCS. The results of this study demonstrated that clustering of cases is clearly associated with different social factors and risk behaviours, but not with a high or low number of copies of IS*6110* (Kempf et al., 2005).

#### **5.2 Are there any clinical properties associated with LCS or HCS?**

A review of the literature on the origin of outbreaks, including MDR cases, made it evident that LCS and HCS were involved in outbreaks in similar proportions. Some examples of large outbreaks in population studies involving strains with different copy numbers are listed in Table 6.

The Beijing family is one of the lineages with the highest number of copies of IS*6110* (see part 4.1.3). The behaviour of the Beijing lineage is controversial. On the one hand, a Beijing strain named GC1237 has been responsible for epidemic outbreaks since it appeared in the community in 1993 (Caminero et al., 2001); on the other hand, one study conducted in Cape Town (South Africa) found no significant association between *M. tuberculosis* genotype and transmissibility within the household (Marais et al., 2009). In addition, outbreaks caused by LCS have been reported, such as the extensive transmission of *M. tuberculosis* in a rural population with minimal risk factors for TB. That strain was designated CDC 1551, and its fingerprint showed only 4 copies of IS*6110* (Valway et al., 1998).
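The copy-number threshold described in part 5.1 (fewer than six copies for LCS, six or more for HCS) can be written as a small helper. This is a minimal sketch: the function name `classify_copy_number` and the constant `HCS_MIN_COPIES` are hypothetical, and strains with no copies are simply counted as LCS here because the stated definition does not treat them separately.

```python
HCS_MIN_COPIES = 6  # "six or more copies" defines HCS in part 5.1


def classify_copy_number(copies: int) -> str:
    """Label a strain as LCS or HCS from its IS6110 copy number.

    Zero copies fall under LCS here; that is an assumption that follows
    the literal "fewer than six copies" wording.
    """
    if copies < 0:
        raise ValueError("copy number cannot be negative")
    return "HCS" if copies >= HCS_MIN_COPIES else "LCS"


if __name__ == "__main__":
    # Copy numbers quoted in the text: CDC 1551 (4), H37Rv (16), BCG Pasteur (1)
    for name, copies in [("CDC 1551", 4), ("H37Rv", 16), ("BCG Pasteur", 1)]:
        print(name, copies, classify_copy_number(copies))
```

Applied to copy numbers quoted in the text, CDC 1551 with 4 copies falls in the LCS group and H37Rv with 16 copies in the HCS group.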
