188 Gene Duplication

Fig. 10. Relationship between average genetic distance and cluster size for duplicate gene families of *M. tuberculosis* H37Rv, *M. bovis*, *M. paratuberculosis*, *M. avium*, *M. ulcerans* and *M. leprae*. The X-axis represents the cluster size (total proteins in each cluster) and the Y-axis shows the average genetic distance between the identified gene families of each organism. The average genetic distance appears to decrease with the cluster size, suggesting a negative correlation between these two factors.

**3.4.1 Analysis of sigma factor proteins in subfamily A**

Following duplication, the proteins of this subfamily have diverged into 2 groups: SigE and SigM (Figure 11). The 2 proteins in the SigE group have further diverged following duplication. However, one of the proteins in the SigE group was found to have no ortholog in *M. leprae* (Figure 12), and loss of various sigma factors is suggested to be the reason for the reductive genome evolution of *M. leprae* (Babu, 2003). Interestingly, all the paralogs of *M. tuberculosis* appear to have orthologs in *M. bovis*, but the absence of sigM proteins in *M. bovis* and the large divergence of this protein group in *M. tuberculosis* compared to other mycobacteria enable us to speculate on its significance in *M. tuberculosis* evolution. Though an error in available *M. bovis* sequences could have resulted in incorrect annotation of the sigM locus as a pseudogene (Manganelli *et al.,* 2004), the extent of divergence of this protein in *M. tuberculosis* compared to other mycobacteria prompts further investigation into its possible paths of pseudogenization or neofunctionalization.

**4. Discussion**

Analysis of Duplicate Gene Families in Microbial Genomes and Application to the Study of Gene Duplication in *M. tuberculosis* 191

Fig. 12. Phylogenetic tree of the Sigma factor ortholog and paralog cluster inferred by the maximum likelihood method. The duplication events are marked by A's and B's. The labels Mtu, Mbo, Mav, Mul, Mpa and Mle represent proteins from *M. tuberculosis, M. bovis, M. avium, M. ulcerans, M. paratuberculosis and M. leprae* respectively.

The availability of complete genome sequences of many bacteria and significant progress in the development of modern computational biology methods have resulted in a powerful platform for the comparative investigation of genome diversity across different organisms. Here, we make use of the wealth of genome information and bioinformatics tools to understand the significance of gene duplication in *M. tuberculosis* evolution. The investigation of relationships between the GC composition and duplicate gene percentages identified from the sequence and InterPro domain data provides sufficient evidence to suggest a positive correlation between them for group1 and group3 organisms. The mycobacterial species are part of the group1 organisms, so the maintenance of high duplicate gene percentages in these species, with the exception of *M. leprae,* could be attributed to the high GC composition of their genomes. Unsurprisingly, the study has also shown a correlation between duplicate gene percentage and genome size, suggesting that gene duplication increases genome size. Further, our investigations on protein complexity provide deeper insights into the general trend in gene length and domain number in duplicate genes in these organisms.
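The kind of correlation described in the preceding paragraph can be illustrated with a short sketch. It is not the procedure used in the study: the choice of Pearson's coefficient, the function name `pearson`, and every number in the `genomes` list are assumptions made purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Hypothetical values only; these are NOT data from the study.
# Each tuple: (label, GC content in %, genome size in Mb, duplicate genes in %)
genomes = [
    ("genome_a", 38.0, 1.8, 12.0),
    ("genome_b", 50.0, 3.2, 25.0),
    ("genome_c", 58.0, 4.0, 33.0),
    ("genome_d", 65.0, 4.4, 45.0),
]

gc = [g[1] for g in genomes]
size = [g[2] for g in genomes]
dup = [g[3] for g in genomes]

r_gc_dup = pearson(gc, dup)      # GC composition vs duplicate gene percentage
r_size_dup = pearson(size, dup)  # genome size vs duplicate gene percentage
print(f"GC vs duplicates: r = {r_gc_dup:.2f}")
print(f"Size vs duplicates: r = {r_size_dup:.2f}")
```

A rank-based coefficient such as Spearman's may be preferable when the relationship is monotonic but not linear; the sketch uses Pearson only because it is the simplest to write out in full.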

He and Zhang (2005), investigating *Saccharomyces cerevisiae*, showed duplicate genes to be complex molecules with longer sequences containing more functional domains. From the investigation of the mean gene lengths of 76 pathogenic and non-pathogenic organisms, it is evident that the average length of the duplicate genes is higher than that of single copy genes. However, the analysis of the mean number of domains in the duplicate and single copy genes reveals a higher number of domains in the single copy genes than in the duplicate genes. Partial loss-of-function mutations are reported to lead to the preservation of duplicate genes with single functions (Stoltfus, 1999; Lynch & Force, 2000). Moreover, it has been suggested that duplicate genes lose one of the domains that were originally present in the ancestral molecule, and that, through complementation of the lost domains, the two daughter copies together reflect the original ancestral function. Thus, gene complexity is suggested to be reduced after subfunctionalization of duplicate genes (He & Zhang, 2005). Further, the complementary loss of subfunctions is considered to facilitate the preservation of duplicate gene pairs, and due to relaxed evolutionary constraints following subfunctionalization, the chances of long-term evolution of new functions are enhanced (Force *et al.,* 1999). However, since deleterious mutations are more common than beneficial mutations (Cun, 2010), evolution of new and essential protein functions is considered to be a rare event (Nadeau & Sankoff, 1997; Force *et al.,* 1999). According to the predominant argument, the evolution of new domains would be favored only if they can perform a function different from that of preexisting domains or domain combinations (Lagomarsino *et al.,* 2009). Further, the majority of duplicate genes are predicted to develop new functions from the already existing ancestral gene functions, and if new functions evolve by mutation from prior domains, it is less likely that all of the domains would evolve into new domains, because the mutational bridge for new domain evolution is too far from the ancestral molecule (Lagomarsino *et al.,* 2009). Hence, evolution of new functions following subfunctionalization could be a rare event. Therefore, the presence of fewer domains in the duplicate genes than in the single copy genes could be due to evolution of duplicate genes by subfunctionalization, where complementary loss of subfunctions is viewed as primarily facilitating preservation of the duplicate gene.
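The comparison of gene length and domain number between duplicate and single copy genes can be sketched as follows. This is a minimal illustration, not the study's pipeline: the record fields (`length`, `n_domains`, `is_duplicate`), the function name `summarize`, and all values in `genes` are hypothetical.

```python
from statistics import mean


def summarize(genes):
    """Mean length and mean domain count for duplicate vs single copy genes.

    `genes` is a list of dicts with the (hypothetical) keys
    'length', 'n_domains' and 'is_duplicate'.
    """
    groups = {"duplicate": [], "single_copy": []}
    for gene in genes:
        key = "duplicate" if gene["is_duplicate"] else "single_copy"
        groups[key].append(gene)

    summary = {}
    for label, members in groups.items():
        summary[label] = {
            "n": len(members),
            "mean_length": mean(g["length"] for g in members) if members else None,
            "mean_domains": mean(g["n_domains"] for g in members) if members else None,
        }
    return summary


# Toy records, invented for illustration; NOT data from the study.
genes = [
    {"id": "g1", "length": 400, "n_domains": 1, "is_duplicate": True},
    {"id": "g2", "length": 500, "n_domains": 1, "is_duplicate": True},
    {"id": "g3", "length": 300, "n_domains": 2, "is_duplicate": False},
    {"id": "g4", "length": 350, "n_domains": 3, "is_duplicate": False},
]

result = summarize(genes)
for label, stats in result.items():
    print(label, stats)
```

In this toy input the duplicate genes are longer on average but carry fewer domains, which mirrors the direction of the trend reported in the text; the numbers themselves carry no biological meaning.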
Alternatively, the addition of new domains into a bacterial genome could be due to acquisition by HGT. Indeed, acquisition of one or more domains by HGT in 30 to 50 percent of bacteria has been reported (Choi & Kim, 2007). The acquired gene or gene segment is known to be beneficial only if it has some properties different from those of the recipient genome (Kinsella *et al.,* 2003). Since the selection of a new domain would depend upon its ability to perform a biological function that is not covered by pre-existing domains, addition of such rare domains by HGT could be an uncommon phenomenon (Lagomarsino *et al.,* 2009). Adaptation of bacteria to new environments requires evolution of new functions (Hooper & Berg, 2003), and gene duplication is viewed as the general mechanism of adaptation to different environmental conditions (Kondrashov, 2002). However, a recent study suggests HGT to be a far more important route to adaptation than gene duplication (Koonin & Wolf, 2009). Further, duplication of horizontally transferred genes with weak or no functions is suggested to accelerate the evolutionary process of gene innovation. Since both gene duplication and HGT are considered to be important routes of bacterial adaptation to changing environments, and amplification of weak ancillary functions is considered to be the easiest route to gene innovation, quick adaptation of bacteria to changing environments could be due to amplification of weak ancillary functions. Thus, the reduced functional complexity of the investigated duplicate genes compared to single copy genes could be due to preservation of the majority of the paralogs by subfunctionalization, and the rare event of neofunctionalization could have been due either to divergence of subfunctions over an evolutionary period of time following preservation of subfunctionalized paralogs, or mostly to rapid amplification of weak ancillary functions after gene duplication.

To gain deeper insights into the functional complexity of duplicate genes in *M. tuberculosis*, we focussed on the evolutionary analysis of the duplicate genes in six of the closely related mycobacterial species.

**6. Acknowledgement**

We thank the National Bioinformatics Network and Computational Biology Group, University of Cape Town, South Africa for supporting this work.

**7. References**

Abascal, F.; Zardoya, R. & Posada, D. (2005). ProtTest: selection of best-fit models of protein evolution. *Bioinformatics Applications Note,* Vol. 21, No. 9, pp. 2104–2105.

Agarwal, N.; Woolwine, S.C.; Tyagi, S. & Bishai, W.R. (2007). Characterization of the *Mycobacterium tuberculosis* Sigma Factor SigM by Assessment of Virulence and Identification of SigM-Dependent Genes. *Infection and Immunity,* Vol. 75, No. 1, pp. 452–461.

From the analysis of maximum genetic distance between the two most distant proteins of the mycobacterial multiple genome clusters, we suggest that the divergence of at least one of the duplicate gene copies from the ancestral gene increases with the increase in cluster size. These homologous gene families consist of orthologs and paralogs. The lack of a strong correlation between the average genetic distance and cluster size of the duplicate gene copies in the multiple genome clusters indicates that the homologous gene families including proteins from different mycobacterial species have not undergone complete functional divergence. This is to be expected for orthologs, which tend to maintain their functions.

The average genetic distance estimated for single genome paralogous gene clusters, on the other hand, decreases with the increase in cluster size, suggesting that, on average, smaller families tend to diverge more rapidly than larger families. The exception is some members of the larger families, which have evidently diverged further, as they contribute to the increase in maximum genetic distance with cluster size.

Though gene duplication is considered to be an important mechanism for acquiring new genes and creating evolutionary novelty (Torgerson and Singh, 2004), horizontal gene transfer (HGT) is also known to be a widespread phenomenon, and a significant proportion of genes in bacteria are accepted to have been acquired by HGT (Price *et al.*, 2007). The genome of *M. tuberculosis* is known to contain 19 genes of eukaryotic origin, and it is speculated that the organism may have also acquired genes from other prokaryotes by HGT (Kinsella *et al.,* 2003). In addition, the occurrence of many intraspecies HGT events in the progenitor of *M. tuberculosis* has been reported (Rosas-Magallanes *et al.*, 2006). The ability of HGT to incorporate a new gene which is homologous to an existing gene family member is well recognized (Ochman, 2001; Kinsella *et al.*, 2003; Krzywinska, 2004), and in comparison to its gene family members, the newly introduced gene may be more divergent in sequence and function (Pushker *et al.*, 2004). Following duplication, such laterally transferred genes with already divergent functions may further diversify in the process of evolving new functions, and this could result in an increase in genetic distance between the laterally transferred duplicate gene and its paralogous gene family members. The chance of this occurring should increase with the number of family members.
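The two cluster-level statistics used in this part of the discussion, the average and the maximum pairwise genetic distance within a cluster, can be sketched as below. This is only an illustration under stated assumptions: the distances are taken as already estimated, the function name `cluster_distance_stats` and the data layout (a dictionary keyed by unordered protein pairs) are hypothetical, and the protein labels and values are invented.

```python
from itertools import combinations
from statistics import mean


def cluster_distance_stats(members, dist):
    """Return (cluster size, average distance, maximum distance).

    members: list of protein identifiers in one cluster.
    dist: dict mapping frozenset({a, b}) to the pairwise genetic distance.
    """
    pairwise = [dist[frozenset(pair)] for pair in combinations(members, 2)]
    if not pairwise:
        raise ValueError("a cluster needs at least two members")
    return len(members), mean(pairwise), max(pairwise)


# Invented example: one cluster of three proteins; NOT data from the study.
members = ["p1", "p2", "p3"]
dist = {
    frozenset({"p1", "p2"}): 0.25,
    frozenset({"p1", "p3"}): 0.5,
    frozenset({"p2", "p3"}): 0.75,
}

size, avg_d, max_d = cluster_distance_stats(members, dist)
print(f"size={size} average={avg_d} maximum={max_d}")
```

Because the maximum depends only on the single most distant pair, it can rise with cluster size even when the average falls, which is the pattern the text describes for the larger families.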

Phylogenetic analysis of the sigma factors in *M. tuberculosis* suggested that most of the sigma factors have orthologs in other mycobacteria. However, we did not observe orthologs in *M. leprae* for a few of the subfamilies, which could be due to the extensive loss of sigma factors during its reductive genome evolution. Agarwal *et al.* (2007) report that sigM proteins control only a small subset of genes, and that their loss would not influence *M. tuberculosis* virulence. The difference in the divergence of sigM in *M. tuberculosis* compared to other mycobacteria, and the absence of its ortholog in *M. bovis*, should be considered further to study the importance of the sigM factor in *M. tuberculosis* virulence.

**5. Conclusions and future work**

The estimated duplicate gene percentages for *M. tuberculosis* from independent genome clustering (31%), InterPro signature methods (38%), across genome clustering (49%) and a union of the methods (51%) were all relatively high, showing the significance of gene duplication in *M. tuberculosis* genome evolution. The investigation of relationships between the GC composition and duplicate gene percentages identified from the sequence and InterPro domain data provides sufficient evidence to suggest that for the mycobacterial species, with the exception of *M. leprae*, the maintenance of high duplicate gene percentages could be attributed to the high GC composition of their genomes. The study has also shown a correlation between duplicate gene percentage and genome size, suggesting that gene duplication increases genome size, which is a logical result. Interestingly, our functional complexity results contrast with recent findings in eukaryotes: we show that, on average, duplicate genes have longer sequences but fewer domains than single copy genes in the investigated organisms. The reduced functional complexity of duplicate genes could be due to their evolution by subfunctionalization following duplication.
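How a union of detection methods yields a percentage at least as high as any single method can be shown with a small sketch. The gene identifiers, the three method sets and the function name `duplicate_percentage` are hypothetical and chosen only to keep the arithmetic easy to follow; they do not reproduce the study's figures.

```python
def duplicate_percentage(duplicate_ids, all_ids):
    """Percentage of genes in `all_ids` that are flagged as duplicates."""
    all_set = set(all_ids)
    if not all_set:
        raise ValueError("the genome must contain at least one gene")
    return 100.0 * len(set(duplicate_ids) & all_set) / len(all_set)


# Invented toy genome of ten genes; NOT data from the study.
all_genes = [f"g{i}" for i in range(1, 11)]

# Hypothetical duplicate sets reported by three detection methods.
independent_clustering = {"g1", "g2", "g3"}
interpro_signatures = {"g3", "g4"}
across_genome_clustering = {"g1", "g5", "g6"}

union = independent_clustering | interpro_signatures | across_genome_clustering

pct_independent = duplicate_percentage(independent_clustering, all_genes)
pct_interpro = duplicate_percentage(interpro_signatures, all_genes)
pct_across = duplicate_percentage(across_genome_clustering, all_genes)
pct_union = duplicate_percentage(union, all_genes)

print(pct_independent, pct_interpro, pct_across, pct_union)
```

Since the methods overlap, the union is smaller than the sum of the individual percentages but never smaller than the largest of them, consistent with the union estimate in the text being only slightly above the highest single-method estimate.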

We also show that duplicate gene families of mycobacterial multiple genome clusters have not undergone complete functional divergence following gene duplication and still tend to maintain their functions. Our maximum genetic distance results suggest that multiple duplication events in a few of the duplicate copies of bigger families may result in their functional divergence from the original ancestral functions. Our paralog maximum genetic distance results suggest that the increase, with gene family size, in the genetic distance between the two most distant proteins may be due to duplication followed by paralog evolution of some of the distant genes that already have functions divergent from those of their paralogous family members. From the study of the average genetic distance of paralogs, we suggest that the slow evolution of paralogs of large families in *M. tuberculosis* could be due to preservation of original ancestral functions by the mechanism of subfunctionalization.

For future studies, we are investigating selection pressure and comparing the evolution of smaller and larger gene families to shed light on the evolutionary fate of duplicate genes and functional innovation in *M. tuberculosis*. In addition, since the functional constraints on amino acid residues are known to differ due to potential changes in protein function, we have studied site-specific rate differences between the amino acids of closely related mycobacterial species to aid in deciphering specific subfamily evolutionary divergence following gene duplication. Such predicted critical amino acids, when mapped onto protein secondary structure, could help in the evaluation of important structural locations in functional diversification. In addition to functional divergence, gene expression data from different experimental conditions are of use in understanding the degree of expression divergence of the genes following duplication events. Overall, we are working on further investigation of *M. tuberculosis* duplicate genes with the integration of phylogeny-sequence-structure-function-expression information, which will be valuable for understanding the functional and evolutionary fate of genes following gene duplication in *M. tuberculosis*.
