**3.3 Functional and evolutionary analysis of expanded genes in** *M. tuberculosis*

The protein sequence and signature data for the 76 genomes were clustered into related sets of duplicate genes for the study of relationships between percentage duplication and the GC content, genome size and gene complexity described above. However, since the comparison of closely related organisms is better for inferring evolutionary relationships, we separately clustered six of the closely related mycobacterial genomes and identified 390 duplicate gene clusters in *M. tuberculosis*. The results were represented as a phylogenetic profile and a summary is shown in Table 5.
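As an illustration of how the summary columns of Table 5 relate to one another, the following minimal sketch recomputes the duplicate gene percentage and checks that single copy and duplicate clusters add up to the total number of clusters. It assumes that the percentage is defined as duplicate genes divided by total genes, which the source does not state explicitly; the function names are illustrative only. The *M. leprae* row is omitted because its values as printed do not satisfy these two checks.

```python
# Rows transcribed from Table 5 for five of the six genomes:
# (organism, total genes, TGC, SCG, DGC, total duplicate genes, reported %)
TABLE5_ROWS = [
    ("M. tuberculosis", 3947, 2815, 2425, 390, 1521, 38.53),
    ("M. bovis", 3910, 2817, 2439, 378, 1471, 37.62),
    ("M. paratuberculosis", 4316, 2807, 2343, 464, 1973, 45.71),
    ("M. avium", 5040, 3199, 2679, 520, 2361, 46.84),
    ("M. ulcerans", 4206, 2755, 2359, 396, 1847, 43.91),
]


def duplicate_percentage(duplicate_genes: int, total_genes: int) -> float:
    """Assumed definition: duplicate genes as a percentage of all genes."""
    return 100.0 * duplicate_genes / total_genes


def check_row(row, tolerance: float = 0.01):
    """Return (clusters_add_up, percentage_matches) for one Table 5 row."""
    _, genes, tgc, scg, dgc, dup_genes, reported = row
    # Single copy clusters plus duplicate clusters should equal all clusters.
    clusters_add_up = (scg + dgc == tgc)
    # The recomputed percentage should agree with the printed value.
    recomputed = duplicate_percentage(dup_genes, genes)
    percentage_matches = abs(recomputed - reported) <= tolerance
    return clusters_add_up, percentage_matches


if __name__ == "__main__":
    for row in TABLE5_ROWS:
        name, genes, dup_genes = row[0], row[1], row[5]
        print(name, round(duplicate_percentage(dup_genes, genes), 2), check_row(row))
```

The tolerance of 0.01 allows for the printed percentages having been truncated rather than rounded.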


| Organism | Total genes | TGC | SCG | DGC | Total duplicate genes | Estimated duplicate gene percentage |
|---|---|---|---|---|---|---|
| *M. tuberculosis* | 3947 | 2815 | 2425 | 390 | 1521 | 38.53 |
| *M. bovis* | 3910 | 2817 | 2439 | 378 | 1471 | 37.62 |
| *M. paratuberculosis* | 4316 | 2807 | 2343 | 464 | 1973 | 45.71 |
| *M. avium* | 5040 | 3199 | 2679 | 520 | 2361 | 46.84 |
| *M. ulcerans* | 4206 | 2755 | 2359 | 396 | 1847 | 43.91 |
| *M. leprae* | 1036 | 1603 | 1261 | 119 | 342 | 21.33 |

Table 5. Protein sequence clustering of the mycobacterial group. The columns of the table represent the selected organisms, total number of genes in the genome, total number of identified gene clusters (TGC), total number of single copy genes (SCG), total number of duplicate gene clusters (DGC) in the organism, total number of duplicate genes in the duplicate gene clusters identified, and the percentage of duplicate genes estimated for each organism.

The largest expanded family in *M. tuberculosis* was the PE/PPE/PGRS family with 164 members, followed by a family of alcohol dehydrogenases and oxidoreductases with 44 members, the fatty-acid-CoA ligase family with 33 members, and the acyl-CoA dehydrogenase family with 27 members. High-level functional classes had previously been assigned manually in the laboratory for all *M. tuberculosis* proteins, and these assignments were used here to determine the functional distribution of all 390 expanded families in this organism. Figure 8 shows the number of families and the number of proteins belonging to each functional class. The largest class comprises enzymes and other proteins involved in metabolism, followed by proteins of unknown function. From the data, we selected 116 gene clusters that showed gene family expansions in *M. tuberculosis* and *M. leprae*, as well as in other mycobacteria. Expansion in *M. leprae* is of particular interest because this is a highly reduced mycobacterial genome, so expanded genes that have been maintained are likely to be important. When only the 116 families that are also expanded in *M. leprae* are considered, the distribution of functional classes is similar, except for a large reduction in the number of unknown protein families.

Fig. 8. Distribution of functions in *M. tuberculosis* (all 390 clusters) and *M. leprae*-shared (116 clusters) expanded families.

For each of the 116 clusters of interest, we calculated the genetic distance between family members and investigated the relationship between genetic distance and gene family size. Our results suggest that the genetic distance between the two most distant proteins in a cluster increases with cluster size. The correlation coefficient of 0.87, with a p-value of 2.2 × 10⁻¹⁶, indicates a strong positive correlation between these factors. These sets of duplicate proteins are clustered from different mycobacterial genomes, and since the estimated maximum genetic distance between the two most distant proteins in each set increases with cluster size, we inferred that some of the duplicate copies tend to diverge from the original ancestral functions after multiple duplication events in larger families. To investigate the average divergence of proteins in these clusters, the relationship between average genetic distance and cluster size was determined. The results suggest that the average genetic distance between family members does not increase with cluster size, except perhaps for the few larger families. To verify the results statistically, correlation coefficients were estimated using Pearson's product-moment correlation. The correlation coefficient of 0.43, with a p-value of 1.17 × 10⁻⁶, indicates a moderate correlation between the average genetic distance and cluster size. This suggests that the majority of homologous gene families identified from these mycobacterial species have not undergone significant functional divergence and still show close evolutionary relatedness, although these results may be skewed by the fact that the clusters contain both orthologs and paralogs. Orthologs are generally predicted to maintain similar functions, whereas paralogs are known to have diverged functions.

Analysis of Duplicate Gene Families

in Microbial Genomes and Application to the Study of Gene Duplication in *M. tuberculosis* 187

| Organism | df | Pearson's correlation | P-value |
|---|---|---|---|
| *M. tuberculosis* | 64 | -0.44 | 0.0001966 |
| *M. bovis* | 62 | -0.5 | 2.38e-05 |
| *M. paratuberculosis* | 59 | -0.41 | 0.0007649 |
| *M. avium* | 56 | -0.43 | 0.0007819 |
| *M. ulcerans* | 59 | -0.43 | 0.000633 |
| *M. leprae* | 36 | -0.44 | 0.005978 |

Table 7. Pearson's correlation coefficient results for the relationship between the average genetic distance and cluster size. The columns include degrees of freedom (df), Pearson's correlation coefficient values, and the corresponding P-values.

Fig. 9. Relationship between maximum genetic distance and cluster size for families of *M. tuberculosis* H37Rv, *M. bovis*, *M. paratuberculosis*, *M. avium*, *M. ulcerans* and *M. leprae*. The X-axis represents the cluster size (total proteins in each cluster) and the Y-axis shows the genetic distance between the two most distant proteins in the clusters of each organism. The genetic distance appears to increase with the cluster size, suggesting a correlation between them.
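The two per-cluster quantities discussed above, the maximum and the average genetic distance, can be sketched as follows. This is a minimal illustration that assumes the pairwise distances are already available as a lookup table; in the source they are derived from estimated phylogenetic trees, a step not reproduced here. The cluster members and distance values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def distance_summary(members, pairwise_distance):
    """Return (cluster_size, max_distance, mean_distance) for one cluster.

    members: list of protein identifiers in the cluster.
    pairwise_distance: dict mapping frozenset({a, b}) to a genetic distance.
    """
    if len(members) < 2:
        raise ValueError("a duplicate gene cluster needs at least two members")
    # Collect the distance for every unordered pair of family members.
    distances = [
        pairwise_distance[frozenset(pair)] for pair in combinations(members, 2)
    ]
    return len(members), max(distances), mean(distances)


# Hypothetical three-member cluster; identifiers and distances are made up.
toy_members = ["p1", "p2", "p3"]
toy_distances = {
    frozenset({"p1", "p2"}): 0.10,
    frozenset({"p1", "p3"}): 0.40,
    frozenset({"p2", "p3"}): 0.70,
}

size, max_d, mean_d = distance_summary(toy_members, toy_distances)
```

Applying such a function to every cluster yields one (size, maximum distance) pair and one (size, average distance) pair per family, which are the points plotted against cluster size in the scatter plots.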

The relationship between cluster size and genetic distance was also studied for 66 paralogous families only (within-genome clustering). Within each of the selected mycobacterial species, the estimated tree topologies were used to investigate the genetic divergence of the identified paralogous gene families by computing the genetic distance between the two most distant paralogs in each cluster. The computed maximum genetic distances were plotted against cluster size (Figure 9), and correlation coefficients were estimated to study the genetic divergence of these paralogous gene families (Table 6). From the scatter plot and the correlation coefficients, we inferred that the genetic distance between the two most distant proteins increases with cluster size.
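The correlation analysis can be illustrated with a minimal sketch of Pearson's product-moment correlation, written with the standard library only. It assumes the usual n − 2 degrees of freedom, under which the 66 *M. tuberculosis* families correspond to the df of 64 reported in Table 6. The toy cluster sizes and distances are hypothetical, and the conversion of the t statistic to a p-value, which requires the t distribution from a statistics package, is not shown.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient of two samples."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two samples of equal length with n >= 3")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, df):
    """t statistic for testing r against zero with df = n - 2 (requires |r| < 1)."""
    return r * math.sqrt(df / (1.0 - r * r))


# Hypothetical cluster sizes and maximum genetic distances.
sizes = [2, 3, 4, 6, 9]
max_dist = [0.2, 0.5, 0.4, 0.9, 1.3]

r = pearson_r(sizes, max_dist)
df = len(sizes) - 2
t = t_statistic(r, df)

# t statistic implied by the M. tuberculosis row of Table 6 (r = 0.88, df = 64).
t_mtb = t_statistic(0.88, 64)
```

A large t statistic at a given df corresponds to a small p-value, which is why the strong correlations in Table 6 are accompanied by very small p-values.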


| Organism | df | Pearson's correlation | P-value |
|---|---|---|---|
| *M. tuberculosis* | 64 | 0.88 | 2.20e-16 |
| *M. bovis* | 62 | 0.81 | 4.4e-16 |
| *M. paratuberculosis* | 59 | 0.93 | 2.20e-16 |
| *M. avium* | 56 | 0.95 | 2.20e-16 |
| *M. ulcerans* | 59 | 0.93 | 2.20e-16 |
| *M. leprae* | 36 | 0.66 | 5.94e-06 |

Table 6. Results of the correlation calculations for maximum genetic distance versus cluster size, including degrees of freedom (df), Pearson's correlation coefficient values, and the corresponding p-values.

In addition to the maximum genetic distance, the average genetic distance for each paralogous gene family was computed to investigate the evolutionary relationships between family members within the selected mycobacterial genomes. To test the statistical significance of the scatter plot observations, correlation coefficients were estimated using Pearson's product-moment correlation (Table 7). The scatter plot (Figure 10) and the correlation coefficients (Table 7) suggest a moderate negative correlation between the average genetic distance and the cluster size of the paralogous gene families.

### **3.4 Further analysis of one example expanded gene family in** *M. tuberculosis*

While we have evolutionary data for all the orthologous and paralogous families of *M. tuberculosis*, we cannot show all the results here, so we have selected an important class of regulatory proteins as an example. The ability of *M. tuberculosis* to survive the stressful conditions in the host during infection is attributed to the existence of a diverse class of sigma factors in the organism (Fontan, 2009). The organism is suggested to contain numerous sigma factors that bind to the core subunit of RNA polymerase to provide promoter specificity (Fontan, 2008). To investigate the phylogenetic diversification of sigma factors in *M. tuberculosis* and other mycobacteria, we studied the sigma factors identified by our ortholog and paralog clustering methods. From the analysis of the sigma factor phylogenetic trees of *M. tuberculosis* (Figure 11), we infer that gene duplication events followed by divergence could have resulted in the bifurcation of the sigma factor class of proteins into two subfamilies (marked as A and B in the figure).


Fig. 10. Relationship between average genetic distance and cluster size for duplicate gene families of *M. tuberculosis* H37Rv, *M. bovis*, *M. paratuberculosis*, *M. avium*, *M. ulcerans* and *M. leprae*. The X-axis represents the cluster size (total proteins in each cluster) and the Y-axis shows the average genetic distance between the identified gene families of each organism. The average genetic distance appears to decrease with the cluster size, suggesting a negative correlation between these two factors.

### **3.4.1 Analysis of sigma factor proteins in subfamily A**

Following duplication, the proteins of this subfamily have diverged into two groups: SigE and SigM (Figure 11). The two proteins in the SigE group have themselves diverged following a further duplication. However, one of the proteins in the SigE group was found to have no orthologs in *M. leprae* (Figure 12), and loss of various sigma factors has been suggested as the reason for the reductive genome evolution of *M. leprae* (Babu, 2003). Interestingly, all the paralogs of *M. tuberculosis* except sigM appear to have orthologs in *M. bovis*. The absence of sigM proteins in *M. bovis*, together with the large divergence of this protein group in *M. tuberculosis* compared with other mycobacteria, leads us to speculate on its significance in *M. tuberculosis* evolution. Although an error in the available *M. bovis* sequences could have resulted in incorrect annotation of the sigM locus as a pseudogene (Manganelli *et al.*, 2004), the extent of divergence of this protein in *M. tuberculosis* compared with other mycobacteria prompts further investigation into its possible paths of pseudogenization or neofunctionalization.

Fig. 11. Phylogenetic tree of the Sigma factor paralog cluster inferred by the maximum likelihood method. The duplication events are marked by A's and B's. The figure displays the phylogenetic diversification of sigma factors (sigE, sigM, sigL, sigK, sigC and sigD).

### **3.4.2 Analysis of the sigma factor proteins in subfamily B**

From the analysis of the sigma factor paralogs in group B (Figure 11), we infer that duplication events followed by divergence resulted in two sigma factor subfamilies (marked as B1 and B2 in Figure 11). The sigma factor proteins in each of the subfamilies have diverged significantly after gene duplication. For the two sigma proteins (sigK and sigL) in one of the subfamilies (Figure 12), sigK was found to have no orthologs in *M. avium*, *M. paratuberculosis* or *M. leprae*, and sigL was found to have no orthologs in *M. leprae*. These results are inconsistent with the published reports of Manganelli *et al.*, 2003. For the other subfamily of sigma proteins (sigC and sigD) on the phylogenetic tree (Figure 12), we did not identify orthologs for sigC in *M. paratuberculosis* or for sigD in *M. leprae*.

**4. Discussion**

The availability of complete genome sequences of many bacteria and significant progress in the development of modern computational biology methods have resulted in a powerful platform for the comparative investigation of genome diversity across different organisms. Here, we make use of the wealth of genome information and bioinformatics tools to understand the significance of gene duplication in *M. tuberculosis* evolution. The investigation of relationships between the GC composition and duplicate gene percentages identified from the sequence and InterPro domain data provides sufficient evidence to suggest a positive correlation between them for group1 and group3 organisms. Here, the mycobacterial species are part of the group1 organisms, so the maintenance of
