**3. Results and discussion**

## **3.1 Identification of expanded gene families and relation to GC content and genome size**

We used sequence clustering and protein signature data to identify expanded gene families within and across several different microbial genomes. The across-genome clustering of protein sequence data yielded 1,984 expanded genes in 441 clusters for *M. tuberculosis* H37Rv. The protein signature method allowed us to group 30,885 proteins from all the organisms into 2,238 clusters. InterPro signatures usually match between 50% and 80% of a genome, so data is not available for every protein. Signature data enables identification of more distantly related members of a cluster, but loses data where proteins do not match InterPro. The sequence-based and signature-based cluster data was therefore merged and used to generate a phylogenetic profile, which reflects the number of copies in each expanded family for each organism. From this, we identified 2,011 duplicate/expanded genes in 461 clusters for *M. tuberculosis*, confirming previous reports that duplicate genes make up approximately half of the *M. tuberculosis* genome (Tekaia *et al.,* 1999). The percentages derived from the two methods are shown in Table 1 for the six mycobacteria studied. The 461 clusters in the merged data include gene families that are also expanded in different organisms.

| Organism | Sequence | Signature | Union |
|---|---|---|---|
| *M. tuberculosis* | 31.47% | 38% | 50.96% |
| *M. bovis* | 30.28% | 42% | 48.69% |
| *M. paratuberculosis* | 39.75% | 49% | 56.46% |
| *M. avium* | 42.06% | 49% | 55.19% |
| *M. ulcerans* | 36.82% | 46% | 53.51% |
| *M. leprae* | 12.03% | 20% | 30.44% |

Table 1. Percentage of the genome belonging to expanded gene families for the mycobacteria. Data was generated using sequence clustering, protein signatures and a combination of the two (union).

Next, we investigated the GC composition of different bacteria in relation to the duplicate gene percentages (Figure 1) to understand the characteristic features of genomes that maintain high percentages of duplicate genes. A statistical analysis of the data using Pearson's correlation revealed a moderate correlation between GC content and estimated duplicate gene percentages. However, the trend line of the scatter plot in Figure 1 reveals three different kinds of relationship in the data: i) an initial increase, ii) a phase of neutrality following the initial increase, and iii) a steady increase following the phase of neutrality. We analyzed histograms of GC content and duplicate gene percentages and observed differences in the modality of the distributions: GC percentages followed a trimodal distribution, while duplicate gene percentages followed a unimodal distribution. Thus, although a positive correlation can be inferred from the scatter plot, the correlation coefficient could be subdued by the differences in the modality of the data distributions. Hence, based on the scatter plot and the trimodal distribution of the GC percentage histogram, the organisms were grouped into three categories according to their GC composition:

**Group 1:** Organisms having GC content greater than 54 percent.

**Group 2:** Organisms having GC content greater than 44 percent and less than 54 percent.

**Group 3:** Organisms having GC content less than 44 percent.

We then performed a one-way ANOVA on the data (Table 2). Taking into consideration the mean square values (Mean Sq) and the calculated p-value of 2 × 10⁻¹⁶, we inferred that the variance between groups is significant compared to the within-group variance. These results indicate that the means of the three groups of organisms differ, and hence we reject the null hypothesis in favour of the alternative. Further, to estimate how significantly the means of the groups differ from one another, a Tukey's Honest Significant Difference (Tukey's HSD) test was performed, and significant differences were found between the mean values of group2 and group1, group3 and group1, and group3 and group2 (Table 3).

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Groups | 2 | 13174.8 | 6587.4 | 417.28 | <2.2e-16 \*\*\* |
| Residuals | 73 | | | | |

Signif. codes: 0 '\*\*\*' 0.001 '\*\*' 0.01 '\*' 0.05 '.' 0.1 ' ' 1

Table 2. One-Way ANOVA Results. The columns of the table display the degrees of freedom (Df), sum square values (Sum Sq), mean square values (Mean Sq), F value and p-value (Pr(>F)) reported by One-Way ANOVA for the data.

Tukey's Multiple Comparison of Means

| Groups | Diff | Lwr | Upr |
|---|---|---|---|
| G2-G1 | -16.36 | -19.08 | -13.64 |
| G3-G1 | -31.54 | -34.15 | -28.92 |
| G3-G2 | -15.18 | -17.88 | -12.48 |

Table 3. The table displays the differences between the mean values of the groups. The Groups column represents the compared groups: group2 and group1 (G2-G1), group3 and group1 (G3-G1), and group3 and group2 (G3-G2). The differences in the means of the groups are given in the difference (Diff) column, and the lower (Lwr) and upper (Upr) columns give the lower and upper boundaries of the estimated mean difference between the groups.
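The GC-based grouping and the one-way ANOVA described above can be illustrated with a minimal Python sketch. It is not the authors' analysis: the `toy_data` values are hypothetical, the function names `gc_group` and `one_way_anova` are introduced here for illustration, and placing boundary values (exactly 44 or 54 percent) in group 2 is an assumption, since the text does not say where they belong. The sketch computes the degrees of freedom, sums of squares, mean squares and F value, plus the raw pairwise differences in group means. It does not compute p-values or Tukey confidence bounds, which require the F and studentized range distributions.

```python
from statistics import mean


def gc_group(gc_percent):
    """Assign a GC-content group using the thresholds stated in the text."""
    if gc_percent > 54:
        return "G1"
    if gc_percent < 44:
        return "G3"
    return "G2"  # assumption: boundary values (44, 54) are placed in group 2


def one_way_anova(groups):
    """Return (df_between, df_within, ss_between, ss_within, f_value)."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    # Between-group sum of squares: group size times squared deviation
    # of the group mean from the grand mean.
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    # Within-group sum of squares: squared deviations from each group's own mean.
    ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, ss_between, ss_within, f_value


# Hypothetical (GC %, duplicate gene %) pairs, for illustration only.
toy_data = [
    (65.6, 51.0), (69.3, 56.0), (57.8, 30.0),
    (50.8, 35.0), (47.5, 33.0), (52.1, 38.0),
    (32.8, 15.0), (38.2, 22.0), (28.6, 12.0),
]

groups = {}
for gc, dup in toy_data:
    groups.setdefault(gc_group(gc), []).append(dup)

df_b, df_w, ss_b, ss_w, f_value = one_way_anova(groups)
print(f"Groups    Df={df_b} SumSq={ss_b:.1f} MeanSq={ss_b / df_b:.1f} F={f_value:.2f}")
print(f"Residuals Df={df_w} SumSq={ss_w:.1f} MeanSq={ss_w / df_w:.1f}")

# Pairwise differences in group means (the "Diff" column of a Tukey table).
for a, b in [("G2", "G1"), ("G3", "G1"), ("G3", "G2")]:
    print(f"{a}-{b}: {mean(groups[a]) - mean(groups[b]):.2f}")
```

A large F value means the variation between group means is large relative to the variation within groups, which is the comparison the Mean Sq column of an ANOVA table summarizes.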


Fig. 1. Scatter plot analysis of the relationship between GC content and duplicate gene percentages of the selected organisms. The percent GC content is plotted on the X-axis and duplicate gene percentage on the Y-axis. The graph suggests that a positive correlation between GC content and duplicate gene percentages exists for the majority of the investigated organisms.
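As a rough illustration of the Pearson correlation used to relate GC content to duplicate gene percentage, the following Python sketch computes the coefficient from first principles. The function name `pearson_r` and the sample values are hypothetical and are not taken from the study; the sketch does not compute a p-value.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance terms (the common 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical values, for illustration only.
gc_percent = [28.6, 32.8, 38.2, 47.5, 50.8, 52.1, 57.8, 65.6, 69.3]
duplicate_percent = [12.0, 15.0, 22.0, 33.0, 35.0, 38.0, 30.0, 51.0, 56.0]

print(f"r = {pearson_r(gc_percent, duplicate_percent):.2f}")
```

The coefficient summarizes only the linear component of a relationship, so a trend with distinct phases can produce a moderate value even when the overall association is clearly positive.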




From the results of Tukey's multiple comparison test (Table 3), it can be inferred that the comparisons G2-G1 and G3-G2 exhibit similar mean differences (-16.36 and -15.18). However, both of these mean differences are numerically higher (smaller in magnitude) than the mean difference of the G3-G1 comparison (-31.54). Therefore, group2 organisms, which have higher mean differences compared with group1 and group3, could be responsible for the reduced correlation coefficient values. Hence, excluding them from the list of investigated organisms could result in the prediction of a strong positive correlation between the GC composition and duplicate gene percentages of group1 and group3 organisms. Thus, we suggest that gene duplication events may be a characteristic feature of GC-rich bacterial genomes. Since all of the selected mycobacterial species in the present study are representatives of group1, the phenomenon of gene duplication in this genus could be attributed to its high GC content.

In addition to GC composition, we analyzed the influence of duplicate genes on the physical expansion of the genomes (Figure 2). An observed correlation coefficient of 0.84 at a p-value of 2 × 10⁻¹⁶ between genome size and duplicate genes provides strong evidence that duplicate genes contribute to the genome expansion of these organisms. This is not surprising, since the addition of genes through gene duplication will increase genome size unless some genes are lost in the process.

Fig. 2. The graph displays the relationship between duplicate gene percentage and genome size for the selected organisms. The identified duplicate gene percentages are plotted on the X-axis and genome size on the Y-axis. From the graph, a positive correlation can be observed between duplicate gene percentages and genome size.

## **3.2 Investigation of functional complexity of the duplicate and single copy genes**

In eukaryotes, single copy genes were reported to be shorter and to contain fewer domains than duplicate genes (He & Zhang, 2005). Here, we investigated the gene lengths and complexity (using domain number) of both pathogenic and non-pathogenic bacteria to determine the role of gene duplication in enhancing genome complexity in prokaryotes. As a preliminary assessment of the functional complexity of the expanded genes in the selected organisms, we plotted the mean gene lengths of the duplicate and single copy genes. Figure 3 shows that the average gene length in the majority of the organisms is higher for duplicate genes than for single copy genes. The difference in the mean gene lengths of duplicate and single copy genes was statistically analyzed using the Mann-Whitney U test. An observed W value of 4880 at a p-value of 2 × 10⁻¹³ from the Mann-Whitney U test confirms that the mean gene lengths of the duplicate genes are significantly higher than those of the single copy genes.

Fig. 3. Comparison of Functional Complexity of the Expanded and Single Copy Gene Families Based on Nucleotide Sequence Data. The graph displays the mean gene lengths of duplicate and single copy genes in the investigated organisms. The organisms are plotted on the X-axis and the corresponding mean gene length on the Y-axis.

We went on to investigate the domain complexity of single and duplicate copy genes to further enhance our understanding of the functional complexity of these organisms. As a measure of domain complexity, the number of domains present in each of the
