**3.2 Investigation of functional complexity of the duplicate and single copy genes**

In eukaryotes, single copy genes were reported to be shorter and to contain fewer domains than duplicate genes (He & Zhang, 2005). Here, we investigated the gene lengths and complexity (measured as domain number) of both pathogenic and non-pathogenic bacteria to determine the role of gene duplication in enhancing genome complexity in prokaryotes. As a preliminary assessment of the functional complexity of the expanded genes in the selected organisms, we plotted the mean gene lengths of the duplicate and single copy genes. Figure 3 shows that, in the majority of the organisms, the average gene length is higher for duplicate genes than for single copy genes. The difference in the mean gene lengths of duplicate and single copy genes was tested statistically using the Mann-Whitney U test. An observed W value of 4880 at a p-value of 2 × 10⁻¹³ estimated from the Mann-Whitney U test confirms that the mean gene lengths of the

Fig. 3. Comparison of Functional Complexity of the Expanded and Single Copy Gene Families Based on Nucleotide Sequence Data. The graph displays the mean gene lengths of duplicate and single copy genes in the investigated organisms. The organisms are plotted on the X-axis and the corresponding mean gene length on the Y-axis.
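The rank-based statistic behind the Mann-Whitney U test can be illustrated with a minimal sketch. The code below is not the authors' implementation; Python is used here only for illustration, the function names are hypothetical, and the input values are invented placeholders rather than data from this study. It computes the U statistic for the first sample (which some statistics packages report as W) and leaves out the p-value calculation.

```python
def average_ranks(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """U statistic of sample_a versus sample_b, computed from the rank sum of sample_a."""
    ranks = average_ranks(list(sample_a) + list(sample_b))
    n_a = len(sample_a)
    rank_sum_a = sum(ranks[:n_a])
    return rank_sum_a - n_a * (n_a + 1) / 2


# Hypothetical mean gene lengths (bp), one value per organism; NOT data from the study.
duplicate_means = [1010.0, 1120.0, 980.0, 1075.0]
single_means = [930.0, 990.0, 905.0, 960.0]

u_duplicate = mann_whitney_u(duplicate_means, single_means)
print(f"U (duplicate vs single copy) = {u_duplicate}")
```

U equals the number of (duplicate, single copy) pairs in which the duplicate value is larger, with ties counted as one half, so a value close to the product of the two sample sizes indicates that the first sample tends to be larger.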

We went on to investigate the domain complexity of single copy and duplicate genes to further our understanding of the functional complexity of these organisms. As a measure of domain complexity, the number of domains present in each of the corresponding proteins of the genes was computed from InterPro data, and the mean number of domains was estimated for the duplicate and single copy genes using Perl scripts. Figure 4 shows that the mean number of domains per duplicate gene is lower than that of the single copy genes. This suggests that single copy genes are the more complex of the two groups, since they contain more domains. We further analyzed the results by statistically comparing the mean domain numbers of duplicate and single copy genes using the Mann-Whitney U test. From the resulting W value of 5717 at a p-value of 2 × 10⁻¹⁶, we inferred that the mean number of domains per single copy gene is significantly higher than that of duplicate genes. Thus, these studies suggest that the single copy genes are functionally more complex than duplicate genes. This was a surprising result, given that the mean length of duplicate genes was found to be higher than that of the single copy genes. Therefore, we specifically investigated the influence of gene length on domain complexity in *M. tuberculosis*, and compared this statistic with two other organisms, *Leptospira interrogans* and the model organism *Escherichia coli*.

Analysis of Duplicate Gene Families in Microbial Genomes and Application to the Study of Gene Duplication in *M. tuberculosis* 181

Fig. 4. Comparison of Functional Complexity of the Expanded and Single Copy Genes Based on InterPro Signature Data in Selected organisms. The graph displays the mean number of domains per protein in the duplicate and single copy genes of each organism. The organisms investigated are plotted on the X-axis and the corresponding mean number of domains per protein for each organism on the Y-axis.
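The domain counting step described above can be sketched as follows. The authors report using Perl scripts on InterPro data; this Python version is only an illustrative stand-in, and the gene identifiers, domain labels and the decision to count genes without any annotation as having zero domains are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean

# Hypothetical (protein_id, domain_label) pairs, such as might be parsed from
# a tab-separated domain annotation file. Identifiers are invented.
annotations = [
    ("geneA", "DOM1"), ("geneA", "DOM2"),
    ("geneB", "DOM3"),
    ("geneC", "DOM1"), ("geneC", "DOM4"), ("geneC", "DOM5"),
    ("geneD", "DOM2"),
]

# Hypothetical classification of genes; geneE has no annotated domain.
duplicate_genes = ["geneA", "geneB", "geneE"]
single_copy_genes = ["geneC", "geneD"]

# Number of annotated domains per protein (one annotation row = one domain).
domain_counts = Counter(protein_id for protein_id, _ in annotations)


def mean_domains(gene_ids, counts):
    """Mean number of domains per gene; unannotated genes count as zero (assumption)."""
    return mean([counts.get(gene_id, 0) for gene_id in gene_ids])


print("duplicate:", mean_domains(duplicate_genes, domain_counts))
print("single copy:", mean_domains(single_copy_genes, domain_counts))
```

Whether unannotated genes are included as zeros or excluded changes the group means, so that choice should be stated alongside any comparison of this kind.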


For each of these three organisms, the total number of genes in the genome was retrieved, and for every gene, we estimated the total number of domains to determine the relationship between gene length and domain number (Figure 5, Whole Genome Analysis). In addition, the number of domains for each of the duplicate (Figure 6) and single copy genes (Figure 7) was also estimated. A preliminary analysis of the relationships using scatter plots suggested that the number of domains per gene does not necessarily increase with gene length. Correlation coefficients estimated with Pearson's product-moment correlation were then used to assess these relationships statistically.

Fig. 5. Investigation of number of domains per gene in the *M. tuberculosis* H37Rv, *E. coli* and *L. interrogans* genomes (Whole Genome Analysis). The graph displays the relationships between gene length and number of domains. The sequence lengths of each gene are plotted on the X-axis and the corresponding number of domains per gene on the Y-axis. From the graph, it can be inferred that the gene length is not necessarily dependent on domain number.
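For reference, Pearson's product-moment correlation between gene length and domain number can be computed as in the sketch below. This is a generic textbook implementation rather than the authors' code; the function name is hypothetical and the six data points are invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical gene lengths (bp) and domain counts; NOT data from the study.
gene_lengths = [300, 600, 900, 1200, 1500, 1800]
domain_counts = [1, 1, 2, 1, 3, 2]

r = pearson_r(gene_lengths, domain_counts)
print(f"r = {r:.2f}")
```

The coefficient ranges from -1 to 1 and measures only the linear association, so a scatter plot remains a useful companion to the number.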


| Gene set | *E. coli* correlation | P-value | H37Rv correlation | P-value | *L. interrogans* correlation | P-value |
|---|---|---|---|---|---|---|
| All proteins | 0.48 | 2.2e-16 | 0.58 | 2.2e-16 | 0.39 | 2.2e-16 |
| Duplicate | 0.47 | 2.2e-16 | 0.62 | 2.2e-16 | 0.25 | 2.64e-10 |
| Single | 0.49 | 2.2e-16 | 0.59 | 2.2e-16 | 0.53 | 2.2e-16 |

Table 4. Correlation Coefficient and P-values of the whole genome, duplicate and single copy gene analysis in *E. coli, M. tuberculosis* H37Rv and *L. interrogans.* The table displays the results of Pearson's product-moment correlation.

Fig. 6. Investigation of number of domains per duplicate gene in the *M. tuberculosis* H37Rv, *E. coli* and *L. interrogans* genomes (Duplicate Gene Analysis). The graph displays the relationship between sequence length (X-axis) and number of domains (Y-axis) of the duplicate genes.

The correlation coefficients for the whole genome, duplicate gene and single copy gene analyses in *E. coli* were 0.48, 0.47, and 0.49, respectively, while the corresponding values for *L. interrogans* were 0.39, 0.25, and 0.53 (Table 4). The results from these two organisms suggest that the number of domains increases only weakly to moderately with gene length, and hence that domain complexity may be largely independent of gene length. Although the correlation coefficients of 0.58, 0.62 and 0.59 for the whole genome, duplicate and single copy gene analyses, respectively, in *M. tuberculosis* are higher than those for *E. coli* and *L. interrogans*, they still do not indicate a strong positive correlation between domain complexity and gene length. Thus, the protein complexity studies of these three genomes show that it is not necessarily surprising that duplicate genes, while generally longer than single copy genes, tend to contain fewer domains.
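One way to read these coefficients is through the coefficient of determination, r², which gives the proportion of variance in one variable accounted for by a linear relationship with the other. The short sketch below applies this standard conversion to the values in Table 4; the r² figures are an added illustration, not results reported by the authors, and the dictionary layout and function name are hypothetical.

```python
# Pearson coefficients copied from Table 4.
table4_r = {
    "E. coli": {"all": 0.48, "duplicate": 0.47, "single": 0.49},
    "M. tuberculosis H37Rv": {"all": 0.58, "duplicate": 0.62, "single": 0.59},
    "L. interrogans": {"all": 0.39, "duplicate": 0.25, "single": 0.53},
}


def variance_explained(r):
    """Coefficient of determination for a simple linear relationship."""
    return r * r


for organism, by_gene_set in table4_r.items():
    for gene_set, r in by_gene_set.items():
        print(f"{organism:22s} {gene_set:10s} r = {r:.2f}  r^2 = {variance_explained(r):.2f}")

# Largest coefficient in the table (duplicate genes of M. tuberculosis H37Rv).
largest_r = max(r for by_gene_set in table4_r.values() for r in by_gene_set.values())
```

Even for the largest coefficient in the table (0.62), gene length accounts for less than 40% of the variance in domain number under this reading, which is consistent with longer genes not necessarily containing more domains.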


Fig. 7. Investigation of number of domains per single copy gene in the *M. tuberculosis*  H37Rv, *E. coli* and *L. interrogans* genomes (Single Copy Gene Analysis). The graph displays the relationship between sequence length and number of domains of the single copy genes in these genomes.



Fig. 8. Distribution of functions in *M. tuberculosis* (all 390 clusters) and *M. leprae*-shared (116 clusters) expanded families.

most distant proteins in each of these sets increases with an increase in cluster size, it was inferred that some of the duplicate copies show a tendency to diverge from the original ancestral functions after multiple duplication events in bigger families. To investigate the average divergence of proteins in these clusters, the relationship between average genetic distance and cluster size was determined. The results suggest that the average genetic distance between the gene families does not increase with cluster size, except perhaps for the few larger families. To verify the results statistically, correlation coefficients were estimated using Pearson's product-moment correlation. The correlation coefficient value of 0.43 at a p-value of 1.17 × 10⁻⁶ indicates the presence of moderate
