**1. Introduction**

Although considerable sequence information from different organisms was available before the recent advances in genome sequencing technology, the foundation for our current understanding of the mechanisms of bacterial pathogenesis was laid by the release of the first complete genome sequence of *Haemophilus influenzae* in 1995 (Fraser-Liggett, 2005). Since then, various genome sequencing projects have steadily increased the number of genomes for which data are available (Koonin & Wolf, 2008). Although complete genome sequences are available for many pathogenic organisms, mortality due to these infectious agents remains a problem, highlighting the need to decipher the complex molecular mechanisms responsible for bacterial survival. The wealth of complete genome information for pathogens can be explored effectively using comparative genomic tools to identify the common and unique sets of genes involved in the propagation of virulence. Sequence comparison tools have been developed to identify homologous genes in the complete genomes of microorganisms. Homologous genes that arise from speciation tend to retain functions similar to those of their ancestral molecule and are known as orthologs, whereas genes originating from duplication events often evolve new functions and are defined as paralogs (Tatusov *et al.,* 1997).
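
The distinction between homologs within one genome and homologs shared between genomes can be illustrated with a minimal Python sketch. It assumes a cluster of homologous proteins is already available as (genome, protein) pairs; the function name, genome labels and protein identifiers are hypothetical and are not taken from the study. Pairs from the same genome are treated as paralog candidates, while pairs from different genomes may be either orthologs or paralogs, and separating those two cases would require phylogenetic analysis.

```python
from itertools import combinations


def label_homolog_pairs(cluster):
    """Split all protein pairs of one homolog cluster by genome of origin.

    cluster: list of (genome, protein_id) tuples (hypothetical input format).
    Returns a dict with two lists of protein-id pairs:
      'within_genome'  -> same genome, i.e. paralog candidates
      'between_genome' -> different genomes, orthologs or paralogs
    """
    labels = {"within_genome": [], "between_genome": []}
    # Sort first so that the output order is deterministic.
    for (g1, p1), (g2, p2) in combinations(sorted(cluster), 2):
        key = "within_genome" if g1 == g2 else "between_genome"
        labels[key].append((p1, p2))
    return labels


# Hypothetical cluster: two proteins from genome_A, one from genome_B.
example_cluster = [
    ("genome_A", "A_001"),
    ("genome_A", "A_002"),
    ("genome_B", "B_001"),
]
pairs = label_homolog_pairs(example_cluster)
```

Here the single within-genome pair marks a candidate duplication in genome_A, and the two between-genome pairs are the ones that would need a gene tree to classify further.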

The world of microbes is highly diverse, with genome complexity differing across a wide range of microorganisms. In general, differences in genome complexity are dictated by the lifestyle and environment of the organism (Cordero & Hogeweg, 2009). Lifestyle plays an important role in regulating the genome dynamics of an organism, and the functional novelty provided by gene duplication is thought to enhance the organism's capacity to adapt. In addition, horizontal transfer of operons or functional units of genes from external sources may provide an immediate functional benefit to the organism, thereby adding to the functional complexity of the genome. The complete genome sequences available for important mycobacteria such as *Mycobacterium tuberculosis*, *Mycobacterium ulcerans*, *Mycobacterium bovis*, *Mycobacterium leprae*, *Mycobacterium paratuberculosis*, *Mycobacterium avium* and others can be used to gain deeper insights into possible mechanisms of prokaryotic gene innovation. Genome data have been used in recent studies to compare different species of mycobacteria, as well as different strains, to understand the evolution and pathogenesis of *M. tuberculosis* (Marri *et al.,* 2006). In our study, duplicate gene sets from different mycobacteria were investigated to identify the distribution of important functional classes of protein families, and the evolution of these functional classes was further analyzed by comparing their genetic divergence following duplication.

The importance of gene duplication in prokaryotic gene innovation is well established, and comparative analysis of duplicate genes against basic genome features such as GC content and genome size may aid in deciphering their contributions. In contrast to eukaryotes, GC content varies widely across different bacterial genomes (Mann & Chen, 2010), and analysis of GC variations between related bacteria could be useful in establishing evolutionary relationships (Mann *et al.,* 2010). Most earlier studies focused on the role of GC composition in horizontal gene transfer (HGT) (Nelson *et al.,* 1999; Hamady *et al.*, 2006), transcription start and stop sites (Zhang *et al.,* 2004), nucleotide substitution rates (DeRose-Wilson & Gaut, 2007), optimal growth temperature (Basak & Ghosh, 2005; Musto, 2006) and metabolic characteristics (Naya *et al.,* 2002). Furthermore, genome size has been reported to increase with the number of genes in duplicate gene families (Snel *et al.,* 2002; Pushker *et al.,* 2004). In this study we analyzed the genomes of 56 pathogenic and 20 non-pathogenic microorganisms to identify and characterize the expanded gene families across these organisms. In addition to GC content, we investigated the relationship between genome size and duplicate gene percentage. On finding sufficient evidence for a correlation between genome size and the extent of gene duplication, we further investigated the significance of duplicate genes in enhancing genome complexity. He and Zhang (2005) previously reported the importance of gene duplication in enhancing genome and organism complexity in eukaryotes. However, because the selective pressures operating on prokaryotic and eukaryotic genomes differ, we used the duplicate and single copy genes to investigate the influence of protein lengths on the genome and organism complexity of prokaryotic organisms, with a specific focus on the role of duplicate genes in enhancing the genome complexity of *M. tuberculosis*.

Analysis of Duplicate Gene Families in Microbial Genomes and Application to the Study of Gene Duplication in *M. tuberculosis*

signature data from InterPro enabled the identification of approximately 27,827 proteins which exhibited complete domain identity (same InterPro matches) over their entire length to one or more proteins in *M. tuberculosis* strain H37Rv. Within each organism and across all organisms, the proteins showing complete domain identity were grouped together as duplicate gene sets or ortholog and paralog sets, respectively, and those with no common signature matches were considered to be single copies. In addition to the identification of expanded families using InterPro data, homologous sequences were clustered using BlastClust in two separate clustering procedures:

a. **Independent Genome Clustering**: This involves within-genome clustering to generate clusters of paralogs or protein families for each genome. BlastClust was executed at a wide range of percentage identities over varying lengths of the sequence to select the optimum parameters. Amongst the tested parameters, a 30% similarity over 60% sequence length cut-off was chosen, as it generated a suitable number of clusters (in line with previously reported numbers of duplicated families for *M. tuberculosis*).

b. **Multiple Genome Clustering**: In this, all of the 76 genomes were appended together for the clustering of related proteins (orthologs and paralogs). In addition, the clustering of six of the mycobacterial species was performed separately for the evolutionary analysis of expanded gene families in *M. tuberculosis*.

**2.2 Evolutionary analysis**

For evolutionary studies, in addition to 66 paralogous gene clusters, 116 multiple genome clusters from the phylogenetic matrix of six of the closely related mycobacterial genomes that showed gene family expansions in both *M. tuberculosis* and *M. leprae*, as well as other mycobacteria, were selected. The proteins in each of the clusters were aligned with T-Coffee (Notredame *et al.*, 2000), and poorly aligned regions were edited using the Gblocks program (Castresana, 2000) with adjustments in the default settings for the generation of optimal sequence alignments. For each of these protein alignments, the best-fitting amino acid substitution model was selected according to the Akaike information criterion, and the gamma correction factor (alpha), the proportion of invariable sites (I), and observed amino acid frequencies (F) were estimated and selected for subsequent phylogenetic analysis using ProtTest (Abascal *et al.*, 2005). Since PhyML is a maximum likelihood method able to incorporate the estimated values of alpha, the proportion of invariable sites, and the observed frequencies, the tree topologies for the gene sets in the identified clusters were constructed using this program (Guindon & Gascuel, 2003). The genetic distance measures from each of the estimated tree topologies were used to compute average and maximum genetic distance using Perl scripts.

**3. Results and discussion**

**3.1 Identification of expanded gene families and relation to GC content and genome size**

We used sequence clustering and protein signature data to identify expanded gene families within and across several different microbial genomes. The across-genome clustering of protein sequence data yielded 1,984 expanded genes in 441 clusters for *M. tuberculosis* H37Rv. The protein signature method allowed us to group 30,885 proteins into 2,238 clusters from all the organisms. InterPro signatures usually match between 50% and 80% of a genome, so data are not available for every protein. Since signature data enables identification
