mechanisms of prokaryotic gene innovation. Genome data have been used in recent studies to compare different species of mycobacteria, as well as different strains, to understand the evolution and pathogenesis of *M. tuberculosis* (Marri *et al.,* 2006). In our study, duplicate gene sets from different mycobacteria were investigated to identify the distribution of important functional classes of protein families, and the evolution of these functional classes was further analyzed by comparing their genetic divergence following duplication.

The importance of gene duplication in prokaryotic gene innovation is well established, and comparing duplicate genes with basic genome features such as GC content and genome size may help to decipher their contributions. In contrast to eukaryotes, GC content varies widely across different bacterial genomes (Mann & Chen, 2010), and analysis of GC variations between related bacteria could be useful in establishing evolutionary relationships (Mann *et al.,* 2010). Most earlier studies focused on deciphering the role of GC composition in horizontal gene transfer (HGT) (Nelson *et al.,* 1999; Hamady *et al.*, 2006), transcription start and stop sites (Zhang *et al.,* 2004), nucleotide substitution rates (DeRose-Wilson & Gaut, 2007), optimal growth temperature (Basak & Ghosh, 2005; Musto, 2006) and metabolic characteristics (Naya *et al.,* 2002). Furthermore, genome size has been reported to increase with the number of genes in duplicate gene families (Snel *et al.,* 2002; Pushker *et al.,* 2004). In this study we analyzed the genomes of 56 pathogenic and 20 non-pathogenic microorganisms to identify and characterize the expanded gene families across these organisms. In addition to the GC content, we investigated the relationship between genome size and duplicate gene percentage. On finding sufficient evidence for a correlation between genome size and extent of gene duplication, we further investigated the significance of duplicate genes in enhancing genome complexity. He and Zhang (2005) previously reported the importance of gene duplication in enhancing genome and organism complexity in eukaryotes. However, because different selective pressures operate on prokaryotic and eukaryotic genomes, we used the duplicate and single copy genes to investigate the influence of protein lengths on genome and organism complexity of prokaryotic organisms, with a specific focus on investigating the role of duplicate genes in enhancing the genome complexity of *M. tuberculosis*.

**2. Materials and methods**

**2.1 Data selection and identification of homologous sequences**

Comparative sequence analysis of different genomes is the most common approach for identifying orthologs and paralogs. Here, however, we used both sequence and protein signature data, since the latter can substantiate the former and enables identification of more distantly related members of a protein family. We collected non-redundant protein sets for 76 microorganisms, including pathogens and non-pathogens, to identify expanded gene families in these organisms. The inclusion of non-pathogenic bacteria in this study is of value, since many of these may also contain virulence genes which could act as barriers conferring protection against the defense mechanisms of the host, thus enhancing the survival capabilities and adapting the organism to intracellular conditions. In addition, acquisition of specific virulence gene clusters can transform these non-pathogenic agents into pathogenic microorganisms.

For the selected organisms, approximately 191,497 protein signatures and 247,858 protein sequences, together with genome size and G+C composition data, were retrieved from the InterPro (http://www.ebi.ac.uk/interpro) (Apweiler *et al.,* 2001; Mulder *et al.,* 2007) and Integr8 (http://www.ebi.ac.uk/integr8) (Kersey *et al.,* 2005) databases, respectively. The protein

**2.2 Evolutionary analysis**

For evolutionary studies, in addition to 66 paralogous gene clusters, we selected 116 multiple-genome clusters from the phylogenetic matrix of six closely related mycobacterial genomes that showed gene family expansions in both *M. tuberculosis* and *M. leprae*, as well as in other mycobacteria. The proteins in each cluster were aligned with T-Coffee (Notredame *et al.*, 2000), and poorly aligned regions were edited using the Gblocks program (Castresana, 2000), with the default settings adjusted to generate optimal sequence alignments. For each of these protein alignments, the best-fitting amino acid substitution model was selected according to the Akaike information criterion, and the gamma correction factor (alpha), the proportion of invariable sites (I), and the observed amino acid frequencies (F) were estimated with ProtTest (Abascal *et al.*, 2005) for use in the subsequent phylogenetic analysis. Since PhyML is a maximum likelihood method that can incorporate the estimated values of alpha, the proportion of invariable sites, and the observed frequencies, the tree topologies for the gene sets in the identified clusters were constructed using this program (Guindon & Gascuel, 2003). The genetic distance measures from each of the estimated trees were used to compute the average and maximum genetic distance using Perl scripts.
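The final step, summarizing genetic distances per tree, can be illustrated with a minimal sketch. The Perl scripts used in the study are not shown, so the Python below is only an illustrative stand-in. It assumes that "genetic distance" means the pairwise patristic distance between leaves (the sum of branch lengths along the path joining them) in a Newick-formatted tree of the kind PhyML writes. The function names `leaf_paths` and `distance_summary` are hypothetical, and the small parser does not handle quoted labels.

```python
from itertools import combinations


def leaf_paths(newick):
    """Map each leaf name to the list of (edge_id, length) edges between it and the root."""
    s = "".join(newick.split()).rstrip(";")  # drop whitespace and the trailing semicolon
    pos = 0
    next_edge_id = 0
    paths = {}

    def read_label():
        # Read characters up to the next Newick delimiter (or the end of the string).
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in ",():":
            pos += 1
        return s[start:pos]

    def parse_node():
        # Parse one subtree and return the names of the leaves below it.
        nonlocal pos, next_edge_id
        if s[pos] == "(":
            leaves = []
            pos += 1  # consume "("
            while True:
                leaves.extend(parse_node())
                if s[pos] == ",":
                    pos += 1
                    continue
                pos += 1  # consume ")"
                break
            read_label()  # internal label or support value, ignored here
        else:
            name = read_label()
            paths[name] = []
            leaves = [name]
        length = 0.0
        if pos < len(s) and s[pos] == ":":
            pos += 1
            length = float(read_label())
        edge = (next_edge_id, length)
        next_edge_id += 1
        for leaf in leaves:
            paths[leaf].append(edge)
        return leaves

    parse_node()
    return paths


def distance_summary(newick):
    """Return (average, maximum) pairwise patristic distance over all leaf pairs."""
    paths = leaf_paths(newick)
    if len(paths) < 2:
        raise ValueError("need at least two leaves to compute pairwise distances")
    distances = []
    for a, b in combinations(sorted(paths), 2):
        # Edges shared by both leaves lie above their common ancestor and cancel out.
        unshared = set(paths[a]) ^ set(paths[b])
        distances.append(sum(length for _, length in unshared))
    return sum(distances) / len(distances), max(distances)


if __name__ == "__main__":
    # Toy three-leaf tree with made-up branch lengths, for illustration only.
    average, maximum = distance_summary("((A:0.1,B:0.2):0.05,C:0.3);")
    print(round(average, 4), round(maximum, 4))
```

Applied to every cluster tree, a function of this kind would yield one average and one maximum distance per gene family. If the intended measure were instead individual branch lengths or distances from a reference sequence, the summary step would need to change accordingly.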
