**2. Materials and methods**

### **2.1 Data selection and identification of homologous sequences**

Comparative sequence analysis of different genomes is the most common approach for identifying orthologs and paralogs. Here, however, we used both sequence and protein signature data, because the latter can substantiate the former and enables the identification of more distantly related members of a protein family. We collected non-redundant protein sets for 76 microorganisms, including pathogens and non-pathogens, to identify expanded gene families in these organisms. Non-pathogenic bacteria were included because many of them may also carry virulence genes that could act as barriers against the defense mechanisms of the host, thereby enhancing survival and adapting the organism to intracellular conditions. In addition, acquisition of specific virulence gene clusters can transform these non-pathogenic agents into pathogenic microorganisms.
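As an illustration of how signature data can group proteins into candidate families, the following is a minimal sketch and not the pipeline used in this study. It assumes each protein is annotated with a set of signature accessions and treats proteins with exactly the same set as one group. The function name `group_by_signature`, the input format, and the accessions `SIG_A`, `SIG_B`, and `SIG_C` are hypothetical.

```python
from collections import defaultdict


def group_by_signature(proteins):
    """Group protein IDs that share exactly the same set of signatures.

    `proteins` maps a protein ID to an iterable of signature accessions
    (a hypothetical input format). Proteins with no signatures are skipped.
    """
    groups = defaultdict(list)
    for protein_id, signatures in proteins.items():
        key = frozenset(signatures)  # order-independent, hashable key
        if key:
            groups[key].append(protein_id)
    # Keep only groups with two or more members; the rest are single copies.
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}


# Toy, made-up annotations for illustration only.
example = {
    "P1": ["SIG_A", "SIG_B"],
    "P2": ["SIG_B", "SIG_A"],
    "P3": ["SIG_C"],
    "P4": [],
}
families = group_by_signature(example)
```

In this toy input, `P1` and `P2` fall into one group because their signature sets are identical regardless of order, while `P3` is a single copy and `P4` is ignored for lack of annotation.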

For the selected organisms, approximately 191,497 protein signatures, 247,858 protein sequences, and genome size and G+C composition data were retrieved from the InterPro (http://www.ebi.ac.uk/interpro) (Apweiler *et al.,* 2001; Mulder *et al.,* 2007) and Integr8 (http://www.ebi.ac.uk/integr8) (Kersey *et al.,* 2005) databases, respectively. The protein signature data from InterPro enabled the identification of approximately 27,827 proteins that exhibited complete domain identity (the same InterPro matches) over their entire length to one or more proteins in *M. tuberculosis* strain H37Rv. Proteins showing complete domain identity were grouped as duplicate gene sets within each organism and as ortholog and paralog sets across all organisms; those with no common signature matches were considered single copies. In addition to the identification of expanded families using InterPro data, homologous sequences were clustered using BlastClust in two separate clustering procedures:

