#### **2.7. Clades and superclades**

For biological, historical, and sampling reasons, microbial species differ widely in their level of within-species strain variation. Using the genome data available in public archives, we calculated the diameter of each species defined by NCBI Taxonomy (see Figure 2).

**Figure 2.** Distribution of Taxonomy-defined species diameter. Y axis – diameter of species, X axis – species numbered in descending order of diameter.
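The ranking plotted in Figure 2 can be illustrated with a minimal sketch. This excerpt does not restate how the diameter is computed, so the code assumes it is the largest pairwise distance between genomes assigned to the same species. The names `species_diameters`, `species_by_genome`, and `distance` are hypothetical and serve only as illustration.

```python
from itertools import combinations


def species_diameters(species_by_genome, distance):
    """Return (species, diameter) pairs sorted by descending diameter.

    Assumption: diameter = maximum pairwise distance among the genomes
    of one species. `distance(a, b)` is any symmetric genome distance.
    """
    members = {}
    for genome, species in species_by_genome.items():
        members.setdefault(species, []).append(genome)

    diameters = {
        species: max(
            (distance(a, b) for a, b in combinations(genomes, 2)),
            default=0.0,  # a species with a single genome has diameter 0
        )
        for species, genomes in members.items()
    }
    # Descending order, matching the X axis of Figure 2.
    return sorted(diameters.items(), key=lambda item: item[1], reverse=True)
```

Any symmetric distance function can be passed in; the sketch does not depend on how the distances were obtained.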

Instead of using one fixed threshold, we use a taxonomy-aware algorithm that allows a genomic group to grow under certain circumstances. Two distance thresholds, a lower threshold d\_lower and an upper threshold d\_upper, are established (currently, d\_lower = 0.015 and d\_upper = 0.025). Genomes whose lowest common ancestor has height d\_lower or below are always placed in the same group, while genomes whose lowest common ancestor has height above d\_upper are never placed together. Between d\_lower and d\_upper, taxonomic information is used: two subgroups are merged into a larger group only if every pair of species in the merged group is already together in one of the two subgroups (i.e., the merge brings no new species together). Species are defined according to the NCBI taxonomic records [16].
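The two-threshold rule can be sketched as a bottom-up pass over a tree. This is an illustrative sketch, not the authors' implementation. It assumes a binary tree whose node heights do not decrease toward the root, and it assumes a node can only be merged when each of its children already forms a single group. For two subgroups with species sets S1 and S2, the "no new species pairs" condition holds exactly when one set contains the other, which is the test used below. The `Node` class and `group_genomes` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

D_LOWER = 0.015  # threshold values quoted in the text
D_UPPER = 0.025


@dataclass
class Node:
    """Hypothetical tree node; a leaf carries a genome and its species."""
    height: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    genome: Optional[str] = None
    species: Optional[str] = None


def group_genomes(node, d_lower=D_LOWER, d_upper=D_UPPER):
    """Return (groups, whole).

    groups: list of (genome list, species set) pairs under this node.
    whole:  True if the entire subtree ended up as one group.
    """
    if node.genome is not None:  # leaf
        return [([node.genome], {node.species})], True

    left_groups, left_whole = group_genomes(node.left, d_lower, d_upper)
    right_groups, right_whole = group_genomes(node.right, d_lower, d_upper)

    if left_whole and right_whole and node.height <= d_upper:
        (l_genomes, l_species) = left_groups[0]
        (r_genomes, r_species) = right_groups[0]
        always = node.height <= d_lower
        # No new species pair arises iff one species set contains the other.
        no_new_pairs = l_species <= r_species or r_species <= l_species
        if always or no_new_pairs:
            return [(l_genomes + r_genomes, l_species | r_species)], True

    # Above d_upper, or taxonomy forbids the merge: keep groups separate.
    return left_groups + right_groups, False
```

With this rule, a subtree between the two thresholds absorbs additional genomes of species it already contains, but it is never joined to a subgroup that would bring two previously separate species together.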

Phylum-level trees are not practical for presenting and evaluating closely related genomes. However, it is important to see the relationships (distances) between closely related clades (see Figure 3).

#### **2.8. Genome groups**

Species-level clades are further refined by whole-genome alignments using megablast with default parameters [18]. The genome groups are defined by clustering the genomes at 95% identity and 90% coverage. An example of genome groups for the *Klebsiella pneumoniae* clade is shown in Figure 4. For each group, a representative genome with the highest level of assembly and annotation quality is selected.
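The grouping step can be illustrated with a short sketch. The text does not state which clustering method is used, so the code assumes single-linkage clustering: any pair of genomes meeting both thresholds is linked, and connected components become groups. The input format (tuples of genome pair, percent identity, percent coverage) and the `quality` scores are hypothetical stand-ins for parsed alignment results and assembly/annotation metadata.

```python
def genome_groups(genomes, pairs, min_identity=95.0, min_coverage=90.0):
    """Cluster genomes by linking pairs that pass both thresholds.

    pairs: iterable of (genome_a, genome_b, identity_pct, coverage_pct).
    Assumption: single-linkage clustering via union-find.
    """
    parent = {g: g for g in genomes}

    def find(g):
        # Walk to the root, compressing the path along the way.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, identity, coverage in pairs:
        if identity >= min_identity and coverage >= min_coverage:
            parent[find(a)] = find(b)

    groups = {}
    for g in genomes:
        groups.setdefault(find(g), []).append(g)
    return list(groups.values())


def pick_representative(group, quality):
    """Pick the genome with the best (hypothetical) quality score.

    quality maps genome -> comparable score, e.g. a tuple of
    (assembly level rank, annotation score); higher is better.
    """
    return max(group, key=lambda g: quality[g])
```

Under single linkage, two genomes can share a group without directly meeting the thresholds themselves, as long as a chain of qualifying pairs connects them; a stricter method would yield smaller groups.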


**Figure 3.** Superclade tree for three abundant groups: A – *Salmonella*, B – *Bacillus*, C – *Streptococcus*. Green boxes represent clades; box size is proportional to the number of genomes in a clade.
