**2.4. Marker to genome alignment**

with the organisms registered in the NCBI Taxonomy Database. For the first 10 years of microbial genome sequencing, each species had a unique genome representation in public sequence archives. When sequencing costs decreased, researchers began to explore microbial population structure and intraspecies differences. The NCBI Taxonomy group began assigning Taxonomy IDs to strain-level nodes as proxies for unique genome identifiers. More recently, next-generation sequencing and rapid pathogen detection approaches have shifted the paradigm from a single isolate representing an organism to multi-isolate projects, often representing almost identical isolates from an outbreak analysis. These closely related genomes differ by metadata only: patient information, date, and place of sample collection. NCBI has created new resources that capture the sequence data and metadata: BioProject, BioSample, and Assembly [16]. A triplet of these identifiers uniquely defines a

The NCBI internal database UniCol stores collections of the nucleotide and protein sequence data associated with every BioProject, BioSample, and Assembly triplet. The database provides a tracking history for a given snapshot with the sequence assembly and metadata

| Clade\_id | Name | Genomes | Clonal groups | Taxonomy |
|---|---|---|---|---|
| | *Staphylococcus aureus* | 4182 | 118 | species |
| | *Escherichia, Shigella* | 2479 | 986 | multiple |
| | *Mycobacterium tuberculosis* | 1844 | 11 | species |
| | *Salmonella* | 971 | 139 | genus |
| | *Acinetobacter* | 846 | 306 | genus |
| | *Helicobacter pylori* | 432 | 258 | species |
| | *Streptococcus* | 394 | 154 | genus |
| | *Enterobacter, Klebsiella* | 384 | 149 | multiple |
| | *Enterococcus* | 354 | 161 | genus |
| | *Brucella* | 335 | 9 | genus |

**Table 1.** Calculated clades may include a single species, a single genus, or multiple genera for closely related species.

The N50/L50 metrics are automatically calculated for each genome. Acceptable values depend on genome size, and genomes that do not meet the criteria are not processed for RefSeq. For known clades that have at least 10 members, the genome size is expected to fall within 2 standard deviations of the clade mean. This standard allows for the identification of partial genomes and unusually large genomes, which may indicate a bad

Several criteria are used to evaluate the quality of a genome assembly.

genome with the metadata that can be used for further comparative analysis.


available at the time.

**2.3. Genome quality assessment**

assembly or contamination.
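The N50/L50 metrics and the genome-size criterion described in this section can be illustrated with a minimal Python sketch. The function names (`n50_l50`, `size_is_typical`) and the use of the sample standard deviation are assumptions made for illustration; the text does not specify how the standard deviation is estimated, nor the N50/L50 thresholds actually applied.

```python
from statistics import mean, stdev


def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the shortest contig in the smallest set of
    longest contigs that covers at least half of the assembly; L50 is
    the number of contigs in that set.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("contig_lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count


def size_is_typical(genome_size, clade_sizes, n_sd=2.0, min_members=10):
    """Check the genome size against the clade size distribution.

    Returns True if genome_size lies within n_sd standard deviations of
    the clade mean, False if it does not, and None if the clade has
    fewer than min_members genomes (the criterion is not applied).
    Assumption: the sample standard deviation is used.
    """
    if len(clade_sizes) < min_members:
        return None
    mu = mean(clade_sizes)
    sd = stdev(clade_sizes)
    return abs(genome_size - mu) <= n_sd * sd


# Hypothetical clade of 11 genomes between 4.9 and 5.1 Mbp.
clade = [4_900_000 + 20_000 * k for k in range(11)]
print(n50_l50([100, 200, 300, 400]))
print(size_is_typical(5_050_000, clade))  # close to the clade mean
print(size_is_typical(2_500_000, clade))  # candidate partial genome
```

A genome flagged by the size check is only a candidate for a partial or contaminated assembly; the sketch does not attempt to reproduce any further review steps.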

Genome distance is defined as an average of pairwise protein distances of universally conserved single-copy proteins as defined in [8] (Table 2).


**Table 2.** List of genomic markers used in genomic analysis. *Escherichia coli* K-12 accessions are given as an example. Each marker has a corresponding protein cluster which is used in the analysis.

#### **2.5. Genome distance**

Protein marker distances and the genomic distance are designed to be robust while remaining appropriately sensitive. The protein distance, which measures dissimilarity between markers of the same type, is designed to ignore differences in protein length and is tuned to measure dissimilarity in the internal parts of the sequences. The genomic distance then averages over the majority of marker distances, ignoring the outliers.

#### *2.5.1. Protein distances*

Consider proteins $i$ and $j$ whose best aggregated BLAST alignment has length $L_{ij}$ and aggregated score $S_{ij}$. Assume that the proteins have lengths $L_i$ and $L_j$ and self-scores $S_{ii}$ and $S_{jj}$. Define normalized scores: $s_{ij} = S_{ij} / L_{ij}$, $s_{ii} = S_{ii} / L_i$, $s_{jj} = S_{jj} / L_j$.

Then define protein distances:

$$d_{ij} = 1 - \min\left(1, \frac{s_{ij}}{\min\left(s_{ii}, s_{jj}\right)}\right) \tag{1}$$

Distance (1) is an identity-like characteristic calculated from the aggregated BLAST [17] scores (using positives based on the BLOSUM62 matrix [22]). For a full-length alignment, it reduces to $1 - S_{ij} / \min(S_{ii}, S_{jj})$. However, when the lengths differ, distance (1) avoids penalizing nonaligned ends of the proteins, taking into account only mutation events.
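As a minimal sketch of distance (1), the Python function below takes the aggregated score and length of the alignment together with the self-scores and lengths of the two proteins. The function name `protein_distance` and the numeric values are hypothetical; obtaining the aggregated BLAST scores themselves is outside the scope of this sketch.

```python
def protein_distance(S_ij, L_ij, S_ii, L_i, S_jj, L_j):
    """Distance (1) computed from aggregated alignment and self-scores.

    S_ij, L_ij: aggregated score and length of the best alignment.
    S_ii, L_i and S_jj, L_j: self-score and length of each protein.
    """
    s_ij = S_ij / L_ij  # normalized alignment score
    s_ii = S_ii / L_i   # normalized self-score of protein i
    s_jj = S_jj / L_j   # normalized self-score of protein j
    return 1.0 - min(1.0, s_ij / min(s_ii, s_jj))


# Hypothetical values: a perfect match over 80 aligned residues between
# proteins of length 100 and 200. The normalized scores are all equal,
# so the nonaligned ends do not contribute to the distance.
print(protein_distance(400, 80, 500, 100, 1000, 200))

# Same proteins, but the aligned region contains substitutions that
# lower the aggregated score.
print(protein_distance(360, 80, 500, 100, 1000, 200))
```

In the first call the unnormalized ratio $S_{ij} / \min(S_{ii}, S_{jj})$ would be 400/500, so a distance based on raw scores would be nonzero purely because of the length difference; the per-residue normalization removes that effect.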

#### *2.5.2. Genomic distances*

Suppose that genomes $i$ and $j$ have $N_{ij}^{a}$ types of markers found in both genomes, with $N_{ij}^{h}$ of them having acceptable BLAST hits.

Define the offset $\Delta_{ij} = \max\left(3, \frac{N_{ij}^{h}}{4}, 1 + N_{ij}^{a} - N_{ij}^{h}\right)$. Arrange the marker distances in ascending order: $d_{ij}^{(0)} \le d_{ij}^{(1)} \le \dots \le d_{ij}^{(N_{ij}^{h}-1)}$. Then the robust genomic distance is defined by the formula:

$$D_{ij} = \frac{\sum_{p=\Delta_{ij}}^{N_{ij}^{h}-\Delta_{ij}-1} d_{ij}^{(p)} l_{ij}^{(p)}}{\sum_{p=\Delta_{ij}}^{N_{ij}^{h}-\Delta_{ij}-1} l_{ij}^{(p)}} \,, \tag{2}$$

where $l_{ij}^{(p)}$ are the corresponding alignment lengths. The marker-protein distances are weighted by the alignment lengths $l_{ij}^{(p)}$ so that, where possible, the results are similar to those of the original method in [8], which is based on concatenation of proteins. However, the use of the offset $\Delta_{ij}$ allows outliers to be filtered out, since the averaging in (2) is performed over the $N_{ij}^{h} - 2\Delta_{ij}$ distances in the middle. For each phylum-level group, an agglomerative hierarchical clustering tree is built using the complete linkage clustering algorithm [19, 20].
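The trimmed, length-weighted average in (2) can be sketched in Python as follows. The function name `robust_genomic_distance` and the example values are hypothetical. Two choices are assumptions not stated in the text: the term $N_{ij}^{h}/4$ is rounded down so that the offset is an integer, and `None` is returned when no distances remain after trimming.

```python
def robust_genomic_distance(marker_hits, n_shared):
    """Trimmed, alignment-length-weighted mean of marker distances.

    marker_hits: list of (distance, alignment_length) pairs, one per
        marker type with an acceptable BLAST hit (N^h entries).
    n_shared: number of marker types found in both genomes (N^a).
    """
    n_hits = len(marker_hits)
    if n_hits > n_shared:
        raise ValueError("n_shared must be at least len(marker_hits)")

    # Assumption: N^h / 4 is rounded down to keep the offset an integer.
    offset = max(3, n_hits // 4, 1 + n_shared - n_hits)

    # Assumption: the distance is undefined if trimming removes everything.
    if n_hits - 2 * offset <= 0:
        return None

    # Sort by distance and keep positions offset .. n_hits - offset - 1.
    ordered = sorted(marker_hits, key=lambda hit: hit[0])
    middle = ordered[offset:n_hits - offset]

    total_length = sum(length for _, length in middle)
    weighted_sum = sum(dist * length for dist, length in middle)
    return weighted_sum / total_length


# Hypothetical example: 10 shared markers, all with acceptable hits and
# equal alignment lengths; the two largest distances are outliers.
distances = [0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.5, 0.9]
hits = [(d, 100) for d in distances]
print(robust_genomic_distance(hits, n_shared=10))
```

With 10 hits the offset is 3, so only the four middle distances (0.03 to 0.06) enter the average, and the outliers 0.5 and 0.9 have no influence on the result.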
