type is designed to ignore differences in protein lengths and tuned to measure dissimilarity in internal parts of the sequences. The subsequent genomic distance averages over the majority of marker-distances, ignoring the outliers.

*2.5.1. Protein distances*

Consider proteins $i$ and $j$, having the best aggregated BLAST alignment of length $L_{ij}$ with aggregated score $S_{ij}$. Assume that the proteins have lengths $L_i$ and $L_j$ and self-scores $S_{ii}$ and $S_{jj}$. Define normalized scores: $s_{ij} = S_{ij}/L_{ij}$, $s_{ii} = S_{ii}/L_i$, $s_{jj} = S_{jj}/L_j$. Then define protein distances:

$$d_{ij} = 1 - \min\left(1, \frac{s_{ij}}{\min(s_{ii}, s_{jj})}\right) \tag{1}$$

Distance (1) is an identity-like characteristic calculated from the aggregated BLAST [17] scores (using positives based on BLOSUM62 matrix [22]). For full-length alignment, it can be reduced to $1 - S_{ij}/\min(S_{ii}, S_{jj})$. However, when lengths are different, distance (1) avoids penalizing nonaligned ends of the proteins, taking into account only mutation events.

*2.5.2. Genomic distances*

Suppose that genomes $i$ and $j$ have $N_{ij}^{a}$ types of markers found in both genomes, with $N_{ij}^{h}$ of them having acceptable BLAST hits. Define the offset $\Delta_{ij} = \max\left(3, \frac{N_{ij}^{h}}{4}, 1 + N_{ij}^{a} - N_{ij}^{h}\right)$. Order marker distances in the ascending order: $d_{ij}(0) \le d_{ij}(1) \le \dots \le d_{ij}(N_{ij}^{h} - 1)$. Then robust genomic distance is defined by the formula:

$$D_{ij} = \frac{\sum_{p=\Delta_{ij}}^{N_{ij}^{h}-\Delta_{ij}-1} d_{ij}(p)\, l_{ij}(p)}{\sum_{p=\Delta_{ij}}^{N_{ij}^{h}-\Delta_{ij}-1} l_{ij}(p)}, \tag{2}$$

where $l_{ij}(p)$ are the corresponding alignment lengths. The marker-protein distances are weighted by alignment lengths $l_{ij}(p)$ in order to provide, where possible, results similar to the original method in [8] based on concatenation of proteins. However, the use of offset $\Delta_{ij}$ allows filtering out outliers since the averaging in (2) is performed over $N_{ij}^{h} - 2\Delta_{ij}$ distances in the middle. For each phylum level group, an agglomerative hierarchical clustering tree is built using the complete linkage clustering algorithm [19, 20].

**2.6. Genome clustering pipeline**

The pipeline for calculating genome clades consists of three major components (see Figure 1). The first is the collection of the input data from the main NCBI sequence repositories. The genomic data are dynamic: hundreds of new genomes and assembly updates are submitted to NCBI each day. We create a snapshot of all live genome assemblies and their nucleotide sequence components (chromosomes, scaffolds, and contigs) and store them in an internal relational database, UniCol. The genome data set is organized into large groups (phyla and superphyla defined by NCBI Taxonomy). The assemblies are then filtered by quality and passed to the processing script. Ribosomal protein markers are predicted in every genome to overcome problems with the genome annotations (missing and/or incorrect annotations) and to normalize the marker data set. Marker predictions are performed by aligning reference protein markers against full genome assemblies. Assemblies with at least 17 markers are passed to the next step. Genome distance is calculated as an average of pairwise protein distances of the markers shared by a pair of genomes. Finally, agglomerative hierarchical clustering trees are built within phylum-level groups. Clades at the species level are calculated using a species-aware algorithm. Superclade trees are constructed by sectioning the trees at a distance of 0.25.
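The distance step of the pipeline can be illustrated with a minimal Python sketch. It assumes that the protein distance is one minus the ratio of the length-normalized alignment score to the smaller of the two length-normalized self-scores (capped at 1), and that the genomic distance is a length-weighted average over the sorted marker distances with an equal number of entries trimmed from each end. The function names, the argument layout, and the use of integer (floor) division for the quarter term of the offset are assumptions made for illustration; they are not taken from the pipeline's code.

```python
from typing import List, Optional, Tuple


def protein_distance(S_ij: float, L_ij: float,
                     S_ii: float, L_i: float,
                     S_jj: float, L_j: float) -> float:
    """Identity-like distance from aggregated scores normalized by length."""
    s_ij = S_ij / L_ij   # score per aligned position
    s_ii = S_ii / L_i    # self-score per residue, protein i
    s_jj = S_jj / L_j    # self-score per residue, protein j
    return 1.0 - min(1.0, s_ij / min(s_ii, s_jj))


def marker_offset(n_all: int, n_hit: int) -> int:
    """Number of sorted distances trimmed from each end.

    Assumption: the quarter term is rounded down so that the offset
    can be used as an index.
    """
    return max(3, n_hit // 4, 1 + n_all - n_hit)


def robust_genomic_distance(markers: List[Tuple[float, float]],
                            n_all: int) -> Optional[float]:
    """Length-weighted mean of the middle marker distances.

    markers: (distance, alignment_length) for each marker with an
             acceptable hit; n_all: marker types found in both genomes.
    Returns None when trimming leaves nothing to average.
    """
    n_hit = len(markers)
    delta = marker_offset(n_all, n_hit)
    if n_hit <= 2 * delta:
        return None
    ordered = sorted(markers, key=lambda m: m[0])
    middle = ordered[delta:n_hit - delta]
    total_length = sum(length for _, length in middle)
    return sum(d * length for d, length in middle) / total_length
```

With 20 shared markers that all have acceptable hits, the offset in this sketch is 5, so the five smallest and five largest distances are discarded and a single aberrant marker does not change the result.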

**Figure 1.** Dataflow of ribosomal-marker-based clade (genome group) processing. Ribosomal markers (in green) are maintained outside of the main pipeline (in blue). Clades and markers are available on the NCBI FTP site: ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME\_REPORTS/CLADES/ and ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME\_REPORTS/MARKERS/
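The final clustering step can be sketched in the same spirit. The example below is a naive complete-linkage agglomeration that stops merging once the smallest inter-cluster distance exceeds a threshold. Because complete linkage never produces a merge at a lower height than an earlier one, stopping early yields the same groups as building the full tree and sectioning it at that height. The function name, the dictionary-of-pairs input, and the choice to merge at exactly the threshold are illustrative assumptions, and the species-aware step mentioned above is not modeled.

```python
from typing import Dict, FrozenSet, List


def complete_linkage_clusters(names: List[str],
                              dist: Dict[FrozenSet[str], float],
                              threshold: float = 0.25) -> List[List[str]]:
    """Group genomes by complete linkage, cutting at `threshold`.

    dist maps frozenset({a, b}) to the genomic distance between a and b.
    """
    clusters = [[name] for name in names]

    def linkage(a: List[str], b: List[str]) -> float:
        # Complete linkage: the largest pairwise distance between clusters.
        return max(dist[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # every remaining merge would lie above the cut
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy distances for four hypothetical genomes.
toy = {
    frozenset(("A", "B")): 0.05,
    frozenset(("A", "C")): 0.20,
    frozenset(("B", "C")): 0.22,
    frozenset(("A", "D")): 0.60,
    frozenset(("B", "D")): 0.55,
    frozenset(("C", "D")): 0.50,
}
groups = complete_linkage_clusters(["A", "B", "C", "D"], toy)
```

This naive search recomputes every linkage on each pass, which is fine for a toy example but would be far too slow for the number of assemblies handled per phylum; a production implementation would use an optimized hierarchical clustering routine.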
