**Dealing with the Data Deluge – New Strategies in Prokaryotic Genome Analysis**

Leonid Zaslavsky, Stacy Ciufo, Boris Fedorov, Boris Kiryutin, Igor Tolstoy and Tatiana Tatusova

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/62125

#### **Abstract**

228 Next Generation Sequencing - Advances, Applications and Challenges

Recent technological innovations have ignited an explosion in microbial genome sequencing that has fundamentally changed our understanding of microbial biology and profoundly impacted public health policy. This huge increase in DNA sequence data presents new challenges for the bioinformatics tools used for annotation, analysis, and visualization. New strategies have been designed to bring order to this genome sequence shockwave and improve the usability of the associated data. Genomes are organized in a hierarchical distance tree, with distances calculated from single-copy ribosomal protein markers. The protein distance measures dissimilarity between markers of the same type; the genomic distance is then the average over the majority of marker distances, with outliers ignored. More than 30,000 genomes from public archives have been organized in a marker distance tree, resulting in 6,438 species-level clades representing 7,597 taxonomic species. This computational infrastructure provides a foundation for prokaryotic gene and genome analysis, allowing easy access to pre-calculated genome groups at various distance levels. One of the most challenging problems in the current data deluge is presenting the relevant data at an appropriate resolution for each application, eliminating data redundancy while keeping biologically interesting variations.

**Keywords:** Genome analysis, clusters, proteins, bacteria, prokaryotes
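
The genomic distance described in the abstract, an average over the majority of per-marker distances with outliers ignored, can be illustrated with a minimal Python sketch. The abstract does not state the exact outlier rule, so the one used here is an assumption: keep the fraction of markers whose distances lie closest to the median, then average them. The function name `genomic_distance`, the parameter `keep_fraction`, its default of 0.75, and the marker names in the example are all hypothetical.

```python
from statistics import mean, median


def genomic_distance(marker_distances, keep_fraction=0.75):
    """Average the per-marker distances that lie closest to the median.

    marker_distances maps a marker name to the protein distance between
    the two genomes' copies of that marker. The trimming rule and the
    default keep_fraction are illustrative assumptions, not values taken
    from the chapter.
    """
    if not 0.5 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0.5, 1.0] to keep a majority")
    values = list(marker_distances.values())
    if not values:
        raise ValueError("the two genomes share no markers")

    center = median(values)
    n_keep = max(1, round(len(values) * keep_fraction))
    # Rank markers by how far their distance is from the median and keep
    # the closest ones; the remainder are treated as outliers.
    kept = sorted(values, key=lambda d: abs(d - center))[:n_keep]
    return mean(kept)


# Hypothetical per-marker distances for one pair of genomes; the last
# marker is an outlier (for example, a horizontally transferred gene).
example = {"marker_a": 0.10, "marker_b": 0.12, "marker_c": 0.11, "marker_d": 0.90}
print(genomic_distance(example))
```

Ranking by deviation from the median drops outliers on either side, so an unusually small marker distance is ignored as readily as an unusually large one.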
