**Meet the editor**

Dr. Mihai Mareș received his PhD in Microbiology from the "Gr. T. Popa" University of Medicine and Pharmacy in Iași, Romania (2005), and completed postgraduate training at the University VII Denis-Diderot, Pasteur Institute, Pitié-Salpêtrière Hospital, and École du Val-de-Grâce, Paris (France); Complutense University of Madrid (Spain); Instituto de Salud Global, Barcelona (Spain); Karolinska Institute, Stockholm (Sweden); and Danish Technical University, Lyngby (Denmark). His areas of interest are medical mycology, antimicrobial resistance, mycobacteria, food microbiology, biofilms and biomedical applications of plasma discharges, and cold plasma–activated water. Currently, Dr. Mareș is a professor of Microbiology and the head of the Antimicrobial Chemotherapy Laboratory at the Ion Ionescu de la Brad University, Iași (Romania). He has served as a consultant for several pharmaceutical companies during the past few years.

## Contents

**Preface**

Chapter 6 **Factors Contributing to the Emergence and Spread of Antibiotics Resistance in Salmonella Species** Kabiru Olusegun Akinyemi and Samuel Oluwasegun Ajoseh

Chapter 7 **Quinolone Resistance in Non-typhoidal Salmonella** Siriporn Kongsoi, Chie Nakajima and Yasuhiko Suzuki

Chapter 8 **Salmonella in Wastewater: Identification, Antibiotic Resistance and the Impact on the Marine Environment** Abdellah El Boulani, Rachida Mimouni, Hasna Mannas, Fatima Hamadi and Nouredine Chaouqy

**Section 3 Salmonellosis in Animals**

Chapter 9 **Dynamics of Salmonella Infection** Fathalla A. Rihan

**Section 4 Risk Factors and Control Strategies**

Chapter 10 **Interaction between Salmonella and Plants: Potential Hosts and Vectors for Human Infection** Eva Fornefeld, Jasper Schierstaedt, Sven Jechalke, Rita Grosch, Kornelia Smalla and Adam Schikora

Chapter 11 **Preharvest Salmonella Risk Contamination and the Control Strategies** Rebeca Zamora-Sanabria and Andrea Molina Alvarado

Chapter 12 **Prevalence, Risks and Antibiotic Resistance of Salmonella in Poultry Production Chain** Niki Mouttotou, Shakeel Ahmad, Zahid Kamran and Konstantinos C. Koutoulis

Chapter 13 **Effects of Environment and Socioeconomics on Salmonella Infections** Hafiz Anwar Ahmad and Luma Akil

Chapter 14 **Application of Ionizing Radiation for Control of Salmonella in Food** Małgorzata E. Szczawińska

## Preface

The genus *Salmonella* comprises an important number of bacterial species able to colonize and infect numerous animal species and humans. Although more than a hundred years have passed since its discovery, *Salmonella* remains a formidable and successful microorganism that is difficult to deal with. Whether we discuss typhoid fever or food poisoning, the public health and financial consequences are practically incalculable. The costs attributable to *Salmonella* contamination of meat, eggs, and vegetables are also very high worldwide. Antimicrobial resistance in *Salmonella* isolates is an emerging threat, and not only in humans; special measures should be directed at this global problem.

The book *Current Topics in Salmonella and Salmonellosis* contains a series of reviews on all important issues concerning these subjects. It comprises 14 chapters grouped into 4 sections that emphasize new insights into pathogenesis, bacterial detection and antibiotic resistance, infections in animals, and risk factors and control strategies. The new genomic data and the exhaustive presentation of molecular pathogenesis bring novelty to the book and can help to improve our knowledge of *Salmonella*-induced diseases.

More than 40 international specialists have contributed as coauthors to this book, resulting in an interdisciplinary view of the topic. I would like to express my gratitude and appreciation to all of them and, last but not least, to all those who assisted me in this editorial project.

> **Mihai Mareș, PhD**
> Professor of Microbiology, Department of Public Health
> Head, Laboratory of Antimicrobial Chemotherapy
> Ion Ionescu de la Brad University
> Iași, Romania

**New Insights in Pathogenesis**

**Provisional chapter**

## **Insights from Comparative Genomics of the Genus** *Salmonella*

Trudy M. Wassenaar, Se-Ran Jun, Visanu Wanchai, Preecha Patumcharoenpol, Intawat Nookaew, Katrina Schlum, Michael R. Leuze and David W. Ussery

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/67131

#### **Abstract**

Comparative genomics has become a standard approach to gain insights into the interrelationships of microorganisms. Here, we have applied various bioinformatic techniques to compare over 200 *Salmonella* genomes. First, we present a tree of the sequenced members of the *Enterobacteriaceae* family, based on a comparison of average amino acid identities. This technique was also applied to zoom in on the genomes of the genus *Salmonella*. The pan and core genomes of this genus were established and compared with experimental data available in the literature that identified essential genes. Difficulties and shortcomings of both approaches are discussed. Metabolic pathways unique to *Salmonella* were identified. Finally, we present an analysis of genes coding for small RNAs, an important part of the genetic repertoire of bacteria that is often ignored. The findings reported here are discussed and compared with the available literature.

**Keywords:** comparative genomics, *Salmonella*, core genome, small RNA, AAI tree

#### **1. Introduction**

The genus *Salmonella* belongs to the *Enterobacteriaceae*, a large family within the gamma-proteobacteria to which *E. coli* also belongs. Since its first characterization in 1884 from diseased pigs by scientists working in the group of Daniel Salmon (after whom the genus is named), *Salmonella* species have been known to cause disease, notably typhoid fever and food poisoning. Pathogenic *Salmonella* types can be found in a wide range of animal hosts and often infect humans via contaminated food; they are responsible for more than a million infections in the United States every year. Infections vary from (long-term) asymptomatic carriage and self-limiting salmonellosis to life-threatening conditions and fatal typhoid fever [1].

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Historically, many species were recognized within this genus, at first based on the clinical symptoms typical of their infections; it was soon realized that these correlated with serotype. However, in 1973, it was proposed on the basis of sequence analysis that all these *Salmonella* serotypes belonged to the same species [2]. This led, in 2005, to the designation of *Salmonella enterica* as the type species of the genus, as described by the International Committee on Systematics of Prokaryotes [3]. Only one other species is currently formally recognized within the genus: *Salmonella bongori*, which lives in cold-blooded reptiles. *S. enterica* is further divided into six subspecies, of which *S. enterica* subsp. *enterica* is clinically the most relevant. The names originally used to describe clinically distinct 'species' live on as serovars or serotypes. All *Salmonella* bacteria are non-spore-forming, chemotrophic, facultative anaerobes that survive intracellularly in their host [1].

The number of *Salmonella* genome sequences available in GenBank is constantly increasing. At the time of writing, their number was approaching five thousand, the vast majority of which were obtained from *S. enterica*. As of September 15, 2016, there were 4934 genomes of this species in GenBank, with three additional genomes from *S. bongori*. Only a small fraction of these genomes were submitted as complete sequences, without gaps and fulfilling all criteria set by GenBank for a genome to be listed as 'complete' (201 genomes at the time of writing, corresponding to 4% of the total). In this chapter, we employ whole-genome methods to compare complete *Salmonella* genomes in order to produce insights into the genomic diversity of this genus.

## **2.** *Salmonella* **comparative genome analyses**

#### **2.1. Genome-based trees**

Our first approach aimed to show the overall relatedness of all species belonging to the *Enterobacteriaceae* family, based on their (completely sequenced) genomes. For this, we collected up to ten genome sequences per species, as far as these were available, which resulted in 255 genome sequences to be compared. The comparison was based on average amino acid identity (AAI), a method that uses all annotated protein-coding genes in a given genome and produces more robust trees than methods based on direct alignments or concatenated protein sequence alignments [4]. The resulting tree is presented with collapsed branches for redundant species (**Figure 1**). The *Salmonella* genus, shown in red, is positioned in a cluster together with *Citrobacter*, with *Escherichia*/*Shigella* as the closest neighbors. These genera are thought to have been separated for tens of millions of years [5]. The close relationship between *Citrobacter* and *Salmonella* has been observed before, and it was proposed that recombination between these genera, and to a lesser extent with *Escherichia*, was frequent in the past, during a process of fragmented speciation [5].
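As a rough illustration of the idea behind an AAI comparison, the sketch below averages the percent identity of reciprocal best protein hits between two genomes and converts the result into a distance that a tree-building method could use. It is a minimal sketch under stated assumptions, not the pipeline of Ref. [4]: the input dictionaries, the reciprocal-best-hit requirement, the function names and the `100 - AAI` distance are all hypothetical choices made for illustration.

```python
from statistics import mean


def reciprocal_best_hits(hits_ab, hits_ba):
    """Yield (protein_a, protein_b, identity) for reciprocal best hits.

    hits_ab maps each protein of genome A to (best hit in B, % identity);
    hits_ba is the same in the opposite direction. Both are assumed inputs,
    for example parsed from a protein similarity search.
    """
    for prot_a, (prot_b, identity_ab) in hits_ab.items():
        back = hits_ba.get(prot_b)
        if back is not None and back[0] == prot_a:
            # Average the two directions in case they differ slightly.
            yield prot_a, prot_b, (identity_ab + back[1]) / 2


def average_amino_acid_identity(hits_ab, hits_ba):
    """Mean % identity over all reciprocal best hits of two genomes."""
    identities = [ident for _, _, ident in reciprocal_best_hits(hits_ab, hits_ba)]
    if not identities:
        raise ValueError("no reciprocal best hits between the two genomes")
    return mean(identities)


def aai_distance(hits_ab, hits_ba):
    """Illustrative distance for tree building: 100 minus AAI."""
    return 100.0 - average_amino_acid_identity(hits_ab, hits_ba)
```

Computing `aai_distance` for every pair of genomes would give a distance matrix from which a tree can be built with any standard clustering method; which method was used for **Figure 1** is not specified here.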

Next, we extracted all 201 complete genomes of the *Salmonella* genus (in May 2016) and combined these with 164 'nearly complete' genomes. The latter are good-quality draft sequences that were retrieved from GenBank by selecting for genomes of 'chromosome' quality; all consisted of one contiguous sequence without gaps. These 365 *Salmonella* genomes represent only a tiny fraction of what is available. Apart from the nearly 5000 *Salmonella* genomes available in GenBank, there are currently more than 62,000 *Salmonella enterica* genomes stored in the Sequence Read Archive. However, in principle, complete genome sequences should be of high quality and reliable in terms of annotation; therefore, we restricted the analysis to complete (or nearly complete) genomes.


**Figure 1.** Tree based on average amino acid identity (AAI) of 255 genomes from members of the *Enterobacteriaceae*. Branches were collapsed at the species level. The branch with the two *Salmonella* species is colored and some distinct genus clusters are labeled.

An AAI tree was constructed to establish the interrelationship of the 365 complete genomes, representing 33 different serovars including 36 Typhimurium and 6 Typhi genomes. The branches of the AAI tree were collapsed at serovar level. This produced a tree with 62 branches, as shown in **Figure 2**. By and large, the tree clusters the genomes according to serovar, though the separation is not absolute and some serovars end up in mixed clusters. This was to be expected, as the analysis is based on the complete annotated proteome (capturing all protein-coding sequences), whereas the phenotypic characteristics that define a serovar are determined by only a limited number of genes, namely those that produce the surface antigens detected by serotyping. Of the 36 *S. enterica* sv. Typhimurium genomes (represented on 13 branches, blue in the figure), 32 cluster together on 10 branches (together with four branches of non-specified serovars), while four are placed on three branches outside the Typhimurium cluster. A distinct cluster is also observed that contains the serovars Enteritidis, Pullorum, Gallinarum and Dublin (colored green in the figure), which together are known as 'group D *Salmonella*' [6]. The first three of these are adapted to the chicken host, but serovar Dublin mostly colonizes cattle, and other serovars frequently found in chickens are placed outside the group D cluster. It has been suggested, based on SNP analysis, that the serovars Paratyphi and Choleraesuis, both with a narrow host range (humans and pigs, respectively), are phylogenetically related [6]. Indeed, we observe that one Paratyphi genome clusters with a Choleraesuis genome, but two other Paratyphi genomes and another Choleraesuis genome are more distinct (colored red in **Figure 2**).

**Figure 2.** AAI tree of 365 *Salmonella* genomes representing 33 serovars of *S. enterica* (abbreviated as 'SE') subsp. *enterica*. Identical branches were collapsed per serotype. For explanation of the colors, see text.

#### **2.2. Essential genes based on published gene inactivation studies**


What makes a *Salmonella* a *Salmonella*? There are of course particular biochemical characteristics that can be used for identification, but can we recognize a set of genes that are always conserved and are required for a *Salmonella* to be called that? And how many of those genes would be essential for growth and survival of the bacteria? These questions are addressed in this and the next section. Here, we start with genes proposed to be essential for survival under laboratory conditions, based on experimental data.

Traditionally, targeted mutagenesis has been used to determine whether a gene from a given *Salmonella* strain is essential for infection, an approach that restricts the analysis to small numbers of genes. An alternative approach, based on previously developed techniques, was published in 2004 to identify larger numbers of essential genes by inserting conditional lethal mutations into random gene fragments in an *S*. Typhimurium strain [7]. The conditional switch used there was growth temperature, whereas others used tetracycline-dependent expression [8], although they reported findings for only four essential genes. A few years later, transposon (Tn) mutagenesis combined with high-throughput sequencing became available and was applied to *S. enterica* strains [9–12]. Typically, in this approach, mutants are screened for growth in LB broth. With a sufficiently high density of transposon insertions, genes that have not received insertions can be considered essential, as their inactivation results in mutants unable to multiply under the conditions applied. Yet another approach was followed by Thiele and coworkers, who used metabolic reconstruction (MR) to extract a list of essential genes in *S*. Typhimurium that could be possible drug targets [13].

The experimental approaches reported in the literature are not without difficulties, as their authors realized. For instance, polarity of transposon insertions in operons containing multiple genes can result in genes being scored as essential only because they are positioned downstream of an inactivated essential gene; attempts have been made to correct for this. Gene paralogs can further complicate findings, as one copy of an essential gene can be inactivated as long as a second copy remains intact. When a mutant library is cultured for several generations, some mutants that originally survived will be lost from the population because their mutations are disadvantageous though not directly lethal. Such genes are typically scored as being under strong selection, an analysis that has been performed for *S*. Typhimurium strain ATCC 14028 and *S*. Typhi strain Ty2 [11].

That experimental wet-laboratory data can be controversial is demonstrated by the fact that 26 of the 28 genes in *S*. Typhimurium strain ATCC 14028 that Knuth and coworkers reported as essential [7] could nevertheless be inactivated by site-directed mutagenesis [14].

Some research groups selected for conditions more closely resembling natural conditions of infection, for instance growth at 42°C instead of 37°C, to resemble the body temperature of mice that *S*. Typhimurium would typically encounter, or growth in the presence of bile acid ([10], work conducted with strain ATCC 14028). Exposure to low pH has also been tested [8]. Moreover, even 'essential' genes can often endure a transposon insertion without complete loss of function. If only those genes were scored as essential that were truly resistant to Tn insertions in high-throughput mutagenesis, the essential gene pool would be very small indeed: only 96 genes from *S*. Typhi strain Ty2 and 57 genes from *S*. Typhimurium strain SL3261 remained free of Tn insertions under conditions that were considered to have reached Tn saturation [12]. Thus, a small number of insertions can be permitted, even in genes considered essential for life in laboratory medium. Since the chance of receiving a Tn insertion depends on gene length, a highly variable parameter, the number of observed insertions needs to be corrected for gene length [9]. This produces an insertion index, in which the number of observed insertions is divided by gene length. In addition, a likelihood can be calculated from the ratio of the observed to the expected number of Tn insertions, to predict the chance of a gene being essential [9, 12]. This approach requires a cutoff value to bin genes as either essential or not. The problem is that the parameter used (likelihood P value, Tn-insertion index or both) increases continuously across genes. This makes the choice of cutoff inevitably arbitrary: there is no biological reason why genes bordering this cutoff would or would not be essential.
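The length correction and the binning step described above can be sketched as follows. This is a minimal illustration, not the procedure of Refs. [9, 12]: the gene names, insertion counts, and the cutoff of 0.02 are invented for the example, and the likelihood calculation is left out.

```python
def insertion_index(n_insertions, gene_length_bp):
    """Number of observed Tn insertions divided by gene length (in bp)."""
    if gene_length_bp <= 0:
        raise ValueError("gene length must be positive")
    return n_insertions / gene_length_bp


def classify_genes(genes, cutoff):
    """Bin genes by insertion index.

    genes maps a gene identifier to (n_insertions, gene_length_bp).
    Genes with an index below the (arbitrary) cutoff are called
    'candidate essential'; all others 'non-essential'.
    """
    calls = {}
    for name, (n_insertions, length) in genes.items():
        index = insertion_index(n_insertions, length)
        label = "candidate essential" if index < cutoff else "non-essential"
        calls[name] = (label, index)
    return calls


if __name__ == "__main__":
    # Hypothetical data: geneB and geneC carry the same number of insertions,
    # but end up in different bins once gene length is taken into account.
    toy_genes = {"geneA": (2, 1000), "geneB": (30, 1000), "geneC": (30, 3000)}
    for gene, (label, index) in classify_genes(toy_genes, cutoff=0.02).items():
        print(gene, label, round(index, 4))
```

Shifting the hypothetical cutoff slightly up or down moves borderline genes from one bin to the other, which is the arbitrariness discussed above.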

To illustrate the difficulty, we plotted the P values reported by Barquist and colleagues [12], who provided the most elaborate list of Tn mutants available to date (**Figure 3**). Panel A of the figure shows how the P value steadily increases across all genes of *S*. Typhimurium. Similar results are obtained for *S*. Typhi (not shown). Even for genes with very low P values, there is a continuous increase, as shown in Panel B. Note that in this panel, the log10 value is plotted for clarity, and the cutoff corresponding to a P value of <0.05 is indicated by the red line. Clearly, this cutoff is artificial, since there is no noticeable jump in the curve around this value.

**Figure 3.** The continuous increase of P values of Tn insertions. In Panel A, P values of all 4463 genes of *S*. Typhimurium are plotted. In Panel B, a selection of 2675 *S*. Typhimurium genes is shown with P values >0 but <0.1, plotted for the exponent (log10) of the P values for clarity. The red line indicates the cutoff of P < 0.05, corresponding with a log10 value of −1.3 that was used by the authors. Data after Ref. [12].

A slightly different picture emerges when the Tn-insertion index is plotted, as shown in **Figure 4**. Although the increase in this index is also continuous, the shape of the obtained curve is slightly sigmoidal at the beginning, suggesting a trend toward saturation of the index value around 0.03, before it increases again. This trend is stronger for *S*. Typhi (Panel 4A) than for *S*. Typhimurium (Panel 4B). Based on these findings, a cutoff value of 0.25 and 0.03 for the Tn index, respectively, might be appropriate for these serovars. We therefore recorded genes with a Tn index <0.25 for *S*. Typhi (n = 545 genes) and with a Tn index <0.30 for *S*. Typhimurium (n = 445), based on the data from Barquist and coworkers [12]. The Tn index of these genes is shown in Panels C and D of **Figure 4**. We further recorded the genes that Barquist and colleagues had originally selected (301 genes from *S*. Typhi and 299 from *S*. Typhimurium), a selection that included a reanalysis of the data from Langridge [9], as well as all genes previously identified as 'essential' by Knuth [7], Khatiwari [10], Canals [11] and Thiele [13], regardless of whether such genes were successfully inactivated by others. This produced an 'all-inclusive' list of 847 genes putatively essential for growth and survival, or under strong selection, in LB medium. Relatively few genes were consistently recorded as essential by all or most authors; most genes were found in two independent approaches or were single findings (results not shown).
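Combining the gene lists of several studies into an 'all-inclusive' list, and counting how many studies support each gene, amounts to a set union plus a tally. The sketch below shows one way to do this; the study labels and gene identifiers are placeholders, and it assumes that identifiers have already been reconciled across studies, which in practice is the difficult part.

```python
from collections import Counter


def study_support(essential_by_study):
    """Count, for each gene, the number of studies that list it as essential.

    essential_by_study maps a study label to an iterable of gene identifiers.
    """
    support = Counter()
    for genes in essential_by_study.values():
        support.update(set(genes))  # count each gene at most once per study
    return support


def all_inclusive(essential_by_study):
    """Union of all genes reported as essential by at least one study."""
    return set(study_support(essential_by_study))


def consistently_essential(essential_by_study, min_studies):
    """Genes reported as essential by at least min_studies studies."""
    support = study_support(essential_by_study)
    return {gene for gene, n in support.items() if n >= min_studies}
```

With real data, `all_inclusive` would correspond to the combined list described above, while `consistently_essential` with a high `min_studies` would return the much smaller set of genes on which most authors agree.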


**Figure 4.** Analysis of transposon insertion frequency for genes of *S*. Typhi (left) and *S*. Typhimurium (right), based on data published in Ref. [12]. In Panels A and B, all genes are sorted by Tn index. The bottom Panels C and D show an enlargement of the part in the red square of A and B, respectively. For more explanation, see text.

A word of caution is needed here. It turned out to be rather cumbersome to identify the genes mentioned in the original published data (mostly using the supplementary tables provided with the publications) and to compare the findings with those of others, because genes were mostly described by gene names, which are by no means suitable as unique identifiers. For instance, the large operon for LPS biosynthesis is called *waa* in *S*. Typhi but *rfb* in *S*. Typhimurium; the essential gene *mrdA* of *E. coli* carries that name in *S*. Typhi, but it is called *pbpA* in *S*. Typhimurium. The gene that is called *ribE* in both *Salmonella* genomes is essential, but it is called *ribC* in *E. coli*, while *ribE* in the latter species is called *ribH* in *Salmonella* (also essential). This makes it very risky to assume that two genes are the same if they have the same name, or different if they do not. Most reports provide a short functional description of the protein, which can assist in correct identification, but many genes have very general functional characteristics or are of unknown function. In such cases, the only way to identify which gene was meant is to use the gene location, but even that information is not always sufficient, for instance, when authors have re-annotated a genome but did not make this annotation public.
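The *rib* example shows why a gene name alone cannot serve as a key when comparing lists. A minimal sketch of a safer lookup is given below: each (organism, gene name) pair is mapped to a shared group label before any comparison is made. The group labels are hypothetical placeholders; in practice they would be replaced by stable identifiers such as locus tags or ortholog group IDs.

```python
# Mapping taken from the rib example in the text; the group labels
# ("rib_group_1", "rib_group_2") are invented for illustration only.
GENE_GROUPS = {
    ("Salmonella", "ribE"): "rib_group_1",
    ("E. coli", "ribC"): "rib_group_1",
    ("E. coli", "ribE"): "rib_group_2",
    ("Salmonella", "ribH"): "rib_group_2",
}


def same_gene(gene_a, gene_b, groups=GENE_GROUPS):
    """Return True if two (organism, gene name) pairs denote the same gene.

    The comparison uses the shared group label, never the gene name itself.
    """
    group_a = groups.get(gene_a)
    group_b = groups.get(gene_b)
    if group_a is None or group_b is None:
        raise KeyError("unknown (organism, gene name) pair")
    return group_a == group_b


if __name__ == "__main__":
    # Same name, different gene:
    print(same_gene(("Salmonella", "ribE"), ("E. coli", "ribE")))
    # Different name, same gene:
    print(same_gene(("Salmonella", "ribE"), ("E. coli", "ribC")))
```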

In conclusion, it is tedious and sometimes impossible to connect the findings from one study to those of another. Genes scored as 'essential' by one group can be inactivated by another group without consequences for viability. Moreover, most so-called essential genes endure a low number of transposon insertions without loss of viability.

#### **2.3. Conserved genes found in the core genome of** *Salmonella enterica*

The second approach to identify essential genes in *Salmonella* is based on bioinformatic analysis of published genome sequences. If a gene is essential for growth, one can expect it to be strictly conserved between genomes, so a comparison of gene conservation can identify possible candidates. This approach is not completely unambiguous either and depends on a number of choices that have to be made. For instance, one must define homologs between genomes in order to assess whether genes are conserved, which requires a defined percentage of sequence identity for genes to be combined into a gene family. In addition, how should one deal with very short open reading frames? In other words, what minimum gene length should be applied so that not too many artificial short open reading frames are included? And should one use the original gene annotations, which is a transparent procedure that is easily reproducible, or is it better to re-annotate genomes using a standardized procedure to reduce variation? The latter approach produces more robust data, as it no longer depends on variable gene calling, but it is less transparent when the re-annotations used are not made public. When core genomes are defined from a set of highly different organisms, it may be necessary to allow for genes that are missing from a small number of the analyzed genomes. However, when dealing with a single species, one could apply a strict requirement of presence in all genomes to produce a realistic core, especially if only fully sequenced genomes, re-annotated with a standardized algorithm, are included.

For this chapter, we decided to use publicly available annotations, to aim for maximum transparency, and we further illustrate the effect of different core genome definitions. The core genome was established based on the annotations of the 362 completely sequenced *Salmonella enterica* genomes that were used to construct **Figure 2**, complemented with the three *S. bongori* genomes. Protein-coding genes were binned into gene families with the program USEARCH [15] such that members of each family have at least 50% sequence identity and at least 50% alignment length of the best hit against the centroid of the family. Using a strict definition of required presence in all analyzed genomes, a so-called 100% core genome could be identified that consisted of 1061 gene families. Although this seems an impressive number, it is lower than expected, probably because of variations in the used gene annotations. Based on our experience with core-genome determination from many bacterial genera, we were expecting the core genome of *S. enterica* to be larger, as the species contains relatively closely related organisms. Thus, we relaxed the requirement to allow gene presence in 344 or 95% of the investigated genomes. This produced a core genome of 3499 gene families, a size that is comparable with the preliminary core established for thousands of sequenced *Salmonella* genomes (S-R Jun and DW Ussery, unpublished data). We also constructed the core genome for *S. bongori*, but with only three genomes available, this core is relatively large, as a core genome usually decreases with an increasing number of included genomes. For the core genome of the complete *Salmonella* genus, these two datasets were combined. The results are summarized in **Table 1**.
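The difference between a 100% and a 95% core definition can be illustrated with a minimal sketch. This is not the pipeline used for this chapter: the function names, the `presence` mapping and the toy gene families below are hypothetical, and the clustering into gene families (done here with USEARCH) is assumed to have happened already.

```python
def required_count(n_genomes, min_percent):
    """Smallest number of genomes that reaches min_percent (integer ceiling)."""
    return (min_percent * n_genomes + 99) // 100


def core_families(presence, n_genomes, min_percent=100):
    """Return the gene families found in at least min_percent of the genomes.

    presence maps a gene-family ID to the set of genome IDs in which at
    least one member of that family was annotated.
    """
    required = required_count(n_genomes, min_percent)
    return {fam for fam, genomes in presence.items() if len(genomes) >= required}


# Toy data (hypothetical): 20 genomes and three gene families.
genomes = [f"g{i}" for i in range(20)]
presence = {
    "famA": set(genomes),       # annotated in every genome
    "famB": set(genomes[:19]),  # missing from one genome, e.g. not annotated
    "famC": set(genomes[:10]),  # accessory family
}

strict_core = core_families(presence, len(genomes), 100)
relaxed_core = core_families(presence, len(genomes), 95)
print(sorted(strict_core))       # ['famA']
print(sorted(relaxed_core))      # ['famA', 'famB']
print(required_count(362, 95))   # 344
```

A single missed annotation is enough to drop `famB` from the strict core, which is the effect described above for the 100% core. The last line reproduces the threshold of 344 genomes that corresponds to 95% of the 362 *S. enterica* genomes.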

**Table 1** further lists that 11 genes from the 95% core were not annotated in the reference genome of the species typestrain *S. enterica* subsp *enterica* Typhimurium LT2. Originally, this number was much higher: There appeared to be 141 of the 3499 core genes missing in the annotated *S*. Typhimurium LT2 genome. However, when the DNA sequences of these genes were checked against the reference genome, 130 were actually present but not annotated. Thus, only 11 core genes remained that appear to be truly missing in the reference genome. This number did not change for core gene families based on *S. enterica* or the complete *Salmonella* genome (**Table 1**).


**Table 1.** Core genome analysis based on 365 *Salmonella* genome sequences.


It was further checked if core gene families in the reference genome contained multiple entries, in other words, whether those core gene families contained paralogs (multiple gene copies within the same genome). This was the case for 120 gene families. When the function of these gene copies is interchangeable, these paralogs can be considered as 'back-up' copies, possibly maintained in the genome to protect against loss of essential function; alternatively, the genome can contain paralogs to allow for a higher production of the gene product. The multiple copies of the ribosomal RNA genes would be a nice example of the latter, though they are not captured in our core genome analysis, which was restricted to protein-coding genes only. To give another example, multiple copies of ferric enterobactin (enterochelin) transporters were found. Such paralogs of essential genes can complicate the outcome of in vitro mutagenesis analyses, as discussed above. However, not all paralogous genes are duplicated because they are essential, so it is not a predictive characteristic.

The genomes used for **Table 1** were not only used to select conserved core genomes, but also to define the pan genome, containing all gene families of the *Salmonella* genus. This is visually represented in **Figure 5**. The pan genome increases in size until approximately 180 genomes have been added, at which stage it reaches a plateau and is hardly affected by addition of further *S. enterica* genomes. It increases again when *S. enterica* Infantis and especially when *S. bongori* genomes are added, as these introduce novel gene families to the pan genome. Panel B of **Figure 5** illustrates the validity of defining a 95% core, instead of applying the strict requirement of presence in 100% of all genomes. The 100% core genome steadily decreases with the cumulative addition of the genomes analyzed here (the order of the genomes is the same as for Panel A) and decreases sharply to approximately 1000 gene families after addition of the *S. bongori* genomes. In contrast, the 95% core genome is quite robust and remains more or less constant at around 3470 gene families (**Figure 5**).

**Figure 5.** Pan-core plots based on 365 *Salmonella* genomes. Panel A shows the pan genome of *Salmonella*, with *S. bongori* added last. Panel B shows the core genome of the 365 *Salmonella* genomes with 95 and 100% conservation.
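The cumulative pan and core sizes plotted in such pan-core plots can be computed along the following lines. This is a minimal sketch under stated assumptions, not the code used for **Figure 5**: `pan_core_curve` and the four toy genomes are hypothetical, and each genome is represented simply as the set of gene families it contains.

```python
def pan_core_curve(genome_families, core_percent=100):
    """Return (n, pan_size, core_size) after adding each genome in order.

    genome_families is an ordered list of sets of gene-family IDs.
    A family counts as core if it occurs in at least core_percent of
    the genomes added so far.
    """
    counts = {}  # gene family -> number of genomes containing it
    curve = []
    for n, families in enumerate(genome_families, start=1):
        for fam in set(families):
            counts[fam] = counts.get(fam, 0) + 1
        required = (core_percent * n + 99) // 100  # integer ceiling
        core_size = sum(1 for c in counts.values() if c >= required)
        curve.append((n, len(counts), core_size))
    return curve


# Toy data (hypothetical): the order of the list is the order of addition.
toy_genomes = [
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"a", "c"},            # family "b" is missing here
    {"a", "b", "e", "f"},  # a more distant genome adds new families
]

strict = pan_core_curve(toy_genomes, core_percent=100)
relaxed = pan_core_curve(toy_genomes, core_percent=75)
print(strict)   # [(1, 3, 3), (2, 4, 2), (3, 4, 1), (4, 6, 1)]
print(relaxed)  # [(1, 3, 3), (2, 4, 2), (3, 4, 1), (4, 6, 2)]
```

In this toy example the pan genome grows when the last, more distant genome is added, and the relaxed core recovers family `b` once enough genomes have been included, whereas the strict core never does.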

As was discussed in the previous section, the literature findings on essential genes are often controversial, for reasons discussed, while core genome determination is also not without caveats. Importantly, one can assume that all genes required for growth in LB medium must be conserved in all genomes and thus be part of the core, though the reverse may not be true: not all core genes will be essential for growth and survival under these laboratory conditions. Therefore, we checked which of the essential genes reported in the literature were actually present in the core genome. For this, we used the 95% core genome, though core genes missing in the original annotation of the reference genome of *S*. Typhimurium LT2 were added manually. A total of 683 core genes could be identified with reasonable confidence as putatively essential by at least one approach (results not shown). Conversely, of the 870 genes that were identified as essential by any of the methods discussed in the previous section, 694 were identified as part of the 95% core. The least reliable prediction of 'essential' genes turned out to be a low P value of Tn insertion, as this contained the highest fraction of genes that were not part of the core.

#### **2.4. How close is** *S***. Typhimurium to** *E. coli***?**


This chapter started with a comparison of all *Enterobacteriaceae*, to illustrate the close relationship between *Salmonella, Citrobacter* and *Escherichia*. But how close are *Salmonella* and *Escherichia*, in terms of conserved proteins? To address this question, the core genes of *S. enterica* Typhimurium LT2 (the type strain of the species) were compared to the core genes recently defined for *E. coli* (using the same definitions and parameters) [16], which we applied to the species type strain *E. coli* DSM 30083. As reported in **Table 1**, the 95% core genome of all *Salmonella* comprises 3470 gene families, of which 11 are missing in Typhimurium LT2. This strain thus contains 3459 core gene families, while the *E. coli* type strain contains 3100 core gene families. When these were compared, it was found that 2615 of these are shared, which corresponds to 75.6% of the *S*. Typhimurium LT2 core gene families, 84.4% of *E. coli* DSM 30083 and 66.3% of the total gene families assessed for these two species. This is illustrated in Panel A of **Figure 6**. The definition for gene families applied here is the same as for **Table 1** and **Figure 5**, but as explained above, this requires a defined cutoff for sequence similarity. The biological function of proteins is mostly defined by their functional domains, which are sometimes only a fraction of the total protein sequence. Thus, we narrowed this analysis down, to define the common core genome based on functional domains only, using Pfam domains. Since a Pfam domain is not described for all core genes, there were fewer domains captured in this comparison (2416 for *S*. Typhimurium LT2 and 2263 for *E. coli* DSM 30083). Panel B of **Figure 6** shows that there are 2142 shared protein domains, corresponding to 88.7% of the *S*. Typhimurium LT2 core proteins, 94.7% of the *E. coli* DSM 30083 core proteins, and 84.4% of the total number of functional domains captured here.
Interestingly, the fractions of shared core genes and shared functional domains are larger for the *E. coli* typestrain than for the *Salmonella enterica* typestrain. We believe this is caused by the larger diversity of the *E. coli* species, compared to *S. enterica*. As a consequence, the core genome of *E. coli* is smaller, even at 95%, which means a larger fraction of these is shared with *S. enterica*.
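The percentages reported above can be checked with a few lines of arithmetic. The sketch below assumes that "the total" refers to the union of the two sets (size of set A plus size of set B minus the shared part); this assumption reproduces the reported values. The function name `overlap_fractions` is hypothetical.

```python
def overlap_fractions(size_a, size_b, shared):
    """Return the shared part as a percentage of set A, set B and their union."""
    union = size_a + size_b - shared  # assumed meaning of "total" in the text
    return (
        100 * shared / size_a,
        100 * shared / size_b,
        100 * shared / union,
    )


# Core gene families: S. Typhimurium LT2 (3459), E. coli DSM 30083 (3100), shared (2615).
gene_families = overlap_fractions(3459, 3100, 2615)
# Pfam domains: S. Typhimurium LT2 (2416), E. coli DSM 30083 (2263), shared (2142).
pfam_domains = overlap_fractions(2416, 2263, 2142)

print([round(x, 1) for x in gene_families])  # [75.6, 84.4, 66.3]
print([round(x, 1) for x in pfam_domains])   # [88.7, 94.7, 84.4]
```

Because the *E. coli* core is the smaller of the two sets in both comparisons, the same shared count necessarily makes up a larger fraction of it.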

We further investigated the functions of the *Salmonella* core gene families in *S*. Typhimurium LT2 and found that most of them are related to cellular metabolism. The core genome of *S*. Typhimurium LT2 was mapped to the genome-scale metabolic model SMT\_v1.0 [13], which contains a total of 1271 genes and 2545 metabolic reactions. As shown in Panel C of **Figure 6**, 1012 genes from the *S*. Typhimurium LT2 core genome have a metabolic function (~80% of total genes in the model) and these account for 2358 metabolic reactions (93% of total reactions in the model). When comparing this with the *E. coli* core genome, *S*. Typhimurium LT2 has 156 unique metabolic genes, responsible for 452 metabolic reactions. The unique metabolic reactions that were identified here are mostly involved in transport systems across the inner membrane as well as the outer membrane (porins), specific transport of inorganic ions, and the recycling of lipopolysaccharide biosynthesis components. Such analyses can shed light on the biochemical and metabolic properties that *Salmonella* is specialized in, related to its intracellular lifestyle.

**Figure 6.** Comparison of *Salmonella* and *E. coli* core genes, using the type strains for both species. Panel A shows the size and overlap of the core gene families. Panel B shows the comparison using PfamA domains. Panel C summarizes how many metabolic pathways are shared in the *Salmonella* and *E. coli* cores.

#### **2.5. Conserved RNAs across 201** *Salmonella* **genomes**

So far, all analyses were based on the annotated proteomes of the *Salmonella* genomes, but genes that code for RNA as the final product should not be ignored. A genome annotation would not be complete without its ribosomal RNA genes, coding for 5S, 16S and 23S RNA, as well as the tRNA genes. *Salmonella enterica* contains 7 *rrn* operons, which is more than can be found in many bacterial species but certainly is not a maximum, as some soil bacteria can contain up to 15 copies of the rRNA genes. The number of *rrn* copies of bacterial species has been related to their capacity to change their metabolism to use available resources [17]. Although it is often assumed that these gene duplications are all identical, in fact some degree of sequence variation can be observed, even within a genome. For *Salmonella*, it was reported that the gene encoding 16S rRNA (which is typically used for taxonomic description) is only 97% conserved [18]. The gene coding for 23S rRNA is also not strictly conserved in *Salmonella*, as it contains both point mutations and indels [19].

The number of tRNA genes present in the *Salmonella* reference genome is 85, representing 47 different tRNA molecules that together cover the 40 required anticodons [20]. These numbers can vary between genomes and serovars. But these are not the only bacterial genes that are never translated into protein. In addition to essential RNA genes such as the gene coding for tmRNA (transfer-messenger RNA, required for correct protein translation), it is now recognized that bacterial genomes contain a large number of small RNA (sRNA) genes that are not always annotated. These are often involved in post-transcriptional regulation of gene expression [21]. As a final analysis, we decided to assess the conservation of these unjustly neglected RNA genes.

The bioinformatic analysis performed was based on a publication where transcription start sites were identified from 31 *Salmonella* genomes [22]. We analyzed those 113 RNA genes in the 201 completely sequenced genomes. For this analysis, we excluded the nearly completed sequences that had been included in the analyses resulting in **Figures 2** and **5**, because genome assembly is biased toward protein-coding regions, so that regions on which sRNA genes may reside are likely to be missed, unless a genome is truly completed. For comparison, eight other *Enterobacteriaceae* were included. The results are presented in a matrix heat map (**Figure 7**). Based on their sRNA content, most of the genomes neatly clustered according to their serotype, with only few exceptions. Interestingly, the genomes of strains FORC-015 and FORC-020, which are annotated as Typhimurium, are placed outside the Typhimurium cluster in **Figure 7**, and these were also placed outside the main Typhimurium cluster in the AAI tree of **Figure 2**. Thus, it can be questioned if the serotype of these two strains was correctly identified. That most of the *Salmonella* genomes are nicely clustered according to their serotype in **Figure 7** is surprising, as the nonprotein coding sRNA genes analyzed here do not have a specific role in expression of surface antigens. The correlation identified here is in line with a publication that sRNA genes can be used as targets for serotype-specific PCR detection of Typhi and Paratyphi [23]. It was recently described that some sRNA genes of *S*. Typhimurium are under regulation of Sigma 28, and there is extensive cross talk between genes of the *Salmonella* pathogenicity pathways SPI1 and SPI2 and particular sRNA genes [24]. In this context, it is surprising that the sRNA genes are so strongly conserved throughout the *Salmonella* genomes (illustrated by the dominant red in **Figure 7**), whereas the presence of SPIs widely varies across serotypes [24]. 
This suggests that sRNA genes are strongly conserved and may well belong to the collection of essential genes, though this has not yet been experimentally demonstrated. The analysis further showed that the sRNA genes are specific for the *Salmonella* genus, and bear relatively little resemblance to those of the other *Enterobacteriaceae* members included at the bottom of the figure.
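How genomes can be grouped by their sRNA content may be illustrated with a simple presence/absence comparison. The sketch below is a simplified stand-in for the clustered heat map of **Figure 7**, not the method used to produce it: it uses the Jaccard distance between sets of detected sRNA genes, and all genome labels and sRNA names are hypothetical toy data.

```python
def jaccard_distance(a, b):
    """Return 1 - |A and B| / |A or B| for two sets of detected sRNA genes."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def nearest_genome(name, profiles):
    """Return the other genome whose sRNA profile is closest to that of name."""
    others = (n for n in profiles if n != name)
    return min(others, key=lambda n: jaccard_distance(profiles[name], profiles[n]))


# Toy presence/absence profiles (hypothetical).
profiles = {
    "Typhimurium_1": {"sRNA1", "sRNA2", "sRNA3", "sRNA4"},
    "Typhimurium_2": {"sRNA1", "sRNA2", "sRNA3", "sRNA4"},
    "Typhi_1": {"sRNA1", "sRNA2", "sRNA5"},
    "Typhi_2": {"sRNA1", "sRNA2", "sRNA5", "sRNA6"},
    # Annotated as Typhimurium in this toy example, but with a Typhi-like profile.
    "Typhimurium_outlier": {"sRNA1", "sRNA2", "sRNA5", "sRNA6"},
}

print(nearest_genome("Typhimurium_1", profiles))        # Typhimurium_2
print(nearest_genome("Typhimurium_outlier", profiles))  # Typhi_2
```

A genome whose nearest neighbours carry a different serotype label, like the outlier in this toy example, is the kind of case that would prompt a re-examination of the recorded serotype.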


**Figure 7.** Conserved sRNAs across 201 *Salmonella* genomes. The tree to the left mostly clusters serotypes together, based on their sRNA genes. Two wrongly placed *S*. Typhimurium genomes are pointed out by the arrows to the right. The tree at the top identifies clusters of related sRNA genes. The eight genomes at the bottom are from other *Enterobacteriaceae*.

## **3. Conclusions**

Based on genomic average amino acid identity (AAI), *Salmonella* genomes appear as a distinct clade within the enterics, closely related to the *Citrobacter* genus. The serovars of *S. enterica* subsp. *enterica* generally cluster together when analyzed for AAI. There is a stable core set of about 3400 gene families, found in nearly all *Salmonella enterica* genomes, and these genes are on average 99% or more identical to each other across all the *Salmonella* genomes. Further, many of these genes seem to be involved in metabolic processes, and the core genes account for about 80% of the total genes of the *Salmonella* genome-scale metabolic model. Finally, we examined small RNA conservation and found the same clustering of outlier genomes (e.g., particular *S*. Typhimurium strains) that were observed in the AAI analysis.

## **Acknowledgements**

This work has been funded in part by The Arkansas Research Alliance and UAMS.

## **Author details**

Trudy M. Wassenaar¹\*, Se-Ran Jun², Visanu Wanchai², Preecha Patumcharoenpol², Intawat Nookaew², Katrina Schlum³, Michael R. Leuze³ and David W. Ussery²

\*Address all correspondence to: trudy@mmgc.eu

1 Molecular Microbiology and Genomics Consultants, Zotzenheim, Germany

2 Department of BioMedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA

3 Computing Science and Mathematics Division, Oak Ridge National Labs, Oak Ridge, Tennessee, USA

## **References**



[4] Konstantinidis KT, Tiedje JM. Towards a genome-based taxonomy for prokaryotes. J Bacteriol. 2005;**187**:6258–6264. doi:10.1128/JB.187.18.6258-6264.2005

[5] Retchless AC, Lawrence JG. Phylogenetic incongruence arising from fragmented speciation in enteric bacteria. Proc Natl Acad Sci USA. 2010;**107**:11453–11458. doi:10.1073/pnas.1001291107

[6] Foley SL, Johnson TJ, Ricke SC, Nayak R, Danzeisen J. *Salmonella* pathogenicity and host adaptation in chicken-associated serovars. Microbiol Mol Biol Rev. 2013;**77**:582–607. doi:10.1128/MMBR.00015-13

[7] Knuth K, Niesalla H, Hueck CJ, Fuchs TM. Large-scale identification of essential *Salmonella* genes by trapping lethal insertions. Mol Microbiol. 2004;**51**:1729–1744.

[8] Hidalgo AA, Trombert AN, Castro-Alonso JC, Santiviago CA, Tesser BR, Youderian P, Mora GC. Insertions of mini-Tn10 transposon T-POP in *Salmonella enterica* sv. typhi. Genetics. 2004;**167**:1069–1077. doi:10.1534/genetics.104.026682

[9] Langridge GC, Phan MD, Turner DJ, Perkins TT, Parts L, Haase J, Charles I, Maskell DJ, Peters SE, Dougan G, Wain J, Parkhill J, Turner AK. Simultaneous assay of every *Salmonella* Typhi gene using one million transposon mutants. Genome Res. 2009;**19**:2308–2316. doi:10.1101/gr.097097.109

[10] Khatiwara A, Jiang T, Sung SS, Dawoud T, Kim JN, Bhattacharya D, Kim HB, Ricke SC, Kwon YM. Genome scanning for conditionally essential genes in *Salmonella enterica* Serotype Typhimurium. Appl Environ Microbiol. 2012;**78**:3098–3107. doi:10.1128/AEM.06865-11

[11] Canals R, Xia XQ, Fronick C, Clifton SW, Ahmer BM, Andrews-Polymenis HL, Porwollik S, McClelland M. High-throughput comparison of gene fitness among related bacteria. BMC Genomics. 2012;**13**:212. doi:10.1186/1471-2164-13-212

[12] Barquist L, Langridge GC, Turner DJ, Phan MD, Turner AK, Bateman A, Parkhill J, Wain J, Gardner PP. A comparison of dense transposon insertion libraries in the *Salmonella* serovars Typhi and Typhimurium. Nucleic Acids Res. 2013;**41**:4549–4564. doi:10.1093/nar/gkt148

[13] Thiele I, Hyduke DR, Steeb B, Fankam G, Allen DK, Bazzani S, Charusanti P, Chen FC, Fleming RM, Hsiung CA, De Keersmaecker SC, Liao YC, Marchal K, Mo ML, Özdemir E, Raghunathan A, Reed JL, Shin SI, Sigurbjörnsdóttir S, Steinmann J, Sudarsan S, Swainston N, Thijs IM, Zengler K, Palsson BO, Adkins JN, Bumann D. A community effort towards a knowledge-base and mathematical model of the human pathogen *Salmonella* Typhimurium LT2. BMC Syst Biol. 2011;**5**:8. doi:10.1186/1752-0509-5-8

[14] Santiviago CA, Reynolds MM, Porwollik S, Choi SH, Long F, Andrews-Polymenis HL, McClelland M. Analysis of pools of targeted *Salmonella* deletion mutants identifies novel genes affecting fitness during competitive infection in mice. PLoS Pathog. 2009;**5**(7):e1000477. doi:10.1371/journal.ppat.1000477

[15] Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;**26**:2460–2461. doi:10.1093/bioinformatics/btq461


**Provisional chapter**

## **Computational Identification of Indispensable Virulence Proteins of** *Salmonella* **Typhi CT18**

Shrikant Pawar, Izhar Ashraf, Kondamudi Manobhai Mehata and Chandrajit Lahiri

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66489

#### **Abstract**

Typhoid infections have become an alarming concern with the increase of multidrug-resistant strains of *Salmonella* serovars. The new pathogenic Gram-negative strains are resistant to most antibiotics such as chloramphenicol, ampicillin, trimethoprim, ciprofloxacin and even co-trimoxazole and their derivatives, thereby causing numerous outbreaks in the Indian subcontinent, Southeast Asian and African countries. Conventional and modern methods of typing have been adopted to differentiate outbreak strains. However, identifying the most indispensable proteins from the complete set of proteins of the whole genome of *Salmonella* sp., comprising the *Salmonella* pathogenicity islands (SPI) responsible for virulence, has remained an ever-challenging task. We have adopted a network-based method to figure out, albeit theoretically, the most significant proteins which might be involved in the resistance to antibiotics of the *Salmonella* sp. An understanding of the above will provide insight into conditions that are encountered by this pathogen during the course of infection, which will further contribute to identifying new targets for antimicrobial agents.

**Keywords:** *Salmonella*, *Salmonella* pathogenicity island, SicA, eigenvector centrality, k-core analysis

## **1. Introduction**

Food-borne infections are quite common and widely distributed worldwide, though there can be several sources of such diseases. Human Salmonellosis or typhoid, causing systemic infection of the human gastrointestinal tract and diarrhoea, is one such common disease caused by *Salmonella enterica* serovar Typhi. With a prevalence of probably 10 million cases and hundreds of thousands of deaths every year [1], the disease has turned out to be a major cause for concern with the emergence of multidrug-resistant (MDR) *Salmonella* strains [2]. Such new strains are resistant to chloramphenicol, ampicillin, trimethoprim, ciprofloxacin and even co-trimoxazole and their derivatives, thereby causing numerous outbreaks in the Indian subcontinent, Southeast Asian and African countries [3, 4]. Thus, newer drugs like cephalosporins and quinolone derivatives need to be explored to combat the situation [5].

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

To deal with the threats of multidrug resistance, several health intervention strategies have been undertaken. However, the prospects for finding new antibiotics for several classes of Gram-negative pathogens are especially poor, because their outer membrane blocks the entry of some existing antibiotics and their efflux pumps expel many of the remainder [6]. It has become evident that the conventional strategies for dealing with such pathogens are less effective, or at times completely ineffective, against the defences these pathogens mount. In such cases, non-conventional approaches to finding drug targets for these pathogens can help to resolve the complexities posed. Proteins, being the functional units of the cell of any living organism, have always been good targets for combating diseases. Diseases, on the other hand, serve as interesting examples of complex protein interactions among several other heterogeneous entities of and between organisms. However, understanding the complexity of such interacting protein partners, especially with respect to the combat against the pathogens, has always been elusive. Thus, analyses of the mosaic mesh or network of interacting proteins, commonly known as protein interaction networks (PINs), can provide sufficient insight to reveal the indispensable virulent proteins for valuable drug targets [7].

Analyses of a PIN, to highlight important and/or indispensable proteins, can be as simple as centrality measurements with respect to the biological scenario. These can start by determining the number of interacting partners of a particular protein to identify its *degree centrality* (DC), which correlates with its biological importance. Thus, high-degree proteins (or hubs) are known to correspond to proteins that are essential [8]. As a protein can be affected locally while interacting with its other partners in the global network, other centrality measures are also given importance based on their relevance. Thus, we have discussed the importance of measures like *closeness centrality* (CC), *betweenness centrality* (BC) and *eigenvector centrality* (EC) [8] for the PIN comprising the *Salmonella* pathogenicity islands (SPI) harbouring the specialized virulent proteins characterized by the type III secretion system (T3SS) among others. To date, 17 such discrete sets have been reported for *S*. Typhi [9] along with the five SPI (1 to 5) characterized experimentally [10], among which SicA has been identified as the indispensable one in the phylogenetically closest neighbour, *S. enterica* serovar Typhimurium strain LT2 [11].
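To make these centrality measures concrete, the following minimal sketch computes degree, closeness and eigenvector centrality for a small undirected toy network. It is an illustration only, not the software or data used in this chapter: the protein labels are hypothetical, the network is assumed to be connected, and betweenness centrality is left out to keep the example short.

```python
from collections import deque


def degree_centrality(adj):
    """Fraction of the other nodes that each node interacts with directly."""
    n = len(adj)
    return {v: len(nb) / (n - 1) for v, nb in adj.items()}


def closeness_centrality(adj):
    """Reachable nodes divided by the sum of shortest-path distances to them."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


def eigenvector_centrality(adj, iterations=200):
    """Power iteration on A + I, scaled so that the top node scores 1.0.

    Adding the identity keeps the eigenvectors of A but avoids the
    oscillation that plain power iteration shows on bipartite graphs.
    """
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        new = {v: score[v] + sum(score[w] for w in adj[v]) for v in adj}
        top = max(new.values())
        score = {v: s / top for v, s in new.items()}
    return score


# Toy undirected network (hypothetical): hub H with partners P1-P3, and P4 bound to P3.
edges = [("H", "P1"), ("H", "P2"), ("H", "P3"), ("P3", "P4")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

dc = degree_centrality(adj)
cc = closeness_centrality(adj)
ec = eigenvector_centrality(adj)
print(dc["H"], cc["H"])   # 0.75 0.8
print(max(ec, key=ec.get))  # H
```

In this toy network the hub scores highest on all three measures. In a real interactome the measures can disagree, which is one reason to consider several of them side by side.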

However, knowledge of the most indispensable virulence proteins extracted only from the stipulated sets of SPI proteins could be insufficient. We have therefore carried out further analyses of the whole genome of *S*. Typhi CT18, decomposing the whole-genome protein interactome to a core of highly interacting proteins through the k-core analysis approach [12]. We have also performed cartographic analyses to identify the functional modules in the network [13] and predicted the indispensability of certain sets of proteins that share functional modules and are empirically important as drug targets.

## **2. Approach**



#### **2.1. Dataset collection**

Proteins of the 17 *Salmonella* pathogenicity islands (SPIs) were collected from an in silico study of SPIs in *S. enterica* serovar Typhi strain CT18 [9]. The locus tags of all the SPI proteins of *S*. Typhi CT18 were submitted as queries to the STRING 10.0 biological meta-database [14] to retrieve all the possible interactions of each protein (date and time of access: Jul 28 2016 13:07:15). The detailed protein links file under the accession number 220341 in STRING was used to collect all the interactions of the whole-genome proteins of *S*. Typhi.

The numbers of proteins from the genomic islands SPI-1 to -13 and -15 to -18 were 54, 43, 8, 7, 10, 55, 144, 12, 4, 23, 16, 4, 14, 9, 7, 2 and 97, respectively, with all the SPIs combined amounting to a total of 502. The numbers of protein interactions obtained from STRING v10 were 334, 339, 3, 21, 9, 192, 1193, 12, 6, 69, 19, 1, 19, 5, 3, 1 and 343, respectively, for the 17 SPI loci mentioned above, and 2570 for all of these combined. The whole genome of *S*. Typhi had 1041274 interactions among 4529 unique proteins.

#### **2.2. Interactome construction**

All individual protein interaction data, with the medium confidence values obtained by default from STRING 10.0, were imported into Cytoscape version 3.3.0 [15] to build the interactomes of SPI-1 to -13 and -15 to -18 individually, and of all these 17 SPIs collectively (AS). The interaction information for all the proteins of the *S*. Typhi genome, weighted by interaction strength as per STRING, was imported into Gephi 0.9.1 [16] to construct and visualize the interactome of the whole genome. An interactome of proteins can be perceived as a protein interaction network (PIN) and represented as an undirected graph *G* = (*V, E*) consisting of a finite set *V* of vertices (or nodes) and a set *E* of edges. An edge *e* = (*u, v*) connects two vertices (nodes) *u* and *v*. Each protein in the above PIN is represented as a vertex/node. The number of connections (interactions, associations or links) that a node has with other nodes is its degree *d*(*v*) [17].
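The construction described above can be illustrated with a minimal Python sketch. It is not the Cytoscape/Gephi workflow used in the study: it only shows how a whitespace-separated links table can be thresholded and turned into an undirected graph with node degrees. The sample rows and protein identifiers are hypothetical, and the sketch assumes that the combined score is the last column and that the medium-confidence cutoff corresponds to a score of 400 on a 0 to 1000 scale.

```python
from collections import defaultdict

# Hypothetical excerpt laid out like a links table: two protein identifiers
# followed by a combined score (assumed to be scaled 0-1000, last column).
SAMPLE_LINKS = """\
protein1 protein2 combined_score
220341.P_A 220341.P_B 912
220341.P_A 220341.P_C 455
220341.P_B 220341.P_C 210
220341.P_C 220341.P_D 700
"""

MEDIUM_CONFIDENCE = 400  # assumption: "medium confidence" cutoff on the 0-1000 scale


def build_pin(text, min_score=MEDIUM_CONFIDENCE):
    """Return an undirected weighted graph as {node: {neighbour: score}}."""
    adj = defaultdict(dict)
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        u, v, score = fields[0], fields[1], int(fields[-1])
        if score < min_score or u == v:
            continue  # drop low-confidence edges and self-loops
        adj[u][v] = score  # store both directions: the graph is undirected
        adj[v][u] = score
    return adj


def degree(adj, node):
    """Degree d(v): the number of distinct interaction partners of a node."""
    return len(adj[node])


pin = build_pin(SAMPLE_LINKS)
for node in sorted(pin):
    print(node, degree(pin, node))
```

Storing each edge in both directions keeps the graph undirected, so the degree of a node is simply the size of its neighbour dictionary.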

#### **2.3. Network analyses**

#### *2.3.1. SPI-PIN*

All the SPI-PIN interactomes were viewed in Cytoscape version 3.3.0 as graphs of the aforementioned interconnected proteins. The networks were subsequently analysed with the Cytoscape-integrated Java plugin CytoNCA [18] to compute the network centrality parameters, namely EC, DC, CC and BC. The combined scores from the different parameters considered in STRING were taken as edge weights for computing the CytoNCA scores. The top 20 proteins for each centrality measure were used to draw Venn diagrams and find the proteins common to all measures.
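As a rough illustration of this step, the sketch below computes the four centralities on a toy graph in plain Python and intersects the top-ranked sets, which is the information a Venn diagram conveys. It is not CytoNCA: the graph is unweighted (the study used STRING combined scores as edge weights), the toy edge list is hypothetical, and eigenvector centrality is approximated by a shifted power iteration chosen here for simplicity.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def degree_centrality(adj):
    return {v: len(adj[v]) for v in adj}


def closeness_centrality(adj):
    scores = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        total = sum(dist.values())
        scores[v] = (len(dist) - 1) / total if total else 0.0
    return scores


def betweenness_centrality(adj):
    """Brandes' algorithm for unweighted, undirected graphs (unnormalized)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each pair was counted twice


def eigenvector_centrality(adj, iterations=200):
    """Power iteration on (A + I); the shift avoids oscillation on bipartite graphs."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        nxt = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        norm = math.sqrt(sum(val * val for val in nxt.values()))
        x = {v: val / norm for v, val in nxt.items()}
    return x


def top_k(scores, k):
    """Names of the k highest-scoring nodes (ties broken alphabetically)."""
    return set(sorted(scores, key=lambda v: (-scores[v], v))[:k])


# Hypothetical toy network; the study ranked the top 20 proteins per measure.
EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
adj = {}
for u, v in EDGES:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

measures = {
    "DC": degree_centrality(adj),
    "CC": closeness_centrality(adj),
    "BC": betweenness_centrality(adj),
    "EC": eigenvector_centrality(adj),
}
common = set.intersection(*(top_k(s, 2) for s in measures.values()))
print(common)
```

A protein that survives the intersection is ranked highly by every measure, which corresponds to the central region of a four-set Venn diagram.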

#### *2.3.2. WhoG-PIN*

As a few (21) nodes of the whole genome were isolated from the major part of the network, they were considered to have little impact on the overall topology and were ignored. Further analyses were based on the large connected component (LCC) of the network, comprising 4508 protein partners with 1041182 interactions. The analytical study was done using MATLAB version 7.11, a programming language developed by MathWorks [19].
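Extracting the large connected component can be sketched as follows. This is an illustrative Python version based on breadth-first search, not the MATLAB code used in the study, and the toy adjacency map is hypothetical.

```python
from collections import deque


def connected_components(adj):
    """Return the connected components of an undirected graph as a list of node sets."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components


def largest_component(adj):
    """Subgraph induced by the component with the most nodes."""
    lcc = max(connected_components(adj), key=len)
    return {v: {w for w in adj[v] if w in lcc} for v in lcc}


# Hypothetical toy network: one main component, one detached pair, one singleton.
toy = {
    "p1": {"p2", "p3"},
    "p2": {"p1", "p3"},
    "p3": {"p1", "p2", "p4"},
    "p4": {"p3"},
    "p5": {"p6"},
    "p6": {"p5"},
    "p7": set(),
}
lcc = largest_component(toy)
print(sorted(lcc))
```

Nodes outside the returned subgraph play the role of the isolated proteins that were set aside before the topological analyses.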

For a primary understanding of the network, the distribution of the network degree (k) was plotted as a complementary cumulative distribution function (CCDF). To extract significant information from the topology of the large and complex whole genome protein interaction network (WhoG-PIN), the role of each protein was derived from the cartographic representation of its within-module degree z-score versus its participation coefficient, following the methodology described by Guimerà et al. [20]. The participation of each protein reflects its position within its own module and with respect to other modules, where the modules were calculated with the Rosvall method [21]. To obtain the core group of very specific proteins that might play a variety of roles in the whole-genome context, a k-core analysis was performed, following network decomposition (pruning) techniques that produce a sequence of subgraphs of gradually increasing cohesion [12].
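The two cartographic quantities can be sketched in Python as below. The sketch assumes the usual definitions, z = (κ − mean κ) / σ over the nodes of the same module, where κ is the number of links a node has inside its own module, and P = 1 − Σ (κ_s / k)² summed over modules s. The module assignment is taken as given (the study obtained it with the Rosvall method), the toy network is hypothetical, and using the population standard deviation is an assumption of this sketch.

```python
from statistics import mean, pstdev


def cartography_roles(adj, module_of):
    """Return (z, P): within-module degree z-score and participation coefficient."""
    # links[v][m] = number of neighbours of v that belong to module m
    links = {v: {} for v in adj}
    for v in adj:
        for w in adj[v]:
            m = module_of[w]
            links[v][m] = links[v].get(m, 0) + 1

    # Within-module degree of every node, grouped by module
    within = {v: links[v].get(module_of[v], 0) for v in adj}
    by_module = {}
    for v in adj:
        by_module.setdefault(module_of[v], []).append(within[v])

    z, p = {}, {}
    for v in adj:
        values = by_module[module_of[v]]
        sd = pstdev(values)  # assumption: population standard deviation
        z[v] = (within[v] - mean(values)) / sd if sd > 0 else 0.0
        k = len(adj[v])
        p[v] = 1.0 - sum((c / k) ** 2 for c in links[v].values()) if k else 0.0
    return z, p


# Hypothetical toy network: module 0 is a star centred on "a", module 1 is the pair e-f,
# and the edge d-e connects the two modules.
EDGES = [("a", "b"), ("a", "c"), ("a", "d"), ("e", "f"), ("d", "e")]
adj = {}
for u, v in EDGES:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
module_of = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1, "f": 1}

z, p = cartography_roles(adj, module_of)
print(round(z["a"], 3), p["d"])
```

In this toy case the hub "a" has a high z-score but zero participation, whereas "d" splits its links evenly between the two modules and acts as a connector.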

## **3. Features of the 17 SPIs**

The virulence proteins of *Salmonella* are spread across the 17 *Salmonella* pathogenicity islands (SPIs) in *S*. Typhi, as implied by Ong et al. [9]. Among these, five have been well characterized, with SicA identified computationally as the most indispensable protein by Lahiri et al. [11]. A closer look at these SPI proteins reveals that SPI-1 and -2 encode the proteins of the type III secretion systems (T3SSs), while SPI-4 encodes those of a type I secretion system (T1SS) associated with a giant non-fimbrial adhesin, which is co-regulated with the invasion genes encoded by SPI-1 [22]. The *sit* gene cluster proteins of the SPI-1 T3SS, encoding an iron uptake system, are involved in the invasion of non-phagocytic eukaryotic host cells, which is mediated by the delivery of effectors that directly engage host cell signalling pathways [10]. For the systemic phase of infection, proteins of the SPI-2 cluster are essential for survival and replication in eukaryotic host cells [23], aided by the high-affinity magnesium uptake system encoded by *mgtCB*, harboured by SPI-3 [24]. The effector proteins of enteropathogenesis are harboured by SPI-5; they are induced by distinct regulatory cues and targeted to different T3SSs, namely SopB, secreted by the SPI-1 T3SS, and PipB, translocated by the SPI-2 T3SS to the *Salmonella*-containing vacuole and *Salmonella*-induced filaments.


The 59 kb SPI-6 consists of a type VI secretion system (T6SS), the *safABCD* fimbrial gene cluster, the invasin *pagN*, two pseudogenes that are transposase remnants (*STY0343* and *STY0344*), the fimbrial operon *tcfABCD* and the genes *tinR* and *tioA* [25–29]. The largest SPI identified to date is SPI-7, with a size of 134 kb [25, 30, 31] and 150 genes inserted between duplicated *pheU* tRNA sequences [30, 32]; it contains the Vi capsule biosynthesis genes [33], a type IVB pilus operon [34] and the *SopE* prophage (*ST44*) [35]. SPI-9 is a 16 kb locus containing three genes encoding a T1SS and one encoding a large protein [36]. SPI-10 is an island found next to the *leuX* tRNA gene at centisome 93. It is a 33 kb fragment [25] carrying a full P4-related prophage, termed *ST46* [37–39]. ST46 harbours the *prpZ* cluster as cargo genes encoding eukaryotic-type Ser/Thr protein kinases and phosphatases involved in *S*. Typhi survival in macrophages [40]. SPI-11 is a 10 kb fragment in *S*. Typhi and includes the PhoP-activated genes *pagD* and *pagC*, involved in intramacrophage survival [41, 42]. The 6.3 kb SPI-12 contains the effector *SspH2* [43] along with three ORFs that are pseudogenes (*STY2466a*, *STY2468* and *STY2469*). SPI-13 was initially identified in serovar Gallinarum [44]. In *S*. Typhi, it is a 25 kb gene cluster found next to the *pheV* tRNA gene at centisome 67. An 8 kb portion of this island corresponds to SPI-8, whose virulence function is unknown; it harbours two bacteriocin immunity proteins (*STY3281* and *STY3283*) and four pseudogenes [25]. SPI-14 is absent in *S*. Typhi [36, 44]. SPI-15 in *S*. Typhi is a 6.5 kb island of five ORFs encoding hypothetical proteins [44]. SPI-16 is a 4.5 kb fragment inserted next to an *argU* tRNA site and encodes five or seven open reading frames (ORFs), four of which are pseudogenes; the three remaining ORFs show a high level of identity with P22 phage genes involved in seroconversion [45]. SPI-17 is a 5 kb island encoding six ORFs, inserted next to an *argW* tRNA site [45]. SPI-18 was recently identified in *S*. Typhi as a 2.3 kb fragment harbouring only two ORFs, *STY1498* (*clyA*) and *STY1499* [46], of which the former encodes a 34 kDa pore-forming secreted cytolysin [46, 47].

## **4. The individual and the combined SPI-PINs**

To focus on the most indispensable proteins of a highly complex virulence phenotype such as that of *Salmonella*, an integrated picture comprising all the SPIs and their connected associated proteins must be taken into account. Thus, with the ultimate goal of identifying indispensable virulence proteins as potential candidates for therapeutic targets, we have constructed the PINs or interactomes of the 17 individual SPIs mentioned above, along with a combined network of all of these SPI-PINs (AS). These were then analysed to identify the most important proteins among those with the highest numbers of interacting partners. This was done using four important concepts of centrality applied to biological networks, namely eigenvector centrality (EC), degree centrality (DC), closeness centrality (CC) and betweenness centrality (BC) [48–50].

Amongst the four centrality measures mentioned above, DC is the most basic, as it captures the involvement of a protein in a large number of interactions in the network. However, in the biological scenario of *Salmonella* infection, whose primary stages are attachment and invasion, the interactions of those proteins may not occur in the sequential order needed to carry out a particular function, and DC analysis does not reflect this. In such cases, CC can be a good measure, as it reveals the close proximity of proteins expected to communicate sequentially with the other network proteins essential for a particular function. Furthermore, one-to-many simultaneous interactions of a protein, rendering different functions, are to be expected given the complexity of a biological phenotype like virulence. A protein with a high proportion of interactions lying 'in between', and thereby connecting many other proteins in the network, is revealed through BC. Such a protein could be quite important, although BC does not capture whether it connects other important proteins in the network. EC captures this last concept and reflects the indispensable protein connecting other important proteins. A comparative picture of the top 20 rank holders for each parameter, in descending order, has been consolidated in tabular form (**Table 1**). The top rankers under each measure are the proteins indicated to be important.

Three clear trends were observed across the topmost rankers of the SPI-PINs for the measures DC, BC, CC and EC. In most cases, all four measures agree unanimously on the top-ranking protein, showing its utmost importance, approaching indispensability. The SPI-PINs in this category are -1, -3, -4, -5, -7, -8, -9, -11 to -13 and -15 to -17. In the other categories, only three or two of the centrality measures agree on the top-ranking protein. SPI-2, -18 and all SPIs combined (AS-PIN) have BC differing in the top-ranking position, whereas SPI-6 and -10 show a split of DC and EC against CC and BC for the top-ranking positions. The common top-ranking proteins across these 17 SPIs and the AS are shown with Venn diagrams in **Figure 1**.

For SPI-1, the protein HilA is ranked highest. HilA is the central regulator in SPI-1; it activates the *sip* operon, which encodes secreted proteins, as well as the *inv/spa* and *prg* operons, which encode components of the secretion apparatus [51, 52]. SPI-2 to -4 have the secretion apparatus inner membrane proteins SsaG, FidL and STY4452 as the top rankers, respectively. Noteworthy among the other top rankers are the inositol phosphate phosphatase SopB of SPI-5; the atypical fimbria chaperone protein SafB and the ImpA-related N-family protein STY0286 of SPI-6; the pilin protein PilL of SPI-7; the bacteriocin immunity protein STY3281 of SPI-8; the large repetitive protein with six Bacterial\_Ig-like domains, t2643, of SPI-9; the bacteriophage gene regulatory protein STY4826 of SPI-10; the cytolethal distending toxin protein CdtB of SPI-11; the uronate isomerase UxaC of SPI-13; and the sensory histidine kinase protein BarA of SPI-18, which has a role in motility and virulence.

The above analyses of the individual SPI interactomes give an idea of the importance of these proteins within their individual SPIs and, finally, across all SPIs. However, for a drug to be effective, the indispensability of these proteins needs to be addressed. Thus, a broader picture with respect to the whole-genome proteins of *S*. Typhi is delineated next to address this concern.




Computational Identification of Indispensable Virulence Proteins of *Salmonella* Typhi CT18 http://dx.doi.org/10.5772/66489

**Figure 1.** Venn diagram representation for the top rankers of DC, CC, BC and EC parametric analyses of 17 SPI-PINs and AS-PIN.

## **5. Feature of the WhoG-PIN**

| **SPI** | **Degree** | **Betweenness** | **Closeness** | **Eigenvector** |
|---|---|---|---|---|
| 9 | t2643, STY2876, STY2877, STY2878 | t2643, STY2876, STY2877, STY2878 | t2643, STY2876, STY2877, STY2878 | t2643, STY2876, STY2877, STY2878 |
| 10 | STY4826, STY4832, STY4830, STY4822, STY4852, STY4821, STY4849, t4521, STY4834, STY4828, STY4833, STY4829, t2655, STY4851, STY4825, STY4827, STY4823, sefC, sefB | STY4832, sefC, STY4826, STY4830, STY4843, STY4822, STY4849, STY4852, sefB, STY4821, t4521, STY4834, STY4828, STY4851, STY4833, STY4829, t2655, STY4825, STY4827, STY4823 | STY4832, STY4826, STY4830, STY4822, STY4821, STY4849, STY4834, STY4828, STY4852, t4521, sefC, STY4833, STY4829, t2655, sefB, STY4851, STY4825, STY4827, STY4823, STY4850 | STY4826, STY4830, STY4832, STY4822, STY4852, STY4821, STY4849, STY4834, t4521, STY4828, STY4851, STY4825, STY4827, STY4823, STY4833, STY4829, t2655, STY4850, sefC, sefB |
| 11 | cdtB, pagC, envE, STY1879, STY1880, pagD, STY1889, STY1890, STY1891, cspH, msgA, STY1887 | cdtB, pagC, envE, STY1879, STY1880, pagD, STY1889, STY1890, STY1891, cspH, msgA, STY1887 | cdtB, pagC, envE, STY1879, STY1880, pagD, STY1889, STY1890, STY1891, cspH, msgA, STY1887 | cdtB, pagC, envE, STY1879, STY1880, pagD, STY1889, STY1890, STY1891, cspH, msgA, STY1887 |
| 12 | sspH2, STY2468 | sspH2, STY2468 | sspH2, STY2468 | sspH2, STY2468 |
| 13 | uxaC, ordL, STY3296, STY3294, STY3295, STY3298, STY3293, uxuA, uxuB, exuT, STY3302, STY3303 | uxaC, ordL, STY3296, STY3294, STY3295, STY3298, STY3293, uxuA, uxuB, exuT, STY3302, STY3303 | uxaC, ordL, STY3296, STY3294, STY3295, STY3298, STY3293, uxuA, uxuB, exuT, STY3302, STY3303 | uxaC, ordL, STY3296, STY3294, STY3295, STY3298, STY3293, uxuA, uxuB, exuT, STY3302, STY3303 |
| 15 | STY0605, gtrB, gtrA, STY3188, STY3189, STY3192, STY3193 | STY0605, gtrB, gtrA, STY3188, STY3189, STY3192, STY3193 | STY0605, gtrB, gtrA, STY3188, STY3189, STY3192, STY3193 | STY0605, gtrB, gtrA, STY3188, STY3189, STY3192, STY3193 |
| 16 | STY0605, gtrB, gtrA | STY0605, gtrB, gtrA | STY0605, gtrB, gtrA | STY0605, gtrB, gtrA |
| 17 | gtrA2, STY2629 | gtrA2, STY2629 | gtrA2, STY2629 | gtrA2, STY2629 |
| 18 | barA, cpxR, csrA, flag, flhA, flhB, fliA, fliF, fliH, fliJ, fliP, fliQ, fliR, fliZ, phoQ, rcsB, rcsC, rpoS, STY1297, yojN | acrR, baeR, barA, clpP, csrA, dnaK, flag, fliA, hns, mgtA, mntH, ompF, phoQ, rcsB, rcsC, rpoN, rpoS, soxS, STY1297, STY1678 | barA, clpP, cpxR, csrA, dnaK, flag, fliA, groL, hns, ompR, phoB, phoQ, rcsB, rcsC, rpoN, rpoS, sirA, STY1297, STY1678, yojN | barA, flag, flhA, flhB, flhD, fliA, fliF, fliH, fliI, fliJ, fliO, fliP, fliQ, fliR, fliZ, rcsB, rcsC, rpoS, STY1297, yojN |
| All SPI | pilL, STY4521, STY4523, STY4526, STY4528, STY4530, STY4534, STY4562, STY4564, STY4569, STY4571, STY4572, STY4573, STY4575, STY4576, STY4577, STY4579, STY4665, STY4666, t4268 | barA, pilL, pilV, rpoS, sicA, STY4521, STY4523, STY4526, STY4561, STY4586, STY4592, STY4618, STY4622, STY4644, STY4645, STY4658, STY4664, STY4666, t4317, tviD | pilL, STY4521, STY4523, STY4528, STY4530, STY4534, STY4561, STY4562, STY4564, STY4569, STY4571, STY4572, STY4573, STY4575, STY4576, STY4577, STY4586, STY4665, STY4666, t4268 | pilL, STY4521, STY4523, STY4528, STY4558, STY4559, STY4562, STY4563, STY4564, STY4568, STY4569, STY4571, STY4572, STY4573, STY4575, STY4576, STY4577, STY4579, STY4665, t4268 |

**Table 1.** Details of the 17 groups of SPI proteins involved in the network.

The WhoG-PIN, built from the empirical and theoretical results on physical and functional interactions among proteins laid down in STRING, could be a random network like that proposed by Erdős and Rényi [53] or a small-world network of the type proposed by Watts and Strogatz [54]. The idea was to see whether the connectivity distribution *P*(k), the probability that a node in the network is connected to k other nodes, decays exponentially for large values of k. It was observed that the WhoG-PIN roughly follows a power law and is free of a characteristic scale [55], with a tailed degree distribution (**Figure 2**).

**Figure 2.** (a) Protein-protein interaction network of the whole genome of *Salmonella* Typhi CT18 with inset (b) showing degree distribution of the proteins from the large connected component.
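A degree distribution of this kind is commonly inspected through its complementary cumulative distribution function. The short Python sketch below computes an empirical CCDF from a list of node degrees; it uses the convention P(K ≥ k), which is an assumption of this sketch, and the degree list is a hypothetical toy example rather than data from the WhoG-PIN.

```python
from collections import Counter


def degree_ccdf(degrees):
    """Empirical CCDF: for each observed degree k, the fraction of nodes with degree >= k."""
    n = len(degrees)
    counts = Counter(degrees)
    ccdf = {}
    remaining = n  # number of nodes with degree >= current k
    for k in sorted(counts):
        ccdf[k] = remaining / n
        remaining -= counts[k]
    return ccdf


# Hypothetical degrees of four nodes
ccdf = degree_ccdf([1, 1, 2, 3])
print(ccdf)
```

Plotted on log-log axes, a CCDF that falls roughly along a straight line is consistent with a power-law tail, whereas an exponentially decaying distribution bends downwards sharply.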

## **6. Decomposition of WhoG-PIN**

To identify the indispensable proteins among the barrage of proteins involved in the individual SPI-PINs and the AS, we have performed a k-core analysis for them. A k-core is a maximal subgraph in which every node has a degree of at least k within that subgraph. The nodes that are part of the k-core but not of the (k+1)-core form the k-shell. This classifies the nodes (proteins, in our study) according to the variety of their interacting partners. Proteins that belong to the outer shells have lower k values and thus a limited number of interacting partner proteins. Proteins that belong to the inner k-cores/shells are specific ones that interact highly with each other and can thus be considered the most important. Any further pruning of the innermost core removes the network altogether, which is what makes it the innermost core.
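The pruning idea behind the k-core decomposition can be sketched in Python as follows. This is a simple illustrative peeling procedure, not the implementation used in the study: it repeatedly removes a node of minimum remaining degree and records the largest minimum degree seen so far as that node's core number. The toy network is hypothetical.

```python
from collections import Counter


def core_numbers(adj):
    """Core number of each node: the largest k such that the node belongs to the k-core."""
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.get)  # peel a node of minimum remaining degree
        k = max(k, degree[v])               # the core level never decreases
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


# Hypothetical toy network: a triangle (a, b, c) with a pendant node d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
core = core_numbers(toy)
shell_sizes = Counter(core.values())  # k-shell size = number of nodes with core number k
print(core, dict(shell_sizes))
```

The nodes with the highest core number form the innermost core, and the counter of core numbers gives the k-shell size distribution.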

After decomposition of the WhoG-PIN, we obtained the inner core member proteins, which are highly robust, central and thus highly interactive in nature [56]. We arrived at the 154th core, comprising 2180 proteins (**Figure 3**; data not shown). The idea was then to locate the rank-holder proteins of the AS-PIN obtained through the EC, DC, CC or BC measures. Interestingly, the top ranker across the EC, DC and CC measures, PilL, belongs to the 111th core and not to the 154th core. In contrast, the top-ranking BC protein, BarA, was in the 154th core, with the closely ranked PilV in the 150th core. The only other protein among the unanimous top rankers of the AS-PIN, STY4521, had a k-core position of 145. Strikingly, two more BC top rankers were also in the 154th innermost core along with BarA: the RNA polymerase sigma factor RpoS and the chaperone protein SicA. Comparing the top-ranking proteins of EC and BC for the AS-PIN, proteins of the latter group ranked higher in the whole-genome context, with STY4586, STY4644 and STY4664 also lying in the 154th innermost core. In contrast, those from the former ranking group (EC) mostly lay around core numbers 56–70. This indicates that the BC rankers are more important in their interactions with other proteins, forming bridges among them and thereby showing high betweenness.

In an earlier work by Lahiri et al., SicA was found in the innermost core of the interactome comprising the five most extensively studied SPIs of *S*. Typhimurium [11]. This core group had IacP, InvA, InvB, InvC, InvE, InvF, InvG, InvI, InvJ, OrgA, OrgB, OrgC, PrgH, PrgI, PrgJ, PrgK, SipA, SipB, SipC, SipD, SpaO, SpaQ, SpaS, SpiC, SptP, SsaJ, SseC, SseD and SseF as other members. In the context of *S*. Typhi, IacP, InvE, InvF, InvG, PrgK, SpaL (InvC in *S*. Typhimurium), SpaO, SpaS and SptP all shared the same innermost 154th core, with a close contestant, SsaJ, in the 153rd core. Interestingly, all these proteins belong to the SPI-1 and SPI-2 groups, which make up the needle for injecting the virulence factors, as delineated in **Figure 4** of Lahiri et al. All these take us to the juncture where we

**Figure 3.** Distribution of the k-shell sizes for the set of proteins from the WhoG-PIN of *S*. Typhi CT18.

proposed by Erdős and Rényi [53] or a small-world type proposed by Watts and Strogatz [54]. The idea was to see whether the connectivity distribution, *P*(k), the probability that a node in the network is connected to k other nodes, decays exponentially for large values of k. It was observed that the WhoG-PIN roughly follows a power law and is free of a characteristic scale [55], with a tailed degree distribution (**Figure 2**).

## **6. Decomposition of WhoG-PIN**

To get an idea of the indispensable ones from the barrage of proteins involved in the individual SPI-PINs and AS, we performed a k-core analysis for them. A k-core is a maximal subgraph in which every node has degree at least k. Nodes that are part of the k-core but not of the (k+1)-core constitute the k-shell. This classifies the nodes (proteins, in our study) based on the variety of their interacting partners. Proteins in the outer shells have lower k values and thus reflect a limited number of interacting partner proteins. Proteins in the inner k-cores/shells, by contrast, are specific ones that interact heavily with each other and can thus be considered the most important. Decomposing this core decomposes the network, which makes it the innermost core.

After decomposition of the WhoG-PIN, we obtained the inner core member proteins, which are highly robust, central and thus highly interactive in nature [56]. We arrived at the 154th core with 2180 proteins (**Figure 3**; data not shown). The idea was to look for the top-ranked proteins of the AS-PIN obtained through the EC, DC, CC or BC measures. Interestingly, the top ranker across the EC, DC and CC measures, PilL, belongs to the 111th core and not the 154th core. In contrast, the top-ranking BC protein, BarA, was in the 154th core, with the closely ranked PilV in the 150th core. The only other protein amongst the unanimous top rankers of the AS-PIN, STY4521, had a k-core position of 145. Very strikingly, two other BC top rankers were also in the 154th innermost core along with BarA: the RNA polymerase sigma factor, RpoS, and the chaperone protein, SicA. Comparing the top-ranking proteins of EC and BC analysed for the AS-PIN, proteins of the latter group had higher ranks in the whole genome context, with STY4586, STY4644 and STY4664 all sharing the 154th innermost core. In contrast, those from the former group (EC) mostly fell within core numbers 56–70. This indicates that the BC rankers were more important in their interactions with other proteins, forming bridges amongst them and thereby acquiring high betweenness. In an earlier work by Lahiri et al., SicA was found in the innermost core of the interactome comprising the five most extensively studied SPIs of *S*. Typhimurium [11]. The other members of this core group were IacP, InvA, InvB, InvC, InvE, InvF, InvG, InvI, InvJ, OrgA, OrgB, OrgC, PrgH, PrgI, PrgJ, PrgK, SipA, SipB, SipC, SipD, SpaO, SpaQ, SpaS, SpiC, SptP, SsaJ, SseC, SseD and SseF. In the context of *S*. Typhi, IacP, InvE, InvF, InvG, PrgK, SpaL (InvC in *S*. Typhimurium), SpaO, SpaS and SptP all shared the same innermost 154th core, with a close contestant, SsaJ, in the 153rd core. Interestingly, all these proteins belong to the SPI-1 and SPI-2 groups, which make up the needle for injecting the virulence factors, as delineated in **Figure 4** of Lahiri et al. All of this brings us to the juncture where we can foresee that the needle proteins are quite important virulence factors when it comes to searching for drug targets. To top them all, SicA stands out as one of the topmost rankers in the BC measure of the AS-PIN and as a member of the innermost core of the WhoG-PIN. This is quite justified, as SicA is a *Salmonella* type III secretion-associated invasin chaperone protein required for the stabilization of SipB and SipC, preventing their premature association, which may lead to their targeting for degradation. Along with InvF, SicA is required for transcriptional activation of several virulence genes such as *sigDE* (*sopB, pipC*), *sipBCDA* and *sopE* [57].

**Figure 4.** Cartographic representation for classification of proteins from the WhoG-PIN of *S*. Typhi CT18 based on its role and region in network space.
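
The k-core decomposition described for the WhoG-PIN can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `core_numbers` and `k_shells`, the node labels and the toy edge list are all hypothetical, and the simple peeling loop is written for readability rather than for a network of the size analysed here. A node's core number is the largest k for which it still belongs to the k-core, and the k-shell collects the nodes whose core number is exactly k.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {node: core number} by repeatedly peeling the lowest-degree node."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest degree among those still present.
        node = min(remaining, key=degree.get)
        k = max(k, degree[node])  # core numbers never decrease while peeling
        core[node] = k
        remaining.remove(node)
        for nbr in adj[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core


def k_shells(core):
    """Group nodes by core number: shell k holds nodes in the k-core but not the (k+1)-core."""
    shells = defaultdict(set)
    for node, k in core.items():
        shells[k].add(node)
    return dict(shells)


# Hypothetical toy network (not real protein data): a 4-clique P1..P4,
# P5 attached to two clique members, and P6 hanging off P5.
toy_edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P3"), ("P2", "P4"), ("P3", "P4"),
    ("P5", "P1"), ("P5", "P2"),
    ("P6", "P5"),
]

core = core_numbers(toy_edges)
shells = k_shells(core)
innermost_k = max(shells)
print(innermost_k, sorted(shells[innermost_k]))
```

In this toy case the clique forms the innermost core, while P5 and P6 fall into outer shells and would be pruned first during decomposition.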

## **7. Cartographic analyses of WhoG-PIN**

To classify the proteins of *S*. Typhi CT18 based on their functional role and region in the network space, we performed a cartographic analysis of the WhoG-PIN. As described earlier, this is delineated by the within-module degree z-score of each node (protein) and its participation coefficient [20]. The within-module degree z-score measures how 'well connected' a node 'i' is to other nodes in its module, while the participation coefficient measures how the links of node 'i' are positioned within its own module and with respect to other modules. These measures are based on the modules of the network, which are calculated by the Rosvall method [21]. The proteins are mainly divided into two major categories, namely the hub nodes and the non-hub nodes.

As the name suggests, a hub is a connection point of many nodes. The non-hub nodes can be assigned four different roles, namely R1 comprising ultra-peripheral nodes, R2 peripheral nodes, R3 non-hub connector nodes and R4 non-hub kinless nodes. Likewise, the hub nodes can be assigned three different roles, namely R5 provincial hubs, R6 connector hubs and R7 kinless hubs (**Figure 4**). The kinless hub nodes, which have high connectivity within their module as well as between modules, are supposed to be important in terms of functionality. Conversely, the ultra-peripheral nodes occupy the least connected positions in the network, followed by the peripheral nodes. These nodes can be pruned easily without much affecting the whole network while decomposing it to reach the core (see the previous section for k-core). The non-hub connectors are expected to take part in only a small but fundamental set of interactions. This is just the opposite of the provincial hubs, which have many within-module connections. The non-hub kinless nodes are those with links homogeneously distributed among all modules. The most conserved in terms of decomposition as well as evolution would, however, be the connector hubs, with many links to most of the other modules. The system would try to retain these connections as essential for its very survival.
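
The two cartographic measures and the R1 to R7 role assignment can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the WhoG-PIN: module labels are taken as a given input (the chapter obtains them with the Rosvall method), the names `cartography` and `assign_role` are hypothetical, and the cut-offs in `assign_role` are those commonly quoted for the cartography scheme of Guimerà and Amaral, since the chapter does not state which values were applied. The sketch uses the usual definitions: the within-module degree z-score compares a node's links inside its module with the module average, and the participation coefficient is one minus the sum of squared fractions of the node's links going to each module.

```python
from collections import defaultdict
from statistics import mean, pstdev


def cartography(edges, module_of):
    """Return (z, participation) dicts for every node that has at least one edge."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    within = {}         # links of a node to nodes in its own module
    participation = {}  # 1 - sum over modules of (links to module / degree)^2
    for node, nbrs in adj.items():
        per_module = defaultdict(int)
        for nbr in nbrs:
            per_module[module_of[nbr]] += 1
        degree = len(nbrs)
        within[node] = per_module[module_of[node]]
        participation[node] = 1.0 - sum((c / degree) ** 2 for c in per_module.values())

    # Within-module degree z-score, computed separately for each module.
    by_module = defaultdict(list)
    for node in adj:
        by_module[module_of[node]].append(within[node])
    z = {}
    for node in adj:
        values = by_module[module_of[node]]
        spread = pstdev(values)
        z[node] = (within[node] - mean(values)) / spread if spread > 0 else 0.0
    return z, participation


def assign_role(z, p, hub_z=2.5):
    """Map (z, P) to a role label; the thresholds are assumed, not taken from the chapter."""
    if z < hub_z:  # non-hub nodes
        if p <= 0.05:
            return "R1"  # ultra-peripheral
        if p <= 0.62:
            return "R2"  # peripheral
        if p <= 0.80:
            return "R3"  # non-hub connector
        return "R4"      # non-hub kinless
    if p <= 0.30:
        return "R5"      # provincial hub
    if p <= 0.75:
        return "R6"      # connector hub
    return "R7"          # kinless hub


# Hypothetical toy network: two triangles (modules "a" and "b") joined by one bridge.
toy_edges = [
    ("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
    ("b1", "b2"), ("b1", "b3"), ("b2", "b3"),
    ("a1", "b1"),
]
toy_modules = {"a1": "a", "a2": "a", "a3": "a", "b1": "b", "b2": "b", "b3": "b"}

z, participation = cartography(toy_edges, toy_modules)
roles = {node: assign_role(z[node], participation[node]) for node in z}
print(roles)
```

In a network this small no node reaches the hub threshold, so only the non-hub roles appear; the bridge nodes simply receive a higher participation coefficient than the nodes whose links all stay inside one module.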

As can be perceived from the above classification of the connectors and the hubs, the proteins playing the R4, R6 and R7 roles are crucial and can be regarded as potential drug targets. In the context of our WhoG-PIN, the only R7 protein is a putative transposase, STY0115, reminiscent of the Tn5 transposase, the enzyme that helps bacteria share antibiotic resistance genes [58, 59]. This is closely followed by the plasmid transfer protein, TrhC, in the R6 group. This could very well be a good drug target, as plasmids are known to be a powerhouse of antibiotic resistance genes [60]. Uncoupling of the phosphotransferase system could also be an effective way of finding targets for novel drugs, as exemplified by PtsG, TreB, NagE and t0287 [61]. Inhibition of glutamate synthase, GltB, has already been utilized as a target in *Mycobacterium tuberculosis* [62], as has uroporphyrinogen decarboxylase, HemE, *albeit* in a different context [63]. Recently, bacterial GCN5-related N-acetyltransferases of the R4 group have been thought of as essential drug targets as well [64]. All the functions of the R7, R6 and R4 proteins are listed in **Table 2**.

**Table 2.** Functions of the R4, R6 and R7 proteins from the WhoG-PIN cartographic analysis.

## **8. Conclusion**

This work schematically delineates a process for identifying the most indispensable protein in a system of interacting proteins of *S*. Typhi. It sets out a computational framework for building the theoretical networks comprising the 17 individual SPI-PINs along with the AS-PIN, followed by the conventional parametric approach of identifying the most interacting protein connected to other important proteins in the concerned phenotype of virulence. This is reinforced by decomposing the WhoG-PIN down to the innermost core of proteins essential for virulence. Together, these analyses identify SicA as the most indispensable amongst a group of other virulence proteins highlighted by network centrality and decomposition analyses. A further investigation of the WhoG-PIN brought forth proteins of an important conserved class, which are potentially the most important ones and thus indispensable among the barrage of other proteins of the whole genome of *S*. Typhi CT18.

## **Acknowledgements**

The authors wish to acknowledge the support of IMSc, Chennai and the Dept. of Computer Applications at BSAU, Chennai for the provision of computational facilities. The personal contributions of Ong Su Yean for the SPI data and of Indrajeet Chakraborty for the formatting are highly appreciated and acknowledged.

## **Author contributions**

CL conceived the concepts and planned and designed the analyses. SP and MIA contributed equally to producing the data analysed by CL. Artwork was done by MIA and SP. CL primarily wrote and edited the manuscript, with additional help from SP, MIA and KMM.

## **Conflict of interest**

The authors declare that they have no conflict of interest.

## **Author details**


Shrikant Pawar¹#, Izhar Ashraf²,³#, Kondamudi Manobhai Mehata and Chandrajit Lahiri⁴\*

\*Address all correspondence to: chandrajitl@sunway.edu.my

1 Department of Biology and Department of Computer Science, Georgia State University, Atlanta, GA, USA

2 The Institute of Mathematical Sciences, Chennai, India

3 B.S. Abdur Rahman University, Vandalur, Chennai, India

4 Department of Biological Sciences, Sunway University, Selangor, Malaysia
