**Analysis of Protein Interaction Networks to Prioritize Drug Targets of Neglected-Diseases Pathogens**

Aldo Segura-Cabrera1,5, Carlos A. García-Pérez1, Mario A. Rodríguez-Pérez2, Xianwu Guo2, Gildardo Rivera3 and Virgilio Bocanegra-García4 *1Laboratorio de Bioinformática 2Laboratorio de Biomedicina Molecular 3Laboratorio de Biotecnología Ambiental 4 Laboratorio de Medicina de Conservación Centro de Biotecnología Genómica, Instituto Politécnico Nacional 5U.A.M. Reynosa Aztlán, Universidad Autónoma de Tamaulipas, Reynosa México* 

#### **1. Introduction**

Many technological, social, and biological systems have been modeled as large networks, providing invaluable insight into how such systems work. Systems biology is an emerging, multidisciplinary field that studies the interactions of cellular components by treating them as parts of an integrated system. It has shown that functional molecules are involved in complex networks of interrelationships, and that most cellular processes depend on functional modules rather than on isolated components. Large amounts of biological network data of different types are available, e.g., protein-protein interaction, transcriptional regulatory, signal transduction, and metabolic networks. Since proteins carry out most biological processes, protein interaction networks (PINs) are of particular importance. Advances in the functional genomics and systems biology of model organisms such as *Saccharomyces cerevisiae*, *Caenorhabditis elegans*, and *Drosophila melanogaster* have contributed to the development of experimental and computational methods, and also to the understanding of complex human diseases. The availability of these methods has facilitated systematic efforts to create large-scale data sets of protein interactions, which are modeled as PINs.

Usually, a PIN is represented as a graph in which the proteins are the nodes and the interactions are the edges. According to complex network theory, PINs are scale-free networks characterized by a power-law degree distribution. In scale-free networks, most nodes have only a few links, whereas a small percentage of nodes interact with a disproportionately large number of others. The nodes with a large number of links in PINs are called hub proteins. Functional genomics studies showed that, in PINs, the deletion of a hub protein is more likely to be lethal to the organism than the deletion of a sparsely connected protein, a phenomenon known as the centrality-lethality rule. This rule is widely believed to reflect the special importance of hubs in organizing the network, which in turn suggests the biological significance of network topology. Several well-studied proteins that are implicated in human diseases are hub proteins. Examples include p53, p21, p27, BRCA1, ubiquitin, calmodulin, and others that play central roles in various cellular mechanisms.

Despite recent advances in the systems biology of model organisms, the systems biology of human pathogenic organisms such as those that cause the so-called "neglected-diseases" has not received much attention. Neglected-diseases are chronic or otherwise disabling infections affecting more than 1 billion people worldwide, mainly in Africa. Pathogens of neglected-diseases include protozoan parasites (e.g., *Leishmania* spp., *Plasmodium* spp., and *Trypanosoma* spp.), vector-borne helminths (e.g., *Schistosoma* spp., *Brugia malayi*, and *Onchocerca volvulus*), soil-transmitted helminths (e.g., *Ascaris lumbricoides* and *Trichuris trichiura*), bacteria (e.g., *Mycobacterium tuberculosis* and *M. leprae*), and viruses (e.g., dengue and yellow fever virus). A number of factors limit the utility of existing drugs against neglected-diseases, such as high cost, poor compliance, drug resistance, low efficacy, and poor safety. Since the evolution of drug resistance is likely to compromise every drug over time, the demand for new drugs and targets is continuous. Drug target identification is the first step in the drug discovery process. This step is complicated because a drug target must satisfy a variety of criteria. The most important factors in this context are the toxicity to the host and the essentiality of the target to the pathogen's growth and survival. Thus, the topological and functional analysis of neglected-disease pathogen PINs offers a potentially effective strategy for identifying and prioritizing new drug targets.

This chapter introduces the reader to the basic concepts of network analysis and outlines why they are important for predicting protein function and essentiality. Work involving PINs of neglected-disease pathogens is explained so that the reader will understand the current state of their application to prioritizing drug targets. The experimental and computational methods most likely to be used to identify and predict PINs, and the strategies for identifying multiple potential drug targets in neglected-disease pathogens, are also outlined using several biological databases in an integrated way.

To achieve this goal, the chapter includes three sections. Firstly, we present an outline of the conceptual development of network biology. Applied functional genomics involving the analysis of PINs of model organisms has led to the development of methods and principles for elucidating protein function. We also explain how these concepts are connected with protein essentiality to identify the "weak" points in the PINs of neglected-disease pathogens, and how they are used for prioritizing drug targets. In the second section, we outline the experimental and computational methods that are most extensively used to identify and predict PINs. Some new approaches for predicting PINs are also introduced. These include the probabilistic integrated network methods, which have been shown to increase the accuracy and coverage of PINs. The primary research articles are reviewed and their potential future applications are explained. This section focuses mainly on analyzing the PINs of the most prevalent neglected-disease pathogens, for which the use of drugs is often limited by factors including high cost, low efficacy, toxicity, and the emergence of drug resistance. Their potential use as an integrated strategy aimed at prioritizing and identifying drug targets of neglected-disease pathogens is put forward, and the case for future research involving the application of many tools and strategies is discussed. In the final section, we describe, in an accessible way, the basic criteria for selecting pathogen drug targets, and the PINs of neglected-disease pathogens are described in such a manner that the chapter can serve as a source of key literature references for students and researchers. Papers are reviewed to describe these basic principles, using key publications containing data and quantitative analyses (models, figures, tables) for PINs of some neglected-disease pathogens. We describe novel lines of research and the pros and cons of using PINs for prioritizing and identifying drug targets of neglected-disease pathogens.

#### **2. Systems and network biology: Basic concepts**

Systems biology is a holistic approach that studies the inter-relationships of all the different elements in a biological system in order to understand the non-deterministic behaviors that emerge from the interactions between cellular components and their environment, rather than studying those components in isolation, one at a time (Hood and Perlmutter 2004, Weston and Hood 2004, Kohl and Noble 2009). Thus, the cell's behavior can be understood as a consequence of the complex interactions between its numerous constituents, such as DNA, RNA, proteins, and metabolites. These interactions are also responsible for performing processes critical to cellular survival. For example, during transcription, regulatory proteins can activate or inhibit the expression of genes, or regulate each other, as part of gene regulatory networks. Likewise, cellular metabolism can be integrated into a metabolic network whose fluxes are regulated by enzymes. Similarly, PINs represent how proteins work together through interactions that lead to the modification of protein functions or to new roles in protein complexes.

Because biological systems consist of interacting cellular components, they have been studied with graph theory and graph-based mathematical tools, in which the individual components are represented by nodes and the interactions by links (Fig. 1). Albert and Barabási (2002) described the general properties shared by several networks, ranging from the Internet to social and biological networks. The analysis of the topology of those networks showed that they deviate substantially from randomly built networks as studied by Erdős and Rényi (Fig. 1a) (Erdős and Rényi 1960). These networks did not show the bell-shaped frequency distribution of the number of links per node that is expected from randomly formed networks; instead, they showed a power-law distribution, which is characteristic of scale-free networks (Fig. 1b and 1c) (Amaral *et al.,* 2000, Albert 2005).

In a scale-free network, the majority of nodes have only a few links, whereas very few nodes have a large number of links. Those nodes are called hubs, and they represent the most vulnerable points of a network (Barabási and Albert 1999, Albert *et al.,* 2000, Jeong *et al.,* 2001, Yu *et al.,* 2004a, Tew *et al.,* 2007). The topological features of networks can be quantified by measuring topological parameters whose information content provides a description from the local level (e.g., single nodes or links) to the network-wide level (e.g., connections and relationships between nodes). For example, the nodes of a graph can be characterized by the number of links they have, that is, the number of other nodes to which they are connected. This parameter is called the "node degree". In directed networks, it is possible to distinguish the number of directed links that point toward the node (in-degree) from the number of directed links that point away from the node (out-degree). The node degree characterizes individual nodes; in order to relate this parameter to the whole network, a network degree distribution can be defined. The degree distribution *P*(*k*) represents the fraction of nodes that have degree *k*, and it is obtained by counting the number of nodes *N*(*k*) that have *k* = 1, 2, … links and dividing it by the total number of nodes *N*. The degree distributions of numerous networks, such as the Internet, social, and biological networks, follow a power law (Fig. 1b and 1c), which is defined by the functional form $P(k) \sim k^{-\gamma}$, where γ is the degree exponent, usually taking values in the range 2 < γ < 3 (Barabási and Oltvai 2004). This function is intimately linked to the growth of the network, in which new nodes are preferentially attached to already established nodes, a property that is also thought to characterize the evolution of biological systems (Jeong *et al.,* 2000).
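As a rough illustration of the degree distribution defined above, the following minimal sketch computes *N*(*k*) and *P*(*k*) = *N*(*k*)/*N* for a small undirected network. The function name `degree_distribution` and the toy edge list are hypothetical and chosen only for this example; real PIN data would be loaded from an interaction database instead.

```python
from collections import Counter


def degree_distribution(edges):
    """Return {k: P(k)} for an undirected network given as (u, v) pairs."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # self-interactions are ignored in this sketch
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    n_nodes = len(neighbors)  # total number of nodes N
    # N(k): number of nodes that have exactly k links
    counts = Counter(len(nbrs) for nbrs in neighbors.values())
    return {k: n_k / n_nodes for k, n_k in sorted(counts.items())}


# Hypothetical toy network: node "A" acts as a small hub.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("D", "E")]
p_k = degree_distribution(toy_edges)
print(p_k)
```

In this toy network most nodes have one or two links and a single node has four, which mimics, on a very small scale, the hub-dominated pattern described in the text. Estimating the exponent γ reliably requires far larger networks and is not attempted here.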

Fig. 1. Three types of network models and their associated distributions: (a) random network, (b) scale-free network, and (c) hierarchical network.

The distance between any two nodes in a network can be defined by the path length, that is, the number of links that must be traversed to go from one node to the other. There may be many alternative paths between two nodes in a network; the path with the smallest number of links between the selected nodes (the shortest path) is of special interest. A common characteristic of several biological networks, including metabolic networks (Jeong *et al.,* 2000, Wagner and Fell 2001) and PINs (Giot *et al.,* 2003, Yook *et al.,* 2004), is that any two nodes can be connected by a path of only a few links. The main biological implications of this characteristic are related to: i) the capacity of biological networks to respond rapidly to perturbations; ii) their capacity to employ alternative routes for the same input and output; and iii) their ability to compensate efficiently for perturbations in essential pathways.
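The shortest path length described above can be computed with a breadth-first search when all links are unweighted. The sketch below assumes the network is stored as a dictionary that maps each node to the set of its neighbors; the function name and the toy network are hypothetical.

```python
from collections import deque


def shortest_path_length(neighbors, source, target):
    """Number of links on a shortest path, or None if no path exists."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in neighbors.get(node, ()):
            if nbr not in dist:  # first visit is the shortest in an unweighted graph
                dist[nbr] = dist[node] + 1
                if nbr == target:
                    return dist[nbr]
                queue.append(nbr)
    return None


# Hypothetical toy network with two alternative routes from A to D
# and one isolated node F.
toy = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
    "F": set(),
}
a_to_e = shortest_path_length(toy, "A", "E")
a_to_f = shortest_path_length(toy, "A", "F")
print(a_to_e, a_to_f)
```

Here A reaches E in three links through either B or C, which illustrates the alternative routes mentioned in the text, while the isolated node F is unreachable.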

Another important concept derived from network analysis is modularity, which describes how a group of physically or functionally linked nodes work together to achieve a particular function. The topological parameter used to quantify the modularity of a network is the clustering coefficient *Ci*, which represents the ratio between the number of links connecting the nodes adjacent to node *i* and the total possible number of links among them (Watts and Strogatz 1998). It is worth noting that, at first sight, the modularity concept might seem to contradict the scale-free nature of these networks, because the presence of modules implies that there are clusters of nodes that are relatively isolated from the rest of the network. However, it has been demonstrated that modularity and scale-free properties naturally co-occur in biological networks, indicating that modules are not independent; instead, they are combined to form a hierarchical network (Fig. 1c) (Ravasz *et al.,* 2002).
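For a node *i* with *ki* neighbors and *ni* links among those neighbors, the ratio described above is commonly written as *Ci* = 2*ni* / (*ki*(*ki* − 1)). The following minimal sketch applies that formula to an undirected network stored as a neighbor dictionary. The function names and the toy network are hypothetical, and assigning *Ci* = 0 to nodes with fewer than two neighbors is a convention assumed here.

```python
from itertools import combinations


def clustering_coefficient(neighbors, node):
    """C_i = 2 * n_i / (k_i * (k_i - 1)) for one node of an undirected network."""
    nbrs = neighbors[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention: no neighbor pairs, so C_i is set to 0
    # n_i: links that actually exist among the neighbors of the node
    links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
    return 2 * links / (k * (k - 1))


def average_clustering(neighbors):
    """Mean of C_i over all nodes."""
    return sum(clustering_coefficient(neighbors, n) for n in neighbors) / len(neighbors)


# Hypothetical toy network: a triangle A-B-C plus a pendant node D attached to A.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
c_a = clustering_coefficient(toy, "A")
c_b = clustering_coefficient(toy, "B")
c_avg = average_clustering(toy)
print(c_a, c_b, c_avg)
```

Node A has three neighbors but only one of the three possible links among them exists, so its coefficient is 1/3, whereas B sits entirely inside the triangle and reaches the maximum value of 1.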

Biological networks, including PINs and metabolic networks, are good examples of network modularity because they exhibit a high average *Ci*, which is associated with a high level of network robustness (Alon *et al.,* 1999, Ravasz *et al.,* 2002, Barabási and Oltvai 2004). The most common representation of a module or cluster in a network is a highly interconnected group of nodes. The biological implication of the modularity concept is that the nodes that make up a module tend to participate in related biological processes and pathways; for example, protein and nucleic-acid synthesis, protein degradation, signal transduction, and metabolic pathways (Ma'ayan *et al.,* 2005). Analyses of experimental PINs have shown that they have a remarkably modular character (Giot *et al.,* 2003, Yook *et al.,* 2004). These findings in experimental PIN maps have been used to improve the understanding of pleiotropic effects, and of how perturbations of genes or proteins can propagate through the network and produce apparently unrelated or extensive effects.

In addition to modules, small and recurring sub-graphs with well-defined topologies, known as interaction motifs, can be identified within a network (Fig. 2). Frequency analysis of these interaction motifs revealed that they are over-represented when compared to a randomized version of the same network, suggesting that not all sub-graphs are equally significant and that interaction motifs form functionally separable building blocks of cellular networks (Mangan and Alon 2003, Wuchty *et al.,* 2003, Alon 2007). For example, triangle motifs, also called feed-forward loops in directed networks, appear in both transcription-regulatory and neural networks. Likewise, there is evidence suggesting that specific motif types aggregate to form large motif clusters, which also appear to be commonly involved in certain functional roles (Milo *et al.,* 2002, Shen-Orr *et al.,* 2002, Wuchty *et al.,* 2003). For example, in the *E. coli* transcription regulatory network most motifs overlap, so that the individual motifs are no longer clearly separable (Shen-Orr *et al.*, 2002).

Fig. 2. Some types of interaction motifs found in biological networks.
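As one concrete case of the motifs discussed above, the sketch below counts feed-forward loops, i.e., triples in which X regulates Y, Y regulates Z, and X also regulates Z directly, in a directed network. The function name, the successor-dictionary representation, and the toy regulatory network are hypothetical. The comparison against randomized networks that establishes over-representation is not performed here.

```python
def count_feed_forward_loops(successors):
    """Count triples (x, y, z) with directed links x->y, y->z and x->z."""
    count = 0
    for x, x_targets in successors.items():
        for y in x_targets:
            if y == x:
                continue  # skip self-regulation
            for z in successors.get(y, ()):
                # z must be a third, distinct node that x also targets directly
                if z != x and z != y and z in x_targets:
                    count += 1
    return count


# Hypothetical toy regulatory network: X -> Y -> Z with the shortcut X -> Z,
# plus a downstream target W that is not part of any feed-forward loop.
toy = {
    "X": {"Y", "Z"},
    "Y": {"Z"},
    "Z": {"W"},
    "W": set(),
}
n_ffl = count_feed_forward_loops(toy)
print(n_ffl)
```

To judge whether a motif is over-represented, the same count would be repeated on many randomized networks that preserve each node's in-degree and out-degree, and the observed count compared with that null distribution.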

The relevance of a node in mediating the flow of communication among other nodes in the network is quantified by its betweenness centrality, which is defined as the total number of non-redundant shortest paths going through a certain node or edge (Freeman 1977). Girvan and Newman (2002) proposed that the edges with high betweenness are the ones that lie "between" network clusters; therefore, the information flow within a network could be altered by removing these edges. Dunn *et al*. (2005), using a method based on edge betweenness, showed that the proteins within clusters in PINs tend to share similar functions. Moreover, Yu *et al*. (2007) reconsidered the classical meaning of betweenness as a measure of the centrality of the nodes in a PIN. They defined the nodes with the highest betweenness centrality as "bottlenecks" and found that bottleneck nodes have a higher probability of being essential.
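Node betweenness can be computed for unweighted, undirected networks with Brandes' algorithm, which runs one breadth-first search per node. The sketch below is a minimal version under stated assumptions: the neighbor dictionary is symmetric and lists every node as a key, and when several shortest paths exist between a pair, each path contributes an equal fraction. The function name and the toy network are hypothetical.

```python
from collections import deque


def betweenness_centrality(neighbors):
    """Unnormalized node betweenness for an unweighted, undirected network."""
    bc = {v: 0.0 for v in neighbors}
    for s in neighbors:
        order = []                          # nodes in order of increasing distance from s
        preds = {v: [] for v in neighbors}  # predecessors on shortest paths from s
        sigma = {v: 0 for v in neighbors}   # number of shortest paths from s
        dist = {v: -1 for v in neighbors}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in neighbors[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies, starting from the farthest nodes.
        delta = {v: 0.0 for v in neighbors}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints in an undirected network.
    return {v: b / 2 for v, b in bc.items()}


# Hypothetical toy network: two triangles (A, B, C) and (D, E, F) joined by the link C-D.
toy = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
bc = betweenness_centrality(toy)
top_ranked = sorted(bc, key=bc.get, reverse=True)[:2]
print(bc, top_ranked)
```

Nodes C and D carry every shortest path between the two triangles and therefore rank highest, which is the kind of position the text describes as a bottleneck. Where to place the cutoff for calling a node a bottleneck is left open here.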

It is worth noting that the topological parameters can be combined with each other or with additional functional annotations of the network nodes (genes or proteins). Thus, a network provides testable predictions ranging from single interactions to essential genes and functional modules (del Rio *et al.,* 2009). Likewise, the functions of unannotated genes or proteins can also be predicted on the basis of the annotations of their interacting partners. This approach to predicting protein or gene function is known as "guilt by association". Additionally, the integration of information related to diseases or specific phenotypes with network approaches enhances the understanding of human diseases, pharmacological response, and phenotype prediction (Ideker and Sharan 2008, Lee *et al.,* 2008a, Lee *et al.,* 2010, Wang and Marcotte 2010, Lee *et al.,* 2011).
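The "guilt by association" idea can be sketched as a simple majority vote over the annotated interaction partners of an unannotated protein. This is only one of several possible scoring schemes and is used here as an assumption for illustration; the function name, protein identifiers, and annotations below are hypothetical.

```python
from collections import Counter


def predict_function(neighbors, annotations, protein):
    """Return the most frequent annotation among the protein's partners, or None."""
    votes = Counter(
        annotations[p]
        for p in sorted(neighbors.get(protein, ()))  # sorted for reproducible tie-breaking
        if p in annotations
    )
    if not votes:
        return None  # no annotated partner, so no prediction is made
    return votes.most_common(1)[0][0]


# Hypothetical toy data: protein U1 is unannotated and interacts with P1, P2 and P3;
# U2 interacts only with the unannotated U1.
toy_neighbors = {
    "U1": {"P1", "P2", "P3", "U2"},
    "U2": {"U1"},
}
toy_annotations = {
    "P1": "signal transduction",
    "P2": "signal transduction",
    "P3": "protein degradation",
}
pred_u1 = predict_function(toy_neighbors, toy_annotations, "U1")
pred_u2 = predict_function(toy_neighbors, toy_annotations, "U2")
print(pred_u1, pred_u2)
```

In practice, the votes could be weighted by the confidence of each interaction, and predictions based on very few annotated partners would be treated with caution.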

#### **3. Methods to identify protein interactions networks (PINs)**

#### **3.1 Experimental methods**

In the postgenomic era, the accumulation of protein-protein interaction data has enabled systems biology studies at the PIN level (von Mering *et al.*, 2002). However, PIN analysis requires methods amenable to high-throughput (HT) screening, such as large-scale versions of the yeast two-hybrid (Y2H) system and tandem affinity purification coupled to mass spectrometry (TAP-MS), for performing systematic screens (Ito *et al.,* 2001a, Cusick *et al.,* 2005). In addition, there is a wide variety of methods to detect, analyze, and quantify protein interactions, including surface plasmon resonance spectroscopy, nuclear magnetic resonance (NMR), X-ray crystallography, and fluorescence-based technologies. These techniques provide detailed information on the physical properties of protein interactions.

These methods are extremely useful; however, only the techniques that can be applied to determine protein-protein interactions at large scale are highlighted here. In particular, the outcomes of the Y2H system and TAP-MS are used further to perform *in silico* global network analysis. Both techniques were intensively applied to map the PIN of yeast, the first model organism with available PINs (Uetz *et al.,* 2000, Ito *et al.,* 2001b, Gavin *et al.,* 2002, Ho *et al.,* 2002, Ito *et al.,* 2002, Tong *et al.,* 2004, Yu *et al.,* 2008). Afterwards, large-scale efforts were made to determine PINs for other minor eukaryotic model organisms, namely *D. melanogaster* (Giot *et al*., 2003) and *C. elegans* (Li *et al*., 2004); for pathogenic microorganisms, namely *Helicobacter pylori*, *Campylobacter jejuni*, *Treponema pallidum*, *M. tuberculosis* (Wang *et al*., 2010), herpes simplex virus 1 (Lee *et al*., 2008b), and Kaposi's sarcoma-associated herpesvirus (Uetz *et al.,* 2006, Rozen *et al.,* 2008); and for major eukaryotic organisms, namely *Arabidopsis thaliana* (de Folter *et al*., 2005) and humans (Rual *et al.,* 2005, Stelzl *et al.,* 2005, Gandhi *et al.,* 2006). Even though these PINs are incomplete, they provide insight into how particular properties of proteins are integrated at the systems level, and they are a useful resource for predicting the functional roles of genes or proteins.

#### **3.1.2 Yeast two-hybrid (Y2H) system**

The Y2H system has considerably accelerated the *in vivo* large-scale screening of protein interactions, enabling the detection of physically interacting proteins by exploiting the modular organization of eukaryotic transcriptional activators. Eukaryotic transcription activators are formed by at least two distinct domains, one responsible for binding to a promoter DNA region (BD) and the other for activating transcription (AD). It is well known that splitting the BD and AD domains inactivates transcription, but transcription can be restored if a BD domain is re-associated with an AD domain (Fields and Song 1989). Thus, the standard Y2H system includes a BD domain fused to the "bait" protein-coding region and an AD domain fused to the "prey" protein-coding region. When the BD-bait and AD-prey fusions are co-expressed in the nucleus of yeast cells, the bait-prey interaction reconstitutes a functional transcription factor that activates the transcription of a reporter gene (Fig. 3). The most widely used Y2H system is based on GAL4/LexA, where the GAL4 protein controls the expression of the LacZ gene encoding beta-galactosidase.

The main advantages of the Y2H system are: i) the DNA (not the protein) is manipulated to study both bait and prey proteins (Walhout and Vidal 2001a); ii) it allows protein interactions to be identified *in vivo*; iii) it can identify transitory protein interactions; and iv) it is amenable to high-throughput screening methods (Buckholz *et al.,* 1999, Uetz and Hughes 2000, Walhout and Vidal 2001b, Ito *et al.,* 2002, Rual *et al.,* 2005).

The drawbacks include: i) a high proportion of false positives and false negatives (Vidal and Legrain 1999, Ito *et al.,* 2002); ii) the forced sub-cellular localization of bait and prey in the yeast nucleus, which might preclude certain interactions from taking place (Cusick *et al.*, 2005); for example, membrane protein interactions cannot be identified by the standard Y2H system because the AD-prey fusion is retained at the membrane, which prevents the reconstitution of a functional transcription factor (Xia *et al.*, 2006); iii) the over-expression of the tested proteins, which modifies the relative concentrations of potential interaction partners in comparison to the *in vivo* state; iv) the presence of auto-activators, i.e., proteins that initiate transcription by themselves (Cusick *et al*., 2005); and v) the differences in post-translational modifications and protein folding processes between yeasts and other organisms (Shoemaker and Panchenko 2007). Given these drawbacks, several modifications have been made to improve the quality of Y2H results, including the development of membrane Y2H, the inclusion of different promoters for the reporter genes, the use of low-copy vectors, and the reduction of auto-activators. Once these drawbacks are reduced, the quality of Y2H results is significantly improved (Lehner *et al.,* 2004, Li *et al.,* 2004, Rual *et al.,* 2005, Yu *et al.,* 2008).

Fig. 3. The Y2H system. Y2H detects interactions between proteins X and Y, where X is linked to the BD domain, which binds to the promoter DNA region.

#### **3.1.3 Tandem affinity purification-tag coupled to mass spectrometry (TAP-MS)**

The TAP-MS method is a powerful approach to determine the composition of relevant protein complexes. In this method, a target protein-coding region is fused with a DNA sequence encoding an affinity tag; the tagged protein is expressed together with the other cellular proteins, followed by two-step affinity purification (AP) and elucidation of the complex components by mass spectrometry (MS). A typical TAP tag is formed by an immunoglobulin-interacting domain of protein A (protA) and a calmodulin-binding peptide (CBP) (Fig. 4). The protA and CBP binding domains are separated by a short recognition sequence for the site-specific tobacco etch virus protease (TEV protease). The TEV site allows proteolytic elution of the protein complex from IgG-sepharose after the first affinity-purification step, which is based on the protA/IgG-sepharose interaction. The eluted protein complex is further purified by binding to a calmodulin affinity resin, eluted with EGTA, and processed for identification by MS analysis.


Fig. 4. TAP-MS method. TAP purifies protein complexes and removes the molecules of contaminants and MS identifies the complex components.

Like the Y2H system, the TAP-MS method shows a high rate of false positives and false negatives, and it misses many transient interactions. In contrast to the Y2H system, the TAP-MS method can elucidate higher-order interactions beyond binary interactions and, therefore, provides direct information on protein complexes. Several large-scale studies of protein complexes have been performed using the TAP-MS method (Gavin *et al.,* 2002, Ho *et al.,* 2002, Gavin *et al.,* 2006). For example, Gavin *et al*. (2006) used 5,500 ORFs fused to DNA sequences encoding an affinity tag to analyze the PIN of *S. cerevisiae*. They found 491 complexes, of which 257 were novel, showing that the PIN of *S. cerevisiae* has a modular organization (Gavin *et al*., 2006). In addition, Stingl *et al*. (2008) elucidated the urease interactome of *H. pylori*. They combined the tandem affinity purification protocol with *in vivo* cross-linking in order to capture transient interactions, which represents an improvement to the TAP-MS method (Stingl *et al*., 2008).

The use of orthogonal experimental approaches has demonstrated that Y2H and TAP-MS interaction data sets contain mostly highly reliable interactions. It has been suggested that integrating data from the two approaches can also increase confidence in either data set, and this has provided support for deriving predictions from these approaches (Cusick *et al.*, 2005). Moreover, Venkatesan *et al*. (2009) developed a framework to estimate various quality parameters associated with the methods currently used to identify PINs. The combination of these quality parameters (screening completeness, assay sensitivity, sampling sensitivity, and precision) has yielded an estimate of the size of the human binary interactome and a path toward the completion of its mapping (Venkatesan *et al*., 2009).

Despite their technical and biological limitations (Cusick *et al.*, 2005), the aforementioned methods have lost none of their impact on PIN studies. Instead, they mark a paradigm shift from the one-gene/one-function reductionist approach to a more systemic approach that can capture all potential interactions encoded in a genome or proteome.

#### **3.1.4 Protein interaction databases**

The large amounts of protein interaction data produced by high-throughput experimental methods such as Y2H and TAP-MS, and analyzed by bioinformatics, have led several research groups to design and set up databases that hold carefully analyzed information about protein-protein interactions. Table 1 summarizes the most significant public databases of protein-protein interactions published to date. These databases contain interactions obtained by direct submission from experimentalists, text mining, and other data sources. There are also online resources that integrate information from several of the databases listed in Table 1, or that provide tools to browse and visualize such data, for example APID (Prieto and De Las Rivas 2006, Hernandez-Toro *et al.,* 2007) and PINA (Wu *et al*., 2009). The information deposited in these databases is verified using automated algorithms or manual curation, as in the DIP database (Deane *et al*., 2002). Altogether, protein interaction databases are an invaluable resource for projects that aim to analyze the PINs of organisms ranging from viruses to humans.


| **Database** | **Type of data** | **Number of interactions** | **Website** |
|---|---|---|---|
| DIP | E,C,S | 71,589 | http://dip.doe-mbi.ucla.edu |
| MINT | E,C | 235,635 | http://mint.bio.uniroma2.it |
| IntAct | E,C | 275,144 | http://www.ebi.ac.uk/intact/ |
| BioGRID | E,C | 282,005 | http://thebiogrid.org/ |
| HPRD | E,C | 39,194 | http://www.hprd.org/ |
| APID | I | 322,579 | http://bioinfow.dep.usal.es/apid/apid2net.html |
| PINA | I | 221,702 | http://cbg.garvan.unsw.edu.au/pina |

Table 1. Most representative databases of protein-protein interactions. (E) high-throughput experimental data; (S) structural data; (C) manual curation, and (I) integrative resource. The number of interactions was updated on September 29, 2011.

#### **3.2 Computational methods to predict protein interactions networks (PINs)**

Parallel to the experimental methods, several computational methods have been designed to predict protein-protein interactions. Initially, these methods were strictly limited to proteins whose three-dimensional structures had been determined (structure-based methods). The completion of genome sequences has provided large amounts of genomic information, enabling the analysis of a given gene from its genomic context. Thus, a number of computational methods and resources have been developed for the prediction of protein interactions from genomic information (genomic context-based methods), even in those cases where the three-dimensional structures are not yet known (Galperin and Koonin 2000, Huynen *et al.,* 2000, Huynen and Snel 2000).

Hereinafter, we will describe computational methods and resources available for protein interaction prediction that exploit the genomic and biological contexts of proteins for complete genomes.

#### **3.2.1 Genomic context-based methods**

#### **3.2.1.1 Gene neighborhood**

The gene neighborhood method exploits the notion that genes whose products physically interact, or that are functionally associated with the same process or pathway, tend to be adjacent to each other in the genome (Fig. 5a) (Tamames *et al.,* 1997, Overbeek *et al.,* 1999, Bowers *et al.,* 2004). For example, Dandekar *et al*. (1998) have shown that the neighborhood relationship can be used as a fingerprint, suggesting that the proteins encoded by these genes may physically interact (Dandekar *et al*., 1998). The most representative example of this phenomenon can be found in bacterial operons, where genes that work together are generally transcribed as a unit. Furthermore, operons that encode co-regulated genes are usually conserved. The neighborhood relationship tends to be more relevant when it is conserved across different species (Tamames *et al*., 1997). Hence, the gene neighborhood method, like many comparative genomics approaches, becomes more robust when a larger number of genomes is used for the prediction. Since operons and gene neighborhoods are uncommon in eukaryotic species (Zorio *et al.,* 1994, Blumenthal 1998, Liu and Han 2009, Fitzpatrick *et al.,* 2010), this method is principally applicable to bacteria, where such genome properties are relevant.

Fig. 5. Genomic context-based methods. (a) Gene neighborhood plots for four organisms, showing a pair of genes (blue and magenta) which are in close proximity in all four organisms. (b) Example phylogenetic profiles of four proteins from three organisms. Proteins 1 and 4 have the same pattern of co-occurrence in all three organisms, and may physically interact based on this evidence. (c) A gene fusion event between two proteins (green and magenta) in two organisms is shown. Thus, proteins a and b from organism 1 are predicted to interact because they form part of a single protein in organism 2.
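The conserved-neighborhood idea can be illustrated with a minimal Python sketch. It assumes that each genome has already been reduced to an ordered list of ortholog-group labels on a linear chromosome; the function names, the `max_gap` and `min_genomes` parameters, and the toy genomes are hypothetical and are not taken from any published tool.

```python
from collections import Counter


def neighbor_pairs(gene_order, max_gap=1):
    """Unordered pairs of distinct genes lying within max_gap positions in one genome."""
    pairs = set()
    for i, gene in enumerate(gene_order):
        # Look ahead at most max_gap positions along the (linear) chromosome.
        for j in range(i + 1, min(i + max_gap + 1, len(gene_order))):
            if gene != gene_order[j]:
                pairs.add(frozenset((gene, gene_order[j])))
    return pairs


def conserved_neighbors(genomes, min_genomes=3, max_gap=1):
    """Gene pairs that are neighbors in at least min_genomes genomes."""
    counts = Counter()
    for order in genomes.values():
        # Each genome contributes at most one vote per pair.
        counts.update(neighbor_pairs(order, max_gap))
    return {tuple(sorted(p)): n for p, n in counts.items() if n >= min_genomes}


# Toy data (hypothetical): four genomes, genes labeled by ortholog group.
genomes = {
    "org1": ["A", "B", "C", "D"],
    "org2": ["C", "A", "B", "D"],
    "org3": ["D", "B", "A", "C"],
    "org4": ["B", "A", "D", "C"],
}
print(conserved_neighbors(genomes))  # {('A', 'B'): 4}
```

In this toy case only the pair A/B is adjacent in all four genomes, mirroring the situation drawn in Fig. 5a. A real analysis would also need to handle circular chromosomes, strand orientation, and the evolutionary distance between the genomes compared.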

#### **3.2.1.2 Phylogenetic profiles**

The phylogenetic profile method is based on the co-occurrence of pairs of genes across multiple genomes (Fig. 5b). A pair of orthologous genes that remains together across many distant species suggests a concerted evolution mechanism and indicates that these genes need to be simultaneously present to participate in the same biological process or pathway, or to interact physically. A phylogenetic profile is commonly represented as a vector recording the presence or absence of a gene across multiple genomes (Fig. 5b), where "1" or "0" at each position of the profile denotes presence or absence, respectively (Ouzounis and Kyrpides 1996, Rivera *et al.,* 1998, Pellegrini *et al.,* 1999).
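As a minimal sketch of how such profiles can be compared, the following Python code represents each profile as a tuple of 0/1 values and reports protein pairs whose profiles differ in at most `max_distance` genomes (Hamming distance). The function names, the threshold, and the toy profiles are assumptions made for illustration; published implementations use a variety of similarity measures.

```python
from itertools import combinations


def hamming(p, q):
    """Number of genomes in which two presence/absence profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p, q))


def similar_profiles(profiles, max_distance=0):
    """Pairs of proteins whose profiles differ in at most max_distance genomes."""
    hits = []
    for (name1, p1), (name2, p2) in combinations(sorted(profiles.items()), 2):
        # Proteins present in every genome (or in none) carry no signal.
        if len(set(p1)) < 2 or len(set(p2)) < 2:
            continue
        if hamming(p1, p2) <= max_distance:
            hits.append((name1, name2))
    return hits


# Toy data (hypothetical): four proteins profiled across three genomes,
# with 1 = gene present and 0 = gene absent.
profiles = {
    "P1": (1, 0, 1),
    "P2": (1, 1, 0),
    "P3": (0, 1, 1),
    "P4": (1, 0, 1),
}
print(similar_profiles(profiles))  # [('P1', 'P4')]
```

Here P1 and P4 share an identical profile and are therefore the only predicted pair, as in the example of Fig. 5b. With only three genomes such a match could easily arise by chance, which is why the number and diversity of genomes matter.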

The main drawbacks of this method are: i) it can only be applied to complete genomes; ii) the robustness of the prediction depends on the number and distribution of the genomes used to build the profile, so that a pair of genes with similar profiles across many bacterial, archaeal and eukaryotic genomes is much more likely to interact than genes found to co-occur in a small number of closely related species; iii) its computational cost is high, since many complete genomes need to be compared; and iv) it fails when homology cannot be detected between distant organisms.

As with other genomic context methods, the accuracy of these predictions is expected to improve over time as the number of completely sequenced genomes increases.

#### **3.2.1.3 Gene fusion**

The gene fusion method is based on the fact that some interacting protein domains have homologs in other genomes that are fused into one protein chain, termed a Rosetta Stone protein (Fig. 5c). Thus, gene fusion events have been proposed for the identification of potential protein-protein interactions and of metabolic or regulatory networks (Sali 1999, Galperin and Koonin 2000). The information about gene fusion events can be combined with phylogenomic profiling and the identification of conserved chromosomal localization to test hypotheses leading to the characterization of proteins of unknown function (Marcotte *et al.,* 1999a, Marcotte 2000, Enright and Ouzounis 2001). Marcotte *et al*. (1999) found 6,809 potentially interacting pairs of non-homologous proteins in *E. coli,* revealing that, for more than half of the pairs, both members were functionally associated. Other approaches with similar results have been used, including in eukaryotic genomes (Enright and Ouzounis 2001).

The drawbacks of this method are related to the domain complexity of eukaryotic proteins, the presence of promiscuous domains, and large degrees of paralogy (Enright *et al.,* 2002).
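The logic of a gene fusion search can be sketched as follows. The sketch assumes that homology detection has already been done and that every protein is annotated with the set of families (or domains) it contains; the function name, the family labels, and the toy proteomes are hypothetical.

```python
from itertools import combinations


def rosetta_stone_pairs(query_proteins, reference_proteins):
    """Predict interacting query proteins from fused proteins in a reference genome.

    Both arguments map a protein name to the set of families it contains.
    Returns {(protein_a, protein_b): set of supporting fused reference proteins}.
    """
    # Index the query proteome: family -> proteins carrying that family.
    by_family = {}
    for protein, families in query_proteins.items():
        for family in families:
            by_family.setdefault(family, set()).add(protein)

    predictions = {}
    for fused, families in reference_proteins.items():
        for f1, f2 in combinations(sorted(families), 2):
            for a in by_family.get(f1, ()):
                for b in by_family.get(f2, ()):
                    # Require two separate proteins, each lacking the other's family.
                    if a == b or f2 in query_proteins[a] or f1 in query_proteins[b]:
                        continue
                    predictions.setdefault(tuple(sorted((a, b))), set()).add(fused)
    return predictions


# Toy data (hypothetical): proteins a and b are separate in organism 1,
# while their homologs form a single chain in organism 2.
organism1 = {"a": {"famX"}, "b": {"famY"}, "c": {"famZ"}}
organism2 = {"ab_fused": {"famX", "famY"}, "z_only": {"famZ"}}
print(rosetta_stone_pairs(organism1, organism2))  # {('a', 'b'): {'ab_fused'}}
```

Promiscuous domains, one of the drawbacks noted above, would appear here as families shared by many unrelated fused proteins, so a practical implementation would need to filter or down-weight them.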

Currently, there are excellent resources implementing the genomic context-based methods. The most notable are the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) and ProLinks. The STRING (URL: http://string-db.org) and ProLinks (URL: http://prl.mbi.ucla.edu) resources provide a web interface giving comprehensive access to gene context information in 1,100 and 900 complete genomes, respectively (Szklarczyk *et al.*, 2011, Bowers *et al.,* 2004).

#### **3.2.2 Interologs**


The use of homology relationships is a key paradigm in molecular biology and genomics. This approach has been extensively exploited to predict protein structure (Abagyan and Batalov 1997, Brenner *et al.,* 1998, Rost 1999), to study sub-cellular localization (Nair and Rost 2002) and enzymatic activity (Devos and Valencia 2001, Todd *et al.,* 2001), and for comparative genomics (Marcotte *et al.,* 1999b, Pellegrini *et al.,* 1999). In the same spirit, an interolog is defined as a conserved interaction between a pair of proteins of a given organism which have interacting homologs in another organism (Yu *et al.*, 2004b). For example, the experimental observation that two yeast proteins interact is extrapolated to predict that the two corresponding homologs in human also interact in a similar way. Walhout and Vidal (2001b) used yeast experimental interaction data (Uetz *et al.,* 2000, Ito *et al.,* 2001b) to infer similar interactions in worm (Fig. 6). Mika and Rost (2006) suggested that the extrapolation of interactions between distant organisms has to be undertaken with some caution. They found that homology transfer is only accurate at high levels of sequence identity, and that it is more reliable for protein pairs from the same species than for protein pairs from different organisms (Mika and Rost 2006). Likewise, Wiles *et al*. (2010) developed a scoring scheme to assess the confidence of interolog predictions. They predicted protein interactions across five species (human, mouse, fly, worm, and yeast) based on available experimental evidence and conservation across species (Wiles *et al.*, 2010). They also developed the Interolog Finder (URL: http://www.interologfinder.org) to provide access to these data.

Fig. 6. The interolog method. A and B are interacting proteins in worm, and A' and B' are the human homologs of A and B. A' and B' are therefore predicted to interact in a similar way in human.
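The transfer of interactions between organisms can be sketched in a few lines of Python. The sketch assumes a precomputed homolog map that records, for each source protein, its homologs in the target organism together with a sequence identity; the `min_identity` cut-off of 0.6, the function name, and all identifiers are hypothetical choices for illustration, not values recommended by the studies cited above.

```python
def predict_interologs(source_interactions, homologs, min_identity=0.6):
    """Transfer interactions from a source organism to a target organism.

    source_interactions: iterable of (protein_a, protein_b) pairs.
    homologs: {source protein: [(target protein, sequence identity 0..1), ...]}.
    Only homologs at or above min_identity are used for the transfer.
    """
    predicted = set()
    for a, b in source_interactions:
        targets_a = [t for t, ident in homologs.get(a, ()) if ident >= min_identity]
        targets_b = [t for t, ident in homologs.get(b, ()) if ident >= min_identity]
        for ta in targets_a:
            for tb in targets_b:
                # Store each predicted pair once, independent of order.
                predicted.add(tuple(sorted((ta, tb))))
    return predicted


# Toy data (hypothetical): yeast interactions transferred to worm.
yeast_interactions = [("yA", "yB"), ("yA", "yC")]
yeast_to_worm = {
    "yA": [("wA", 0.82)],
    "yB": [("wB", 0.75)],
    "yC": [("wC", 0.31)],  # too divergent, so yA-yC is not transferred
}
print(predict_interologs(yeast_interactions, yeast_to_worm))  # {('wA', 'wB')}
```

The identity filter reflects the caution expressed by Mika and Rost (2006): lowering `min_identity` increases coverage at the expense of reliability.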

#### **3.2.3 Integrative approaches**

Currently, high-confidence PIN data sets are limited; however, they still provide a framework onto which other types of biological information can be integrated. Thus, new approaches that integrate several types of data, including protein-protein interactions, text mining, homology-based, and functional genomics approaches (Lee *et al.,* 2004, Chua *et al.,* 2007, Lee *et al.,* 2008a, Pena-Castillo *et al.,* 2008, Linghu *et al.,* 2009, Lee *et al.,* 2010, Wu *et al.,* 2010, Lee *et al.,* 2011, Szklarczyk *et al.,* 2011), have proven to be the most effective way to assign function to uncharacterized proteins that are components of the network (Fig. 7).

Fig. 7. General scheme for integrative approaches. N1, N2, N3 and N4 are networks representing four data sources. Each node is a protein, while each edge is a binary relationship. The edges are re-weighted onto a common scale that is consistent across the different data sources. N1, N2, N3 and N4 are then combined and re-scored to form the final high-confidence network N'.

The most representative example of these approaches is STRING, which integrates experimental as well as predicted interaction information, mostly from the aforementioned methods. STRING provides easy access to this integrated information (URL: http://string-db.org). Moreover, for each protein-protein interaction it provides a confidence score and supplementary information such as protein domains and 3D structures, all within a stable and consistent identifier space. Version 9.0 of STRING includes information on more than 1,100 completely sequenced organisms, ranging from bacteria and archaea to humans, and its interaction prediction algorithms are re-run periodically to update the data as new genome sequence information becomes available (Szklarczyk *et al.*, 2011).
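The combine-and-re-score step of Fig. 7 can be sketched as follows. The sketch assumes that each source network has already been re-weighted so that its edge scores are probabilities on a common 0 to 1 scale, and it combines them with a simple noisy-OR rule, 1 - ∏(1 - s_i), which treats the evidence sources as independent. This rule, the `min_confidence` threshold, and the toy networks are illustrative assumptions; the exact scoring scheme of any particular resource may differ, for example by correcting for a prior probability of interaction.

```python
from math import prod


def combine_scores(scores):
    """Noisy-OR combination of independent evidence scores in [0, 1]."""
    return 1.0 - prod(1.0 - s for s in scores)


def integrate_networks(networks, min_confidence=0.7):
    """Merge weighted networks into one high-confidence network.

    networks: list of {(protein_a, protein_b): score} dictionaries.
    Returns {(protein_a, protein_b): combined score} for edges that reach
    min_confidence after combination.
    """
    evidence = {}
    for network in networks:
        for (a, b), score in network.items():
            # Treat edges as undirected: (a, b) and (b, a) are the same edge.
            evidence.setdefault(tuple(sorted((a, b))), []).append(score)
    combined = {edge: combine_scores(s) for edge, s in evidence.items()}
    return {edge: c for edge, c in combined.items() if c >= min_confidence}


# Toy data (hypothetical): three evidence networks on a common scale.
n1 = {("p1", "p2"): 0.5, ("p2", "p3"): 0.3}
n2 = {("p2", "p1"): 0.6}
n3 = {("p1", "p2"): 0.2, ("p3", "p4"): 0.4}
final = integrate_networks([n1, n2, n3])
print({edge: round(score, 2) for edge, score in final.items()})  # {('p1', 'p2'): 0.84}
```

The edge p1-p2 is only moderately supported in each individual network, yet it passes the threshold once the three sources are combined, which is the main motivation for integration.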

Similarly, several groups have integrated multiple networks to predict protein functions, interactions and functional modules, including data from multiple sources that range from co-expression patterns and sequence similarity to genomic context-based methods (Kemmeren *et al.,* 2002, Jansen *et al.,* 2003, Lee *et al.,* 2004, Lu *et al.,* 2005, Chua *et al.,* 2007, Lee *et al.,* 2008a, Pena-Castillo *et al.,* 2008, Linghu *et al.,* 2009, Lee *et al.,* 2010, Wu *et al.,* 2010, Lee *et al.,* 2011). For example, Marcotte's group has shown the predictive power of an integrated functional network for *C. elegans* (Lee *et al.*, 2008a). First, they computationally built an integrated functional network covering approximately 82% of *C. elegans* genes. Second, they used this network to predict the effects of perturbing individual genes on the organism's phenotype, identifying genes causing specific phenotypes ranging from cell cycle defects in single embryonic cells to life-span alterations, neuronal defects, and altered patterning of specific tissues. They selected a set of candidate genes, together with their interactions, associated with a phenotype and used RNAi to test whether targeting these candidate genes suppressed that phenotype. They found that 20% of such interactions suppressed the studied phenotype, whereas in a large-scale RNAi screen without network guidance, inactivation of only 0.9% of genes produced this effect. Therefore, predictions arising from interactions in the integrated network are 21-fold better than those expected by chance. They suggested a network-guided scheme to accelerate research by using screening methods to identify genes and interactions for pathways of interest in human diseases.

The main limitation of integrative approaches is related to the availability of functional association data for genes and proteins. These methods cannot make extensive predictions if no associations are available, as in the case of a novel genome with no sequence or domain homology to known sequences, of poorly studied genomes, or of a lack of functional genomics studies.

#### **4. PINs as a tool to prioritize drug targets of neglected-disease pathogens**

#### **4.1 Drug targets prioritization**

Despite the advent of high-throughput techniques sparked by the genomics revolution, the discovery and development of new drugs for neglected-disease pathogens has lagged in recent years due to serious problems such as high cost, poor compliance, low efficacy, poor safety, and the evolution of antibiotic resistance, among others (Schmid 1998).

Target identification is the first step in the drug discovery process, and it can provide the foundation for years of dedicated research in the pharmaceutical industry (Read *et al*., 2001). Compared with the other steps in drug discovery, this stage is complicated by the fact that the identified drug target must satisfy a variety of criteria to permit progression to the next step. For example, the target should be: i) selectively present in the pathogen, i.e. target-coding genes that are conserved across different pathogens and have no human homologs represent attractive candidates for new broad-spectrum drugs (Schmid 2006); ii) relevant to the pathogenesis process (Galperin and Koonin 1999, Sakharkar *et al.,* 2004); iii) essential to the pathogen's growth and survival (Koonin *et al.,* 1998, Thanassi *et al.,* 2002, Galperin and Koonin 2004); iv) suitable for expression and assayability; and v) supported by available structures or models to initiate rational drug design (Aguero *et al.*, 2008). Hence, the integrated use of the above-mentioned criteria is considered the basic scheme in drug target prioritization approaches. The values of these criteria can be found by querying publicly available bioinformatics resources and databases, for example metabolic pathway databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Ogata *et al.,* 1999, Kanehisa and Goto 2000), protein classification sets such as Clusters of Orthologous Groups (COGs) and Gene Ontology (GO), and resources to evaluate the "druggability" of proteins (Hopkins and Groom 2002, Russ and Lampel 2005, Hambly *et al.,* 2006), like the "Structure-based DrugEBIlity" online service at EBI (URL: https://www.ebi.ac.uk/chembl/drugebility/structure).

For drug targets of neglected-disease pathogens, the TDR Targets Database (URL: http://tdrtargets.org) is an extensive resource for neglected tropical diseases (Aguero *et al.*, 2008). This database includes extensive genetic, biochemical, and pharmacological data related to tropical disease pathogens, as well as computationally predicted druggability for potential targets. The database contains data on the tuberculosis pathogen *M. tuberculosis*; the leprosy pathogen *M. leprae*; the malaria parasites *Plasmodium falciparum* and *P. vivax*; the toxoplasmosis parasite *Toxoplasma gondii*; the trematode *Schistosoma mansoni*; the filariasis helminth *Brugia malayi* and its intracellular symbiont bacterium *Wolbachia*; and the kinetoplastid parasites *Leishmania major*, *Trypanosoma brucei,* and *T. cruzi*, which are responsible for kala-azar and other forms of leishmaniasis, sleeping sickness, and Chagas disease, respectively.
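One simple way to combine such criteria is a weighted score per candidate, sketched below in Python. The `Target` fields correspond loosely to the criteria listed above, but the field names, the weights, and the candidate proteins are entirely hypothetical; they do not reproduce the scoring of the TDR Targets Database or of any other resource.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str
    essential: bool        # required for pathogen growth and survival
    human_homolog: bool    # a close homolog exists in the human host
    druggable: bool        # predicted to bind drug-like compounds
    has_structure: bool    # a 3D structure or model is available


# Hypothetical weights: essentiality and selectivity count most.
WEIGHTS = {"essential": 3, "no_human_homolog": 3, "druggable": 2, "has_structure": 1}


def score(target, weights=WEIGHTS):
    """Weighted sum of the criteria satisfied by a candidate target."""
    return (
        weights["essential"] * target.essential
        + weights["no_human_homolog"] * (not target.human_homolog)
        + weights["druggable"] * target.druggable
        + weights["has_structure"] * target.has_structure
    )


def prioritize(targets, weights=WEIGHTS):
    """Candidates sorted from highest to lowest score."""
    return sorted(targets, key=lambda t: score(t, weights), reverse=True)


# Toy data (hypothetical candidates).
candidates = [
    Target("T2", essential=True, human_homolog=True, druggable=True, has_structure=True),
    Target("T3", essential=False, human_homolog=False, druggable=False, has_structure=True),
    Target("T1", essential=True, human_homolog=False, druggable=True, has_structure=False),
]
for t in prioritize(candidates):
    print(t.name, score(t))
# T1 8
# T2 6
# T3 4
```

Changing the weights changes the ranking, so in practice they should reflect the goals of the project, for instance by giving more weight to the absence of human homologs when host toxicity is the main concern.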

#### **4.2 PINs, drug targets, and neglected-disease pathogens**

Network analysis is a broadly applicable tool for the drug discovery and development process. Any type of association data linking one gene to another gene, a protein or a compound can be modeled, visualized and analyzed as a network (Lee *et al.,* 2004, Chua *et al.,* 2007, Lee *et al.,* 2008a, Linghu *et al.,* 2009, Lee *et al.,* 2010, McGary *et al.,* 2010, Wu *et al.,* 2010, Lee *et al.,* 2011). Data from pre-clinical and clinical trial studies can also be included in network analyses (Nikolsky *et al.*, 2005). Thus, networks could become the standard for data integration and analysis. Network analysis involving neglected-disease pathogens is a very young area of research. Moreover, despite the availability of experimentally determined PINs of model organisms such as *S. cerevisiae*, *C. elegans*, and *D. melanogaster*, and of some bacterial pathogens like *H. pylori, C. jejuni,* and *Treponema pallidum,* the number of experimentally determined PINs of neglected-disease pathogens is limited. For example, LaCount *et al*. (2005) identified protein-protein interactions of *P. falciparum* through a high-throughput version of the yeast two-hybrid system (LaCount *et al*., 2005). They found 2,846 unique interactions in more than 32,000 *P. falciparum* protein fragments. In order to determine clusters of interacting proteins, they used computational methods such as analysis of network connectivity, gene co-expression, and enrichment of Gene Ontology terms. The result of the network analysis was the identification of two protein clusters, one of which is related to chromatin modification, transcription, messenger RNA stability, and ubiquitination and the other


disease, respectively.

**4.2 PINs, drug targets, and neglected-disease pathogens** 

implicated in the invasion of host cells. They suggested that the information provided by this network may be relevant to understand the basic biology of the parasite and to discover new drug and vaccine targets. Wang *et al*., (2010) built a PIN of the *M. tuberculosis*  H37Rv strain based on a high-throughput bacterial two-hybrid method. They found more than 8,000 novel interactions and performed a cross-species PINs comparison, showing 94 conserved sub-networks between *M. tuberculosis* and several prokaryotic PINs (Wang *et al*., 2010).

Additionally, despite the lack of data, several computational studies that aim to predict PINs of neglected-disease pathogens and prioritize drug targets have been performed. Florez *et al.* (2010) built an *in silico* PIN of *L. major* by combining information from the PSIMAP, PEIMAP, and iPfam databases and using the interologs method (Florez *et al.*, 2010). They predicted 33,861 interactions for 1,366 proteins and analyzed the PIN by calculating topology parameters such as connectivity and betweenness centrality, detecting 142 potential and specific drug targets without human orthologs (Fig. 8). Pedamallu and Posfai (2010) developed a simple open-source package (OpenPPI\_predictor) to predict putative PINs for target genomes (Pedamallu and Posfai 2010). The package is based on the interologs method and uses experimental data from a related organism. They tested OpenPPI\_predictor by inferring a PIN for *B. malayi* using experimental PIN data from *C. elegans*. They identified 118 and 143 clusters in the *B. malayi* and *C. elegans* interactomes,

Fig. 8. Predicted PIN of *Leishmania major* by Florez *et al*., (2010). Red nodes represent predicted essential proteins without human orthologs.

respectively, and found that the highly connected region contains 363 and 340 proteins in the *B. malayi* and *C. elegans* PINs. They suggest that the core cellular functions of the two related organisms have similar complexity and that further analysis of these highly connected regions may provide clues about genes missing from a conserved pathway, or proteins missing from a complex.
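The interologs method mentioned above transfers an interaction observed in a template organism to a target organism whenever both interacting proteins have orthologs in the target. The following is a minimal sketch of that idea only; it is not the implementation used by Florez *et al.* or by OpenPPI\_predictor. The function name `predict_interologs`, the protein identifiers, and the toy ortholog table are all hypothetical, and a real pipeline would derive orthologs from sequence comparison and would usually score each prediction.

```python
from itertools import product


def predict_interologs(template_interactions, orthologs):
    """Transfer interactions from a template organism to a target organism.

    template_interactions: iterable of (protein_a, protein_b) pairs observed
        experimentally in the template organism.
    orthologs: dict mapping a template protein ID to the set of target
        protein IDs considered its orthologs (hypothetical input format).
    Returns a set of frozensets, each holding one predicted target pair.
    """
    predicted = set()
    for a, b in template_interactions:
        # Every ortholog of a is paired with every ortholog of b.
        for ta, tb in product(orthologs.get(a, ()), orthologs.get(b, ())):
            if ta != tb:  # self-interactions are skipped for simplicity
                predicted.add(frozenset((ta, tb)))
    return predicted


# Toy data (invented IDs): three template interactions, one protein (ce4)
# without any ortholog in the target genome.
template = [("ce1", "ce2"), ("ce2", "ce3"), ("ce3", "ce4")]
orthologs = {"ce1": {"bm1"}, "ce2": {"bm2", "bm2b"}, "ce3": {"bm3"}}

predicted = predict_interologs(template, orthologs)
for pair in sorted(tuple(sorted(p)) for p in predicted):
    print(pair)
```

Interactions involving a template protein with no ortholog in the target are simply dropped, which is one reason predicted PINs cover only part of the target proteome.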

Similarly, computational studies have been developed to model PINs between hosts and neglected-disease pathogens. For example, Dyer *et al.* (2007) integrated public intra-species PIN datasets with protein–domain profiles to predict a Human–*P. falciparum* PIN. They found 516 protein interactions between these two organisms, and showed that *Plasmodium* proteins interacting with human proteins are co-expressed in DNA microarray datasets associated with developmental stages of the *Plasmodium* life cycle (Dyer *et al.*, 2007). Dyer *et al.* (2008) analyzed the landscape of human proteins interacting with pathogens. They integrated human–pathogen PINs for 190 pathogen strains from seven public databases and found that both viral and bacterial pathogens tend to interact with proteins that have many interacting partners (hubs) and with those that are central to many paths (bottlenecks) in the human PIN (Dyer *et al*., 2008). Similar results were obtained by Navratil *et al.* (2011). They used a high-quality, manually curated and validated dataset of virus-host protein interactions to depict the "human infectome" (Navratil *et al.*, 2011). Additionally, they showed, by using functional genomic RNAi data, that the high centrality of targeted proteins was correlated with their essentiality for the viral life cycle. They also performed a simulation of cellular network perturbations and showed a stealth attack of viruses on proteins bridging cellular functions, a property that could be essential in the molecular etiology of some human diseases (Fig. 9). Doolittle and Gomez (2011) have predicted interactions between dengue

Fig. 9. The human infectome by Navratil *et al*., (2011).

virus (DENV) and its hosts, both human and the insect vector *Aedes aegypti*. They implemented a protocol based on structural similarity between DENV and host proteins, and supported a subset of the predictions by mining the literature. After filtering based on shared Gene Ontology cellular component terms, they predicted over 2,000 interactions between DENV and humans, as well as 18 interactions between DENV and the *A. aegypti* vector (Doolittle and Gomez 2011). They suggested that those specific interactions between virus and host proteins are involved in interferon signaling, transcriptional regulation, stress, and the unfolded protein response.
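The hub and bottleneck notions used in the studies above correspond to two standard topology parameters: connectivity (degree) and betweenness centrality. As an illustration only, the sketch below computes both on a small invented network using plain Python, with betweenness obtained through Brandes' algorithm for unweighted, undirected graphs. The function names and the toy edge list are assumptions made for this example; they do not reproduce the data or the software of any cited study.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (a, b) interaction pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm)."""
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        stack = []                       # nodes in order of discovery
        preds = {v: [] for v in graph}   # predecessors on shortest paths
        sigma = dict.fromkeys(graph, 0)  # number of shortest paths from s
        dist = dict.fromkeys(graph, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                     # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while stack:                     # accumulate path dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected path was counted once from either endpoint.
    return {v: c / 2 for v, c in score.items()}


# Toy network (invented): two triangles joined through the bridge protein d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("f", "g"), ("e", "g")]
graph = build_graph(edges)
degree = {v: len(nbrs) for v, nbrs in graph.items()}
bc = betweenness(graph)

for v in sorted(graph, key=lambda n: -bc[n]):
    print(v, degree[v], bc[v])
```

In this toy network the bridge protein `d` has the highest betweenness even though its degree is low, which is the kind of bottleneck that a ranking by degree alone would miss.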

The most relevant outcome of such computational studies is the identification of human and pathogen proteins to target experimentally for developing new drugs. These studies also provide different roadmaps and emerging approaches for developing projects to model and analyze PINs of neglected-disease pathogens. For example, novel therapies for human diseases employ multi-target drugs (Borisy *et al.,* 2003, Csermely *et al.,* 2005) and compounds targeted to inhibit protein-protein interactions (Emerson *et al.,* 2003, Klein and Vassilev 2004, Vassilev 2004, Vassilev *et al.,* 2004).

#### **5. Conclusions**


Owing to the development of massive analysis technologies in genomics and computational biology, a trend toward the interplay and integration of computational and experimental techniques can be outlined. Thus, methods and resources that combine both approaches to identify protein interactions will be used as routine protocols in the future.

Even though the use of network biology approaches in drug discovery is in its initial stages, these approaches have already contributed to meaningful drug development decisions by accelerating hypothesis-driven biology, modeling specific physiologic problems in target validation or clinical physiology, and providing rapid characterization and interpretation of disease-relevant cell systems.

Despite the lack of experimental functional genomics and PIN data for neglected-disease pathogens, computational approaches represent a starting point and a complement to current high-throughput screening projects whose aim is to delineate the complete genomes of neglected-disease pathogens. Moreover, integrative computational approaches have been shown to be a powerful guide for large-scale studies, improving and facilitating the rational identification of therapeutic targets.

It is clear that for those organisms whose genome has not been sequenced yet, it will be difficult to implement the aforementioned protocols. That is the case for some nematodes and trypanosomal parasites such as *T. cruzi*, *S. mansoni*, *B. malayi*, and *O. volvulus*, and for the soil-transmitted helminths (e.g., *A. lumbricoides* and *T. trichiura*). However, according to NCBI Entrez Genome (URL: http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi; Sep 29, 2011), most of them are at the "assembly" stage. Once the genome of a neglected-disease pathogen is available, the experimental PINs of model organisms such as *C. elegans* can be used to model and predict the PINs of such pathogens, enabling the discovery of the hub and bottleneck proteins that modulate the infectious process and their prioritization as drug targets.

While the computational approaches analyzed here are by nature probabilistic, i.e., they offer the likelihood of association of a given pair of proteins, they nevertheless clearly indicate the utility of inferring functionally relevant correlations from the available genomic databases for systematic drug target identification. Further improvement of computational approaches will help to increase the availability of systematically collected biologic data and will provide an easy schema for the integration of different types of data within network analysis, thus enhancing the role of such approaches in drug discovery.

Finally, comprehensive repositories of functional genomic data for neglected-disease pathogens will be created. Hence, as soon as large molecular datasets are processed with the help of network analysis, a growing set of predicted pathways and PINs will emerge and will offer a new paradigm for re-thinking about how to revolutionize the drug discovery process.

#### **6. Acknowledgments**

The authors thank BioMed Central for allowing the reproduction of Figures 8 and 9 (Florez *et al.,* 2010, Navratil *et al.,* 2011). Mario A. Rodríguez-Pérez and Xianwu Guo hold scholarships from the Comisión de Operación y Fomento de Actividades Académicas (COFAA)/IPN.

#### **7. References**


Abagyan, R. A., and S. Batalov. (1997). Do aligned sequences share the same fold? *J Mol Biol* 273: 355-368: Oct 17.

Aguero, F., B. Al-Lazikani, M. Aslett, M. Berriman, F. S. Buckner, R. K. Campbell, S. Carmona, I. M. Carruthers, A. W. Chan, F. Chen, G. J. Crowther, M. A. Doyle, C. Hertz-Fowler, A. L. Hopkins, G. McAllister, S. Nwaka, J. P. Overington, A. Pain, G. V. Paolini, U. Pieper, S. A. Ralph, A. Riechers, D. S. Roos, A. Sali, D. Shanmugam, T. Suzuki, W. C. Van Voorhis, and C. L. Verlinde. (2008). Genomic-scale prioritization of drug targets: the TDR Targets database. *Nat Rev Drug Discov* 7: 900-907: Nov.

Albert, R. (2005). Scale-free networks in cell biology. *J Cell Sci* 118: 4947-4957: Nov 1.

Albert, R., and A. L. Barabási. (2002). Statistical mechanics of complex networks. *Rev. Modern Phys* 1: 30.

Albert, R., H. Jeong, and A. L. Barabasi. (2000). Error and attack tolerance of complex networks. *Nature* 406: 378-382: Jul 27.

Alon, U. (2007). Network motifs: theory and experimental approaches. *Nat Rev Genet* 8: 450-461: Jun.

Alon, U., M. G. Surette, N. Barkai, and S. Leibler. (1999). Robustness in bacterial chemotaxis. *Nature* 397: 168-171: Jan 14.

Amaral, L. A., A. Scala, M. Barthelemy, and H. E. Stanley. (2000). Classes of small-world networks. *Proc Natl Acad Sci U S A* 97: 11149-11152: Oct 10.

Barabasi, A. L., and R. Albert. (1999). Emergence of scaling in random networks. *Science* 286: 509-512: Oct 15.

Barabasi, A. L., and Z. N. Oltvai. (2004). Network biology: understanding the cell's functional organization. *Nat Rev Genet* 5: 101-113: Feb.


Enright, A. J., and C. A. Ouzounis. (2001). Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions. *Genome Biol* 2: RESEARCH0034.

Enright, A. J., S. Van Dongen, and C. A. Ouzounis. (2002). An efficient algorithm for large-scale detection of protein families. *Nucleic Acids Res* 30: 1575-1584: Apr 1.

Erdös, P., and A. Rényi. (1960). On the evolution of random graphs. *Publ. Math. Inst. Hung. Acad. Sci*: 4.

Fields, S., and O. Song. (1989). A novel genetic system to detect protein-protein interactions. *Nature* 340: 245-246: Jul 20.

Fitzpatrick, D. A., P. O'Gaora, K. P. Byrne, and G. Butler. (2010). Analysis of gene evolution and metabolic pathways using the Candida Gene Order Browser. *BMC Genomics* 11: 290.

Florez, A. F., D. Park, J. Bhak, B. C. Kim, A. Kuchinsky, J. H. Morris, J. Espinosa, and C. Muskus. (2010). Protein network prediction and topological analysis in Leishmania major as a tool for drug target selection. *BMC Bioinformatics* 11: 484.

Freeman, L. C. (1977). Set of measures of centrality based on betweenness. *Sociometry* 40: 7.

Galperin, M. Y., and E. V. Koonin. (1999). Searching for drug targets in microbial genomes. *Curr Opin Biotechnol* 10: 571-578: Dec.

Galperin, M. Y., and E. V. Koonin. (2000). Who's your neighbor? New computational approaches for functional genomics. *Nat Biotechnol* 18: 609-613: Jun.

Galperin, M. Y., and E. V. Koonin. (2004). 'Conserved hypothetical' proteins: prioritization of targets for experimental study. *Nucleic Acids Res* 32: 5452-5463.

Gandhi, T. K., J. Zhong, S. Mathivanan, L. Karthick, K. N. Chandrika, S. S. Mohan, S. Sharma, S. Pinkert, S. Nagaraju, B. Periaswamy, G. Mishra, K. Nandakumar, B. Shen, N. Deshpande, R. Nayak, M. Sarker, J. D. Boeke, G. Parmigiani, J. Schultz, J. S. Bader, and A. Pandey. (2006). Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. *Nat Genet* 38: 285-293: Mar.

Gavin, A. C., P. Aloy, P. Grandi, R. Krause, M. Boesche, M. Marzioch, C. Rau, L. J. Jensen, S. Bastuck, B. Dumpelfeld, A. Edelmann, M. A. Heurtier, V. Hoffman, C. Hoefert, K. Klein, M. Hudak, A. M. Michon, M. Schelder, M. Schirle, M. Remor, T. Rudi, S. Hooper, A. Bauer, T. Bouwmeester, G. Casari, G. Drewes, G. Neubauer, J. M. Rick, B. Kuster, P. Bork, R. B. Russell, and G. Superti-Furga. (2006). Proteome survey reveals modularity of the yeast cell machinery. *Nature* 440: 631-636: Mar 30.

Gavin, A. C., M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J. M. Rick, A. M. Michon, C. M. Cruciat, M. Remor, C. Hofert, M. Schelder, M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M. A. Heurtier, R. R. Copley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer, and G. Superti-Furga. (2002). Functional organization of the yeast proteome by systematic analysis of protein complexes. *Nature* 415: 141-147: Jan 10.

Giot, L., J. S. Bader, C. Brouwer, A. Chaudhuri, B. Kuang, Y. Li, Y. L. Hao, C. E. Ooi, B. Godwin, E. Vitols, G. Vijayadamodar, P. Pochart, H. Machineni, M. Welsh, Y. Kong, B. Zerhusen, R. Malcolm, Z. Varrone, A. Collis, M. Minto, S. Burgess, L. McDaniel, E. Stimpson, F. Spriggs, J. Williams, K. Neurath, N. Ioime, M. Agee, E. Voss, K. Furtak, R. Renzulli, N. Aanensen, S. Carrolla, E. Bickelhaupt, Y. Lazovatsky, A. DaSilva, J. Zhong, C. A. Stanyon, R. L. Finley, Jr., K. P. White, M. Braverman, T. Jarvie, S. Gold, M. Leach, J. Knight, R. A. Shimkets, M. P. McKenna, J. Chant, and J. M. Rothberg. (2003). A protein interaction map of Drosophila melanogaster. *Science* 302: 1727-1736: Dec 5.


Kemmeren, P., N. L. van Berkum, J. Vilo, T. Bijma, R. Donders, A. Brazma, and F. C. Holstege. (2002). Protein interaction verification and functional annotation by integrated analysis of genome-scale data. *Mol Cell* 9: 1133-1143: May.

Klein, C., and L. T. Vassilev. (2004). Targeting the p53-MDM2 interaction to treat cancer. *Br J Cancer* 91: 1415-1419: Oct 18.

Kohl, P., and D. Noble. (2009). Systems biology and the virtual physiological human. *Mol Syst Biol* 5: 292.

Koonin, E. V., R. L. Tatusov, and M. Y. Galperin. (1998). Beyond complete genomes: from sequence to structure and function. *Curr Opin Struct Biol* 8: 355-363: Jun.

LaCount, D. J., M. Vignali, R. Chettier, A. Phansalkar, R. Bell, J. R. Hesselberth, L. W. Schoenfeld, I. Ota, S. Sahasrabudhe, C. Kurschner, S. Fields, and R. E. Hughes. (2005). A protein interaction network of the malaria parasite Plasmodium falciparum. *Nature* 438: 103-107: Nov 3.

Lee, I., S. V. Date, A. T. Adai, and E. M. Marcotte. (2004). A probabilistic functional network of yeast genes. *Science* 306: 1555-1558: Nov 26.

Lee, I., U. M. Blom, P. I. Wang, J. E. Shim, and E. M. Marcotte. (2011). Prioritizing candidate disease genes by network-based boosting of genome-wide association data. *Genome Res* 21: 1109-1121: Jul.

Lee, I., B. Lehner, C. Crombie, W. Wong, A. G. Fraser, and E. M. Marcotte. (2008a). A single gene network accurately predicts phenotypic effects of gene perturbation in Caenorhabditis elegans. *Nat Genet* 40: 181-188: Feb.

Lee, I., B. Lehner, T. Vavouri, J. Shin, A. G. Fraser, and E. M. Marcotte. (2010). Predicting genetic modifier loci using functional gene networks. *Genome Res* 20: 1143-1153: Aug.

Lee, J. H., V. Vittone, E. Diefenbach, A. L. Cunningham, and R. J. Diefenbach. (2008b). Identification of structural protein-protein interactions of herpes simplex virus type 1. *Virology* 378: 347-354: Sep 1.

Lehner, B., J. I. Semple, S. E. Brown, D. Counsell, R. D. Campbell, and C. M. Sanderson. (2004). Analysis of a high-throughput yeast two-hybrid system and its use to predict the function of intracellular proteins encoded within the human MHC class III region. *Genomics* 83: 153-167: Jan.

Li, S., C. M. Armstrong, N. Bertin, H. Ge, S. Milstein, M. Boxem, P. O. Vidalain, J. D. Han, A. Chesneau, T. Hao, D. S. Goldberg, N. Li, M. Martinez, J. F. Rual, P. Lamesch, L. Xu, M. Tewari, S. L. Wong, L. V. Zhang, G. F. Berriz, L. Jacotot, P. Vaglio, J. Reboul, T. Hirozane-Kishikawa, Q. Li, H. W. Gabel, A. Elewa, B. Baumgartner, D. J. Rose, H. Yu, S. Bosak, R. Sequerra, A. Fraser, S. E. Mango, W. M. Saxton, S. Strome, S. Van Den Heuvel, F. Piano, J. Vandenhaute, C. Sardet, M. Gerstein, L. Doucette-Stamm, K. C. Gunsalus, J. W. Harper, M. E. Cusick, F. P. Roth, D. E. Hill, and M. Vidal. (2004). A map of the interactome network of the metazoan C. elegans. *Science* 303: 540-543: Jan 23.

Linghu, B., E. S. Snitkin, Z. Hu, Y. Xia, and C. Delisi. (2009). Genome-wide prioritization of disease genes and identification of disease-disease associations from an integrated human functional linkage network. *Genome Biol* 10: R91.

Liu, X., and B. Han. (2009). Evolutionary conservation of neighbouring gene pairs in plants. *Gene* 437: 71-79: May 15.

Lu, L. J., Y. Xia, A. Paccanaro, H. Yu, and M. Gerstein. (2005). Assessing the limits of genomic data integration for predicting protein networks. *Genome Res* 15: 945-953: Jul.


Prieto, C., and J. De Las Rivas. (2006). APID: Agile Protein Interaction DataAnalyzer. *Nucleic Acids Res* 34: W298-302: Jul 1.

Ravasz, E., A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi. (2002). Hierarchical organization of modularity in metabolic networks. *Science* 297: 1551-1555: Aug 30.

Read, T. D., S. R. Gill, H. Tettelin, and B. A. Dougherty. (2001). Finding drug targets in microbial genomes. *Drug Discov Today* 6: 887-892: Sep 1.

Rivera, M. C., R. Jain, J. E. Moore, and J. A. Lake. (1998). Genomic evidence for two functionally distinct gene classes. *Proc Natl Acad Sci U S A* 95: 6239-6244: May 26.

Rost, B. (1999). Twilight zone of protein sequence alignments. *Protein Eng* 12: 85-94: Feb.

Rozen, R., N. Sathish, Y. Li, and Y. Yuan. (2008). Virion-wide protein interactions of Kaposi's sarcoma-associated herpesvirus. *J Virol* 82: 4742-4750: May.

Rual, J. F., K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot, N. Li, G. F. Berriz, F. D. Gibbons, M. Dreze, N. Ayivi-Guedehoussou, N. Klitgord, C. Simon, M. Boxem, S. Milstein, J. Rosenberg, D. S. Goldberg, L. V. Zhang, S. L. Wong, G. Franklin, S. Li, J. S. Albala, J. Lim, C. Fraughton, E. Llamosas, S. Cevik, C. Bex, P. Lamesch, R. S. Sikorski, J. Vandenhaute, H. Y. Zoghbi, A. Smolyar, S. Bosak, R. Sequerra, L. Doucette-Stamm, M. E. Cusick, D. E. Hill, F. P. Roth, and M. Vidal. (2005). Towards a proteome-scale map of the human protein-protein interaction network. *Nature* 437: 1173-1178: Oct 20.

Russ, A. P., and S. Lampel. (2005). The druggable genome: an update. *Drug Discov Today* 10: 1607-1610: Dec.

Sakharkar, K. R., M. K. Sakharkar, and V. T. Chow. (2004). A novel genomics approach for the identification of drug targets in pathogens, with special reference to Pseudomonas aeruginosa. *In Silico Biol* 4: 355-360.

Sali, A. (1999). Functional links between proteins. *Nature* 402: 23, 25-26: Nov 4.

Schmid, M. B. (1998). Novel approaches to the discovery of antimicrobial agents. *Curr Opin Chem Biol* 2: 529-534: Aug.

Schmid, M. B. (2006). Crystallizing new approaches for antimicrobial drug discovery. *Biochem Pharmacol* 71: 1048-1056: Mar 30.

Shen-Orr, S. S., R. Milo, S. Mangan, and U. Alon. (2002). Network motifs in the transcriptional regulation network of Escherichia coli. *Nat Genet* 31: 64-68: May.

Shoemaker, B. A., and A. R. Panchenko. (2007). Deciphering protein-protein interactions. Part I. Experimental techniques and databases. *PLoS Comput Biol* 3: e42: Mar 30.

Stelzl, U., U. Worm, M. Lalowski, C. Haenig, F. H. Brembeck, H. Goehler, M. Stroedicke, M. Zenkner, A. Schoenherr, S. Koeppen, J. Timm, S. Mintzlaff, C. Abraham, N. Bock, S. Kietzmann, A. Goedde, E. Toksoz, A. Droege, S. Krobitsch, B. Korn, W. Birchmeier, H. Lehrach, and E. E. Wanker. (2005). A human protein-protein interaction network: a resource for annotating the proteome. *Cell* 122: 957-968: Sep 23.

Stingl, K., K. Schauer, C. Ecobichon, A. Labigne, P. Lenormand, J. C. Rousselle, A. Namane, and H. de Reuse. (2008). In vivo interactome of Helicobacter pylori urease revealed by tandem affinity purification. *Mol Cell Proteomics* 7: 2429-2441: Dec.

Szklarczyk, D., A. Franceschini, M. Kuhn, M. Simonovic, A. Roth, P. Minguez, T. Doerks, M. Stark, J. Muller, P. Bork, L. J. Jensen, and C. von Mering. (2011). The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. *Nucleic Acids Res* 39: D561-568: Jan.

Tamames, J., G. Casari, C. Ouzounis, and A. Valencia. (1997). Conserved clusters of functionally related genes in two bacterial genomes. *J Mol Evol* 44: 66-73: Jan.

