### **2. Selection of drug targets**

An initial step in the drug discovery process is the search for and selection of the drug target. This target is frequently a protein that is essential for the organism's survival or critical for regulating a particular signaling pathway. In the specific case of parasites, inhibition of the protein target should impair or delay parasite viability. The classical approach to finding a new essential protein that can act as a potential target is experimental characterization using gene knockout or knock-down strategies in the target organism. Besides essentiality, some targets are selected for being specific to the pathogen; for example, the ergosterol pathway is present in fungi and *Leishmania spp.*, whereas humans only possess the enzymes required for the synthesis of cholesterol. This is why the pathway has been exploited in the search for drugs against mycotic pathogens and also against *Leishmania*. However, the experimental approach employing RNA interference (RNAi) is not feasible, given that *Leishmania* species do not carry the RNAi machinery (Peacock et al. 2007), with the exception of *Leishmania braziliensis*, in which some RNAi-associated genes have been found. In addition, the essentiality of a particular protein can change dramatically depending on the parasite stage. Given these constraints, a rational alternative for choosing effective targets is a more systematic study of the biology of the parasite, with the aim of uncovering important mechanisms that are not evident from descriptive studies of isolated proteins. A starting point for this "systems view" of parasite biology in the case of *Leishmania* was the sequencing of its genome in 2005 (Ivens et al. 2005). Since then, more high-throughput data have been generated, not at the same rate as for other organisms, but with important applications for drug discovery in tropical diseases.
This leads to an important issue of data analysis, in which computational tools can play a role in narrowing the ocean of possibilities for finding a drug for this disease, making the experimental setup more efficient and less costly. In the following sections, we describe current computational methods that can be applied to find new drug targets, with special application to the *Leishmania* parasite.

#### **2.1 Selection of targets by homology searching**

The genomes of three *Leishmania* species (Peacock et al. 2007) have been sequenced and annotated, and a fourth species, *L. mexicana*, together with some *L. major* strains, is in the process of being sequenced (GeneDB, http://www.genedb.org; Washington University Genome Sequencing Center, http://genome.wustl.edu/gsc/gschmpg.html). The availability of these genomes and their annotated proteins can be used in a rational manner to predict novel drug targets and provide a basis to develop new drugs.

The computational prediction of drugs, in addition to the evaluation of drugs already synthesized and used for other diseases, must be coupled with automated in vitro methodologies for assessing these compounds. In this sense, and in the case of *Leishmania*, GFP (Varela et al. 2009) or luciferase transgenic parasites (Lang et al. 2005), coupled with techniques such as flow cytometry or fluorometry, can be used to rapidly evaluate potential anti-leishmanial drugs. The WHO programme for training in tropical diseases research has created a network based on reporter-gene technology to foster the search for drugs not only against leishmaniasis but also against other diseases with limited therapeutic options.

The simplest approach for finding a drug target is a homology search against known essential proteins. Several organisms have essentiality data available at the genome-wide level (Forsyth et al. 2002; Kamath et al. 2003; Hu et al. 2007). In model organisms such as yeast, the phenotypic effects of deleting particular genes have been characterized (Giaever et al. 2002) and, more recently, genetic interactions have been studied on a large scale (Costanzo et al. 2010), which has helped to elucidate redundancy and possible synergistic effects among genes. It is therefore possible to find orthologs in the organism of interest that could be essential by comparing its sequences against the list of essential genes in model organisms. The Database of Essential Genes (http://tubic.tju.edu.cn/deg/) (Zhang & Lin 2009) provides information on essential genes in prokaryotes and eukaryotes, and it also supports BLAST searches with a protein of interest; this resource is useful for an exploratory assessment of the essentiality of a particular protein. Another important resource for drug-target data is the DrugBank database (http://www.drugbank.ca/) (Knox et al. 2011), which can be used to extract drug-target interactions along with additional pharmacological data. The same strategy can be employed in this case, with the advantage that the homology search will also return possible drug candidates that can be tested on the protein found to be homologous to a target in DrugBank.

This methodology has been applied in *Pseudomonas aeruginosa* (Sakharkar et al. 2004) with the aim of detecting new drug targets, given that this bacterium is an important nosocomial pathogen owing to its rapid development of resistance. In *Leishmania*, drug targets can also be identified by this approach. Tools such as BLAST or PSI-BLAST can be employed, with PSI-BLAST being more sensitive for detecting distant relationships among proteins (Altschul et al. 1997). However, some false positives can still occur, owing to alignments that are optimal according to the algorithm but not biologically meaningful; the E value helps to identify the alignments that are significant. As an example, running a PSI-BLAST search of the *Leishmania major* proteome against the DrugBank database, one finds among the potential *Leishmania* orthologs of known targets the protein *LmjF36.2430*, which is similar to the sterol 14-alpha demethylase of fungi; drugs such as miconazole are known inhibitors of this enzyme. Interestingly, the protein *LmjF19.0450* belongs to a group of protein kinases conserved in other *Leishmania* species; it is constitutively expressed and has significant similarity to kinase targets in cancer. These are simple examples of how a homology search can generate a list of potential drug targets using existing genomic data. The main advantages of this methodology are that it offers a quick overview of potential targets and opportunities for repurposing existing drugs. In addition, the STITCH 2 database (http://stitch.embl.de/) (Kuhn et al. 2010) compiles known and predicted drug-target relationships, together with biological information about the targets, in a network-based view.
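The E-value filtering step described above can be sketched in a few lines. This is a minimal illustration, assuming a tabular (`-outfmt 6`) report from a PSI-BLAST run; all query and subject identifiers in the sample data are invented placeholders, not actual DrugBank entries.

```python
# Sketch: filter tabular (-outfmt 6) PSI-BLAST hits by E-value.
# Hypothetical input lines of the form "query<TAB>subject<TAB>evalue<TAB>bitscore",
# e.g. as produced by:
#   psiblast -query Lmajor_proteome.fasta -db drugbank_targets \
#            -num_iterations 3 -outfmt "6 qseqid sseqid evalue bitscore"

def parse_hits(lines):
    """Yield (query, subject, evalue, bitscore) tuples from outfmt-6 lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed lines
        yield fields[0], fields[1], float(fields[2]), float(fields[3])

def significant_hits(lines, evalue_cutoff=1e-5):
    """Keep only alignments below the E-value threshold, best hit per query."""
    best = {}
    for query, subject, evalue, _bitscore in parse_hits(lines):
        if evalue > evalue_cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best

report = [
    "LmjF36.2430\tDB_target_CYP51\t1e-80\t290.0",   # hypothetical strong hit
    "LmjF19.0450\tDB_target_kinase\t2e-30\t120.0",  # hypothetical strong hit
    "LmjF01.0010\tDB_target_x\t0.5\t20.0",          # not significant
]
print(significant_hits(report))
```

In practice the cutoff would be tuned to the database size, and hits would still require manual inspection for biological plausibility, as noted above.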

Despite its simplicity, the homology search strategy has some caveats. Proteins inside the cell perform specific functions depending on their interactions, and these interactions can vary between species. Even if sequences are highly related, pathway conservation is not necessarily present. In addition, temporal regulation is important, as not all the interactions are active at the same time, which can further complicate the analysis. These problems highlight the importance of detecting targets by incorporating more detailed information about the molecular interactions.

#### **2.2 Selection of targets by topological analysis of protein networks**

In order to better understand complex pathogens such as *Leishmania* and to improve the efficiency of the drug discovery process, it is crucial to gain deeper knowledge about how protein interactions are established and how these interactions are regulated. This is a central issue for a more accurate definition of essentiality and biological robustness. These interactions can be described as a *network*, a representation commonly used to describe
complex systems. The protein interaction network (interactome) describes all possible molecular interactions among proteins. The interactome is composed of *nodes*, which represent the molecular components (in this case proteins), and *edges*, which are the interactions between components (Fig. 1). Depending on the biological function of the nodes, other types of networks can also be constructed: for example, gene networks, in which transcription factors are nodes that regulate other genes by binding them (edges), and metabolic networks, in which the nodes are enzymes connected by the metabolites they produce. The study of networks comes from a mathematical discipline called *graph theory*, and the analysis of the interaction patterns in a network is known as *network topology* (Barabasi & Oltvai 2004).

Fig. 1. Schematic representation of a protein network. Yellow circle corresponds to a hub protein, green circles correspond to bottleneck proteins connecting several sub-networks. Lines connecting circles represent the edges of the network.
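The structure sketched in Fig. 1 can be made concrete with a small edge list. The following is an illustrative sketch with invented node names, in which the hub emerges as the node with the highest degree:

```python
# Toy network mirroring Fig. 1: "HUB" is highly connected inside one module,
# while "BOT" is a low-degree bottleneck joining two sub-networks.
# All node names are invented for illustration.
edges = [
    ("HUB", "A"), ("HUB", "B"), ("HUB", "C"), ("HUB", "D"),  # hub module
    ("HUB", "BOT"),                                          # bridge ...
    ("BOT", "X"),                                            # ... to 2nd module
    ("X", "Y"), ("Y", "Z"), ("Z", "X"),                      # second module
]

def degrees(edges):
    """Node degree = number of interaction partners (edges touching the node)."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

deg = degrees(edges)
hub = max(deg, key=deg.get)
print(hub, deg[hub])  # "HUB" has the highest degree (5 partners)
```

Note that the bottleneck "BOT" has a degree of only 2, so degree alone would not flag it; detecting bottlenecks requires centrality measures, discussed below.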

To detect protein interactions in biological systems, large-scale methods have been developed that can map all possible pairwise interactions. Yeast two-hybrid is a popular technique of this kind, and it was used to construct the first interactome (Uetz et al. 2000). The technique involves fusing one protein, called the *bait*, to the DNA-binding domain of a transcription factor; a second protein, the *prey*, is fused to an activation domain. If the bait and the prey interact, the two transcription-factor subunits are brought together and expression of a reporter gene is activated (Osman 2004). The most important limitation of this method is a high number of false positives; however, recent evidence has shown that combining experimental methods reduces the number of false interactions (Dreze et al. 2010).

The initial studies of the yeast interactome revealed that the network structure was not organized randomly, and in fact the organization pattern was similar to other experimentally-observed networks. This particular network structure was called *scale-free*
and it was elucidated by analyzing the number of interactions (the degree distribution) of proteins in the yeast interactome, which showed that some nodes were much more highly connected than others and that these highly connected nodes occurred at relatively low frequency in the network. In this scale-free structure the node degree follows a *power-law* distribution, which describes the probability of a node having a certain degree. An interesting consequence of a scale-free structure is that the network is robust against random deletion of nodes, but susceptible to the deletion of highly connected nodes or *hubs* (Jeong et al. 2001). Hubs can be detected by measuring the connectivity, or degree, of the nodes in the network. A scale-free network is also susceptible to the deletion of another type of node that is not highly connected but controls the flux of the network; these nodes are called *bottlenecks* (Yu et al. 2007). A classical example of bottleneck nodes is the scaffold proteins (Good et al. 2011), which facilitate communication between signaling pathways very efficiently even though they are sometimes not highly connected. Deleting a bottleneck node disrupts cellular homeostasis by destroying communication between processes in the cell. This network biology approach is an important step towards a systems-level understanding of the biology of parasites like *Leishmania*, and it is very useful for detecting essential nodes that may constitute potential new drug targets.
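The robustness argument can be illustrated on a toy hub-and-bottleneck graph: deleting a random peripheral node leaves the network connected, while deleting the hub or the bottleneck fragments it. This is an illustrative sketch with invented node names, not parasite data.

```python
# Toy hub-and-bottleneck network (invented node names).
edges = [
    ("HUB", "A"), ("HUB", "B"), ("HUB", "C"), ("HUB", "D"),
    ("HUB", "BOT"), ("BOT", "X"), ("X", "Y"), ("Y", "Z"), ("Z", "X"),
]

def n_components(edges, removed=frozenset()):
    """Count connected components left after deleting `removed` nodes (BFS)."""
    nodes = {n for e in edges for n in e} - set(removed)
    adj = {n: set() for n in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    seen, parts = set(), 0
    for start in nodes:
        if start in seen:
            continue
        parts += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adj[node] - seen)
    return parts

print(n_components(edges))                    # intact network: 1 component
print(n_components(edges, removed={"A"}))     # random peripheral deletion: still 1
print(n_components(edges, removed={"HUB"}))   # hub deletion: fragments into 5 parts
print(n_components(edges, removed={"BOT"}))   # bottleneck deletion: splits the modules
```

On real interactomes the same experiment is usually run with a graph library rather than hand-rolled BFS, but the principle — targeted deletion of hubs and bottlenecks is far more damaging than random deletion — is the one described above.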

#### **2.2.1 Construction of the** *Leishmania* **protein interaction network**

The analysis of the *Leishmania* protein network could lead to the discovery of new and effective drug targets. However, current protein interaction data in *Leishmania* cover only a few specific proteins, and at this time no yeast two-hybrid data are available for this organism. Despite this limitation, a computationally predicted protein network built with orthology-based methods is a good first step in the exploration of drug targets, and it may be more informative than a traditional homology search. The next sections describe the current status of the predicted *Leishmania major* interactome and give some directions for future experimental studies for network and target validation.

Even when protein domain sequences are conserved, multiple combinations of these domains enable an organism to rewire its interactome in different ways. This can overcome the problem that the cellular context of a target influences its essentiality, and it enables new hubs or protein targets to be detected. A general disadvantage is the bias towards detecting conserved interactions, which is a caveat for organism-specific interactions that may also be important for survival. These specific interactions will only be detected when more data become available, which will also allow existing predictions to be validated.

In our recent study (Florez et al. 2010), the protein interaction network of *Leishmania major* was predicted using only the parasite protein sequences and several protein interaction databases, in particular iPfam (Finn et al. 2005), PSIMAP (Park et al. 2005) and PEIMAP. These databases include protein-protein interactions defined by analysis of the structures of protein complexes, as well as experimental data extracted from the literature, including high-throughput experiments. From the structures, interacting structural domains were mapped to the sequences using the domain definitions of Pfam (Finn et al. 2006) and SCOP (Hubbard et al. 1997), two databases that systematically classify domains into protein families. In this particular case, the physical distance between adjacent domains within a complex was used as the criterion for defining an interaction, and these interactions are stored in the iPfam and PSIMAP databases. This strategy has been used in other organisms such as fungi and bacteria (He et al. 2008; Kim et al. 2008). Domain interaction analysis generates more diversity in the detection of possible interactions, because the modular exchange of protein domains allows rewiring of the network even when the isolated domain sequences are conserved. However, despite the high accuracy of this method, the prediction of protein interactions is limited by the scarcity of crystallized protein complexes. The PEIMAP database was also used; it includes sequences of protein interaction pairs detected by several methods, including co-immunoprecipitation (co-IP) and yeast two-hybrid.

To construct the *Leishmania major* network, protein sequences were extracted from the GeneDB database, which includes genomic and proteomic information for pathogens, including protozoan parasites. The protein sequences were aligned to the interacting domain pairs using PSI-BLAST against the SCOP 1.71 database with an E-value cutoff of 0.0001, as described previously (Kim et al. 2008). PSI-BLAST was used for the alignments because it has the advantage of detecting short conserved sequences, such as small domains that would otherwise be missed by standard BLASTP. The same strategy was applied for the alignments against the iPfam database; in this case, the domain assignment for the *Leishmania* proteins was carried out using the Pfam database (release 18.0), with the hmmpfam tool employed for the alignments. The final set of interactions was predicted by homology search against the PEIMAP database using BLASTP, with minimum cutoffs of 40% sequence identity and 70% length coverage. The PEIMAP database includes protein-protein interaction (PPI) information from six source databases: DIP (Xenarios et al. 2000), BIND (Bader et al. 2001), IntAct (Hermjakob et al. 2004), MINT (Zanzoni et al. 2002), HPRD (Peri et al. 2004), and BioGrid (Stark et al. 2006).

#### **2.2.2 Filtering interactions by using a combined confidence score**

As discussed earlier, the reliability of this analysis, and its bias towards certain types of protein interactions, depends on the experimental method employed. It is therefore necessary to combine results from different databases to increase the coverage and the confidence of the predicted interactions. For the *Leishmania major* interactome, we used a simple scoring system to identify high-confidence interactions. A previous study classified the experimental methods according to their reliability (Chua et al. 2006), and we used these data, together with the significance of the sequence alignments, to calculate the confidence of the interactions. This scoring system, called the 'combined score' method, was also applied for the confidence calculations in the STRING database (von Mering et al. 2005), a useful resource for searching predicted protein interactions detected by other methods, although those methods are beyond the scope of this chapter.

The score was calculated according to formula (1):

$$score = 1 - \prod_{i \in E} (1 - R_i)^{n_i} \tag{1}$$

where *score* is the confidence value, ranging from 0 to 1 with 1 corresponding to 100% accuracy; *E* is the set of methods under analysis (PEIMAP, PSIMAP, iPfam); *Ri* is the reliability of method *i*; and *ni* is the number of interactions predicted by method *i*. The results of these calculations are pairs of interactions with their respective confidence values, from which the interactions meeting a particular confidence threshold can be selected. In this case, a confidence score of 0.7 was chosen to select the core *Leishmania major* network. The threshold can vary depending on how strongly supported the interactions are required to be; for us, a confidence value of 0.70 gave a smooth fit to the power-law distribution, which was an important condition for reliable detection of hubs and bottlenecks.

#### **2.2.3 Topological analysis of the network**

Topological metrics such as the clustering coefficient and the mean shortest path help to describe global characteristics of the network, as they measure the density of its connections. Densely connected networks are characterized by modular components, which also maintain the robustness of the network against failures. Biological networks tend to have a modular structure (Jeong et al. 2001), and one additional way to test the reliability of the predicted network is to compare its clustering coefficient and mean shortest path with those of randomly generated networks having the same numbers of nodes and edges; these metrics should be statistically different between the predicted and random networks. In the case of the *Leishmania* network, 1,000 random networks were generated and the metrics calculated and compared to the original network.

The power-law fit used to define the scale-free structure can be calculated with the Network Analyzer v.2.6.1 plug-in (Assenov et al. 2008) for the Cytoscape platform (Shannon et al. 2003), which provides a very advanced environment for network visualization and analysis. Network topology metrics such as betweenness centrality and connectivity were calculated using the Hubba server (http://hub.iis.sinica.edu.tw/Hubba) (Lin et al. 2008); a Cytoscape plug-in version of this tool was recently made available. For the calculation of the metrics, the confidence scores of the interactions were used, so that detection could be focused on the nodes most likely to be essential within the group of highly supported interactions. From this analysis, a list of potential targets was selected. However, some of the detected proteins could also be conserved, in sequence and function, across several organisms including humans; this becomes a problem if drugs targeting these proteins interfere with important biological processes in humans, generating unwanted toxic effects. To avoid this, an additional filter was applied to the list of predicted targets: the *Leishmania* proteins were aligned against the human proteins, and proteins conserved between these two species were excluded.

#### **2.2.4 Prediction of protein function from network clusters**

An important feature of network analysis is the prediction of protein function. The usual procedure for inferring function involves a homology search of the unknown protein against a curated protein database such as UniProt (http://www.uniprot.org/). On some occasions this is not feasible, because no significant similarity can be found; when the homology approach fails, protein interaction network analysis can help to uncover potential functions. Function prediction based on network analysis rests on the assumption, supported by experimental data, that interacting proteins tend to have related functions. This implies that it is possible to predict the function of neighboring nodes by clustering network modules and knowing the function of some of the nodes inside a module. This analysis was carried out over the *Leishmania* network using the Markov Clustering (MCL) algorithm (Enright et al. 2002), which has been shown to be a robust and fast algorithm for detecting clusters or modules in protein networks (Brohee & van Helden 2006), as implemented in the NeAT tools (Brohee et al.).
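As a worked illustration of the combined confidence score in formula (1), the calculation can be sketched as follows. The reliability values below are invented placeholders, not the values derived from Chua et al. (2006):

```python
# Sketch of the combined confidence score, formula (1):
#   score = 1 - prod_{i in E} (1 - R_i)^(n_i)
# R_i = reliability of evidence source i; n_i = number of predictions from
# source i supporting the interaction. Reliabilities are invented placeholders.
RELIABILITY = {"PEIMAP": 0.9, "PSIMAP": 0.7, "iPfam": 0.8}

def combined_score(evidence):
    """`evidence` maps a method name to its number of supporting predictions."""
    remaining_doubt = 1.0
    for method, n in evidence.items():
        remaining_doubt *= (1.0 - RELIABILITY[method]) ** n
    return 1.0 - remaining_doubt

# An interaction supported once by PSIMAP alone scores about 0.7 ...
print(combined_score({"PSIMAP": 1}))
# ... while independent support from several methods pushes it higher:
print(combined_score({"PSIMAP": 1, "iPfam": 1}))  # 1 - 0.3*0.2 = 0.94
```

With this scheme, an interaction predicted by a single low-reliability method falls below the 0.7 core-network threshold, whereas agreement between independent methods rapidly raises the confidence, which matches the intent described above.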
