

**2. Unravelling the relations between evolution and network structure in PPI**

We begin with a summary of those studies that involve the analysis of evolutionary information in a single PPI network. One can divide these works into the following two main groups. The first group studies evolutionary conservation with respect to topological properties of a PPI network. The second one primarily investigates the role of evolution with

The aim of the first group of studies is to describe how the topology of a single PPI network reflects the evolutionary signal present in the proteins it contains. This evolutionary signal is represented by a set of orthologs and is retrieved with respect to a different species. Specifically, given a PPI network of the species under investigation and a set of proteins of a distinct species, those proteins of the network that belong to orthologous pairs or clusters (resulting from a sequence comparison of the proteins of two or of multiple species, respectively) are considered to be the source of the evolutionary, or orthology, signal in the network. Having established the orthology relationship between the proteins of the two or more species, one can estimate the evolutionary rate or distance of the aligned protein sequences (see e.g. Yang & Nielsen, 2000). The higher the rate, the faster the protein is considered to evolve. Consequently, proteins that evolve slowly are well conserved, and little or no change to them can be observed throughout evolution. Other measures of protein evolution have also been considered, such as propensity for gene loss, evolutionary excess retention and protein age (see Table 1).
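As a rough illustration of what an "evolutionary distance of aligned protein sequences" can look like, the sketch below computes the proportion of differing residues (p-distance) between two aligned orthologs and its Poisson-corrected version. This is a deliberately simple proxy and not the codon-based estimation of Yang & Nielsen (2000); the function names and the toy sequences are assumptions made for the example.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing residues over the gap-free aligned columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return differing / compared


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance d = -ln(1 - p), allowing for multiple hits."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        raise ValueError("sequences are saturated; distance is undefined")
    return -math.log(1.0 - p)


# Hypothetical aligned orthologs (toy data, not real proteins).
ortholog_x = "MKTAYIAKQR"
ortholog_y = "MKTAYLAKQR"
distance_xy = poisson_distance(ortholog_x, ortholog_y)
```

A larger distance between orthologs of comparable divergence time would be read as faster evolution, which is the sense in which the text speaks of slowly evolving proteins as well conserved.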



Table 1. Measures of evolutionary signal at protein level

### **2.1 Relation between a single protein in a PPI network and evolution**

Various features of PPI network topology can be investigated with respect to evolutionary information; the first and simplest are measures defined on single nodes of the network. One can associate with a node different topological measures that estimate the relative relevance of the node within the network, here called the *centrality* or *connectivity* of a node.

A basic centrality measure of a node is its degree. The degree of a node is the number of edges incident to the node or, in terms of a PPI network, the number of proteins with which the protein represented by the node interacts. It has been observed that the degree distribution of PPI networks follows a power law, and thus PPI networks fall into the class of scale-free networks (see e.g. Jeong et al., 2001). Scale-free networks have a few highly connected nodes, called hubs, and numerous less connected nodes, most of which interact with only one or two nodes.
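The notions of degree, degree distribution and hub can be made concrete with a minimal sketch. The edge list, the protein identifiers and the degree cut-off used to call a node a hub are all hypothetical choices for illustration; the literature uses various cut-offs.

```python
from collections import Counter, defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def degrees(adj):
    """Degree of a node = number of distinct interaction partners."""
    return {node: len(partners) for node, partners in adj.items()}


def degree_distribution(deg):
    """Fraction of nodes having each observed degree k."""
    counts = Counter(deg.values())
    total = len(deg)
    return {k: c / total for k, c in sorted(counts.items())}


def hubs(deg, min_degree):
    """Nodes whose degree reaches a user-chosen cut-off."""
    return sorted(node for node, d in deg.items() if d >= min_degree)


# Toy network: one highly connected protein and several peripheral ones.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
             ("P1", "P5"), ("P1", "P6"), ("P2", "P3")]
toy_adj = build_adjacency(toy_edges)
toy_deg = degrees(toy_adj)
toy_dist = degree_distribution(toy_deg)
toy_hubs = hubs(toy_deg, min_degree=4)
```

In a real PPI network one would inspect `degree_distribution` on a log-log scale; an approximately straight line is the usual informal indication of power-law behaviour.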

### **2.1.1 Essentiality, centrality and conservation of a protein**

A decade ago, when large-scale data on physical protein interactions were not yet available, researchers mainly focussed on the correlation between the importance of a protein's function for a living cell (essentiality, dispensability) and its evolutionary conservation. The generally accepted premise is that essential genes or proteins should evolve at slower rates


A Survey on Evolutionary Analysis in PPI Networks 431


than non-essential ones (see e.g. Kimura, 1983). Although empirical studies have cast doubt on the validity of this hypothesis (see e.g. Hurst & Smith, 1999; Pal et al., 2003; Rocha & Danchin, 2004), the vast majority of the later evidence favours the existence of a correlation between gene essentiality or dispensability and evolutionary conservation (see e.g. Fang et al., 2005; Fraser et al., 2002; 2003; Hahn & Kern, 2005; Hirsh & Fraser, 2001; 2003; Jordan et al., 2002; Krylov et al., 2003; Ulitsky & Shamir, 2007; Wall et al., 2005; Wang & Zhang, 2009; Waterhouse et al., 2011; Zhang & He, 2005). In particular, as recently stated by Wang & Zhang (2009), the correlation remains weak yet is still sufficient for practical use.

After the growth of protein interaction data, the correlations between essentiality and centrality, and between evolutionary conservation and centrality, also started to be investigated. At first, the *centrality-essentiality relationship* was mostly investigated by examining the degree of a node, and these studies reported the existence of a correlation (see e.g. Fraser et al., 2002; 2003; Hahn & Kern, 2005; Jeong et al., 2001; Krylov et al., 2003). However, Coulomb et al. (2005) found no correlation between essentiality and centrality, where centrality was assessed not only by the degree but also by higher-order centrality measures, namely the average neighbours' degree and the clustering coefficient of a node, suggesting that the centrality-essentiality correlation could be an artefact of the dataset. These findings were later supported by Gandhi et al. (2006), who considered a set of PPI networks and likewise did not observe any significant relationship between the degree of a node and the essentiality of the corresponding protein. Interestingly, Coulomb et al. (2005) did not test other centrality measures such as betweenness and closeness, which showed a higher correlation with essentiality than the simple degree (Hahn & Kern, 2005). Nevertheless, Batada, Hurst & Tyers (2006) reaffirmed the existence of the correlation between node degree and essentiality, taking into account Coulomb et al.'s concerns. However, Yu et al. (2008) again disputed the correlation using a compilation of high-quality yeast PPI data. Results contradicting this work appeared in two consecutive studies by Park & Kim (2009) and Pang, Sheng & Ma (2010). The first study (Park & Kim, 2009) also considered centrality measures other than the degree of a node. As a result, the correlation could be successfully revealed, with the highest correlation observed for measures based on betweenness and closeness, similarly to Hahn & Kern (2005). In the other study (Pang, Sheng & Ma, 2010), a newer, updated yeast PPI dataset was used, and the correlation between the degree of a node and the essentiality of its protein could be detected.
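The higher-order measures named above (clustering coefficient, closeness and betweenness) can be sketched on a toy network as follows. This is a plain standard-library sketch under common textbook definitions, which may differ in normalisation from those used in the cited studies; the network and node names are hypothetical.

```python
from collections import deque


def clustering_coefficient(adj):
    """Fraction of pairs of a node's neighbours that are themselves linked."""
    result = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            result[node] = 0.0
            continue
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        result[node] = 2.0 * links / (k * (k - 1))
    return result


def closeness(adj):
    """(number of reachable nodes) / (sum of shortest-path distances to them)."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


def betweenness(adj):
    """Unnormalised shortest-path betweenness via Brandes' algorithm."""
    score = {n: 0.0 for n in adj}
    for s in adj:
        order, dist = [], {s: 0}
        preds = {n: [] for n in adj}
        sigma = {n: 0 for n in adj}  # number of shortest paths from s
        sigma[s] = 1
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {n: 0.0 for n in adj}
        for w in reversed(order):  # accumulate dependencies back to s
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # In an undirected graph every pair is visited from both endpoints.
    return {n: b / 2.0 for n, b in score.items()}


# Toy network: a chain A-B-C attached to a triangle C-D-E.
toy = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D", "E"},
       "D": {"C", "E"}, "E": {"C", "D"}}
toy_cc = clustering_coefficient(toy)
toy_close = closeness(toy)
toy_btw = betweenness(toy)
```

On this toy network, node C has the highest betweenness and closeness even though its clustering coefficient is low, which illustrates why degree alone and path-based measures can rank proteins differently.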

Although the above works support a connection between the topological position of a node and its functional importance, it seems that one cannot explain this centrality-lethality rule by the degree distribution alone (He & Zhang, 2006; Zotenko et al., 2008). This seems to be in accordance with the analysis conducted in (Lin et al., 2007), showing that protein domain complexity is not the single determinant of protein essentiality, and with the observation that there is a correlation between the number of protein domains and the number of interactions (Schuster-Bockler & Bateman, 2007). In addition, Kafri et al. (2008) showed that highly connected essential proteins tend to have duplicates that can compensate for their deletion, thus decreasing the deleterious effect of their removal, a phenomenon that could possibly explain the finding that genes with no duplicates are more likely to be essential (Giaever et al., 2002). Therefore, higher-order topological features appear to be more appropriate for capturing gene essentiality, especially those based on node betweenness and node closeness (Hahn & Kern, 2005; Park & Kim, 2009; Yu et al., 2007), which are believed to estimate better the local connectivity or centrality of a


node within the network. Moreover, these features are also related to gene expression (Krylov et al., 2003; Pang, Sheng & Ma, 2010; Yu et al., 2007).

We now consider works that analyse the correlation between evolution and centrality. In this case too, the two main features used to estimate the correlation are the degree of a node and the evolutionary rate. At first, it was hypothesized that proteins with a higher degree should evolve more slowly (Fraser et al., 2002). The main criticism of this hypothesis was that the analysis conducted in (Fraser et al., 2002) did not take into account possible bias and noise in data obtained from high-throughput experiments (Bloom & Adami, 2003; Jordan et al., 2003a;b). Nevertheless, Fraser et al. (2003), Fraser & Hirsh (2004) and Lemos et al. (2005) confirmed the existence of such a correlation while taking these objections into account. Kim et al. (2007) also confirmed the interconnection between centrality, essentiality and conservation, and showed that peripheral proteins of the PPI network are under positive selection for species adaptation. Moreover, the link between the connectivity of a node and its evolutionary history was further substantiated by works studying the correlation between node degree and other evolutionary measures such as propensity for gene loss (Krylov et al., 2003), evolutionary excess retention (Wuchty, 2004) and protein age (Ekman et al., 2006; Kunin et al., 2004). However, Batada, Hurst & Tyers (2006) again pointed to a lack of evidence for a significant correlation between the evolutionary rate and the connectivity of a node. Moreover, Makino & Gojobori (2006) classified proteins according to two criteria, the clustering coefficient of a node and the protein's multi-functionality, and showed that multi-functional proteins in sparse parts of the yeast PPI network (with a low clustering coefficient) evolve at the slowest rate regardless of their degree. This suggests that the clustering coefficient is a better descriptor of protein evolution within the global network of protein interactions.
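A correlation of the kind debated above, between node degree and evolutionary rate, is often assessed with a rank-based statistic, since neither variable is normally distributed. The sketch below computes Spearman's rank correlation with average ranks for ties; the choice of Spearman's coefficient and the toy numbers are assumptions for illustration and are not taken from any of the cited studies.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical proteins: interaction degree and evolutionary rate (toy values).
degree = [12, 7, 5, 3, 1]
rate = [0.02, 0.05, 0.04, 0.11, 0.20]
rho = spearman(degree, rate)
```

A negative `rho`, as in this toy example, is what the hypothesis of Fraser et al. (2002) would predict; whether it is significant on real data depends on the dataset, as the following paragraphs discuss.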

A possible explanation for these conflicting results was proposed by Saeed & Deane (2006), who showed that the strength and significance of the correlation between evolution and centrality vary depending on the type of PPI data used. They also found that more accurate datasets demonstrate stronger correlations between connectivity and evolutionary rate than less accurate datasets. Another reason may be the existence of two distinct types of highly connected nodes, so-called *party* and *date hubs*, which appear to satisfy different evolutionary constraints.

### **2.1.2 Evolution of party and date hubs**

Specifically, Han et al. (2004) observed a bimodal distribution of the average Pearson correlation coefficients between the expression profile of a protein and those of its interacting partners. This yielded a classification of hubs into party hubs, which have expression profiles similar to those of their neighbours, and date hubs, which have expression profiles different from those of their neighbours. As a consequence, party hubs tend to interact simultaneously ("permanently") with their partners and to connect proteins within functional modules, while date hubs tend to interact with different partners at different times or locations ("transiently") and to bridge different modules. Thus, one may also refer to party hubs as *intramodule* and to date hubs as *intermodule* (Fraser, 2005).
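The classification idea described above can be sketched as follows: for each hub, average the Pearson correlation between its expression profile and that of each partner, then split hubs at a cut-off. The degree cut-off, the correlation cut-off and all data below are hypothetical; Han et al. (2004) derived their split from the observed bimodal distribution, which this toy example does not attempt to reproduce.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def average_partner_correlation(hub, adj, expression):
    """Mean correlation between a hub and its partners; None if no data."""
    partners = [p for p in adj[hub] if p in expression]
    if hub not in expression or not partners:
        return None
    total = sum(pearson(expression[hub], expression[p]) for p in partners)
    return total / len(partners)


def classify_hubs(adj, expression, min_degree, pcc_cutoff):
    """Label each hub 'party' (high co-expression) or 'date' (low)."""
    labels = {}
    for node, partners in adj.items():
        if len(partners) < min_degree:
            continue  # not a hub under the chosen cut-off
        avg = average_partner_correlation(node, adj, expression)
        if avg is not None:
            labels[node] = "party" if avg >= pcc_cutoff else "date"
    return labels


# Toy network and expression profiles over four conditions (invented values).
toy_adj = {"H1": {"A", "B"}, "H2": {"C", "D"},
           "A": {"H1"}, "B": {"H1"}, "C": {"H2"}, "D": {"H2"}}
toy_expr = {"H1": [1, 2, 3, 4], "A": [2, 4, 6, 8], "B": [1, 2, 3, 5],
            "H2": [1, 2, 3, 4], "C": [4, 3, 2, 1], "D": [1, 3, 2, 4]}
toy_labels = classify_hubs(toy_adj, toy_expr, min_degree=2, pcc_cutoff=0.5)
```

Here `H1` is co-expressed with both partners and is labelled a party hub, whereas `H2` correlates positively with one partner and negatively with the other, so its average falls below the cut-off and it is labelled a date hub.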

Fraser (2005) was the first to investigate the difference in evolution between date and party hubs and found that party hubs are highly evolutionarily constrained, whereas date hubs are


more evolutionarily labile. This is clearly in accordance with the findings of Mintseris & Weng (2005), who argued that residues in the interfaces of permanent protein interactions tend to evolve at a relatively slower rate, allowing them to co-evolve with their interacting partners, in contrast to the plasticity inherent in transient interactions, which leads to an increased rate of substitution for the interface residues and leaves little or no evidence of correlated mutations across the interface. The work of Fraser (2005) was, in addition, later corroborated by Bertin et al. (2007). Examining three-dimensional properties of proteins also supported this hypothesis, as multi-interface hubs were found to be more evolutionarily conserved and essential, as well as more likely to correspond to party hubs (Kim et al., 2006). Defining singlish- and multi-Motif hubs further substantiated these findings, because multi-Motif hubs were found to be more evolutionarily conserved, more essential and to correlate with multi-interface hubs (Aragues et al., 2007). In addition, other features, such as the orderness of regions in protein sequences and the solvent accessibility of the amino acid residues, were shown to differ between party and date hubs and to contribute to lowering the evolutionary rate of party hubs (Kahali et al., 2009). Recently, Mirzarezaee et al. (2010) applied feature selection methods and machine learning techniques to predict party and date hubs based on a set of different biological characteristics including amino acid sequences, domain contents, repeated domains, functional categories, biological processes, cellular compartments, etc.

However, other researchers disputed not only the evolutionary differences between party and date hubs but also the existence of hub types as such (Agarwal et al., 2010; Batada, Reguly, Breitkreutz, Boucher, Breitkreutz, Hurst & Tyers, 2006; Batada et al., 2007). Indeed, some datasets do not exhibit a clear or robust bimodal distribution of hubs' gene co-expression profiles (Agarwal et al., 2010), and in some cases there is even a complete lack of bimodality (Batada, Reguly, Breitkreutz, Boucher, Breitkreutz, Hurst & Tyers, 2006; Batada et al., 2007). Therefore, Pang, Cheng, Xuan, Sheng & Ma (2010) argue that the average Pearson correlation coefficient is a weak measure of whether a protein interacts transiently or permanently with its partners, and they propose a new measure, the co-expressed protein-protein interaction degree. This measure estimates the actual number of partners with which a protein can permanently interact. One can interpret it as a degree of 'protein party-ness', and it offers a more continuum-like estimate of the protein's interaction property. This seems to be in accordance with Nooren & Thornton (2003), who suggest that a continuum exists between distinct types of protein interactions and that their stability depends very much on the physiological conditions and environment.

Pang, Cheng, Xuan, Sheng & Ma (2010) first corroborated the results of Saeed & Deane (2006) on the variation across datasets of the correlation between the connectivity and the evolutionary rate of a protein, and then showed that the co-expression-dependent node degree correlates significantly with the protein's evolutionary rate irrespective of the specific dataset used. However, their topological measure is derived using an external source of experimental data on gene expression. Further investigation of purely topological features of a PPI network that would distinguish transient from permanent interactions, and party from date hubs, could bring more insight into how the evolutionary history of a protein is wired into its position within the network of all the protein interactions in an organism. In this perspective, network path-based measures, such as betweenness and closeness, seem to be promising (Yu et al., 2007). Moreover, these measures also appear to relate to


protein essentiality (Park & Kim, 2009; Yu et al., 2007), which could clarify the link between essentiality and evolution as such. They could then improve the prediction of essential genes from the topology of a PPI network in combination with protein evolutionary information, such as phyletic retention (Gustafson et al., 2006), as already corroborated by several applications of machine learning techniques for essential gene detection, prioritizing drug targets and determining virulence factors (see e.g. Chen & Xu, 2005; Deng et al., 2011; Doyle et al., 2010; Gustafson et al., 2006; McDermott et al., 2009).

### **2.1.3 Node connectivity is relevant for protein evolution**

Since the factors relevant for protein evolution could be manifold (Wolf et al., 2006), it is interesting to investigate whether protein connectivity plays a central or a more subtle role. In the latter case, the link between protein connectivity and evolution could be the result of spurious correlations due to other underlying biological processes (Bloom & Adami, 2003). In order to address this issue, the contribution of protein connectivity to protein evolutionary conservation has also been studied in an integrated way (Pal et al., 2006) using multidimensional methods such as principal component analysis (PCA) and principal component regression (PCR).

The first successful application of PCA was given by Wolf et al. (2006) on seven genome-related variables. The derived first component reflected a gene's 'importance' and confirmed a positive correlation between lethality, expression level and number of protein-protein interactions, which at the same time constrained the measures of protein evolution. Interestingly, the component also showed that the number of paralogs contributes positively to gene essentiality, which contradicts the finding of Giaever et al. (2002) that non-duplicated genes tend to be essential. However, the study of Drummond et al. (2006), using PCR, revealed only a single determinant of protein evolution, namely translational selection, which is almost entirely determined by the gene expression level, protein abundance and codon bias. Later, Plotkin & Fraser (2007) re-examined the use of the PCR method and showed that noise in biological data can confound PCR, leading to spurious conclusions. When they equalized the different amounts of noise across the predictor variables, no single determinant of evolution could be found, indicating that a variety of factors, including expression level, gene dispensability and protein-protein interactions, may independently affect evolutionary rates in yeast. This observation was further substantiated by a recent study (Theis et al., 2011) in which 16 genomic variables were analysed using Bayesian PCA. The study supports the evidence for the three correlations discussed above. It also demonstrates how different definitions of paralogs may lead to different conclusions on their effect on essentiality, thereby commenting on the conflicting result of Wolf et al. (2006).
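For readers unfamiliar with the two methods, the standard textbook formulation of PCA followed by PCR can be summarised as below. The notation is ours and is not taken from the cited studies: $X$ is assumed to be an $n \times p$ matrix of standardised genomic predictors ($n$ genes, $p$ variables such as expression level or number of interactions) and $y$ a centred $n$-vector of evolutionary rates.

```latex
% Assumed notation: X (n x p, standardised columns), y (n x 1, centred),
% v_k the k-th eigenvector of the sample covariance, lambda_k > 0.
\begin{align}
  \frac{1}{n-1}\, X^{\top} X &= V \Lambda V^{\top},
    & \Lambda &= \operatorname{diag}(\lambda_1 \ge \dots \ge \lambda_p), \\
  z_k &= X v_k,
    & z_k^{\top} z_k &= (n-1)\,\lambda_k, \\
  \hat{\beta}_k &= \frac{z_k^{\top} y}{z_k^{\top} z_k},
    & R_k^2 &= \frac{\left(z_k^{\top} y\right)^2}
                    {\left(z_k^{\top} z_k\right)\left(y^{\top} y\right)} .
\end{align}
```

Because the component scores $z_k$ are mutually orthogonal, each coefficient $\hat{\beta}_k$ can be estimated separately and $R_k^2$ is the share of the variance of $y$ attributed to component $k$. Measurement noise in a predictor changes the covariance matrix and hence the eigenvectors $v_k$ and the resulting $R_k^2$ values, which is consistent with the noise sensitivity of PCR reported by Plotkin & Fraser (2007).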

### **2.2 Higher-order structures in a PPI network and evolution**

Researchers have also studied topological structures of a PPI network beyond the single node and their relation to evolutionary conservation. In order of increasing topological complexity, these are a single protein-protein interaction (an edge in the PPI network), topological motifs, and protein clusters or modules as detected by their interaction density or network traffic.

### **2.2.1 Evolution and protein-protein interaction**

Unlike the case of a single protein, for which various well-established methods for measuring sequence evolution exist, to the best of our knowledge only one recent attempt has been made to estimate the evolutionary rate of a protein-protein interaction (Qian et al., 2011). However, this study is limited to a small set of PPIs in yeasts and cannot yet be applied to large-scale studies owing to the lack of data. Research has therefore focused extensively on estimating the correlated evolution of a protein pair and their functional or physical interaction (Pazos & Valencia, 2008).

It is generally assumed that proteins which co-evolve tend to participate together in a common biological function. This hypothesis is supported by many examples of functionally interacting protein families that co-evolve (see e.g. Galperin & Koonin, 2000; Moyle et al., 1994). Co-evolution of proteins may be assessed at the sequence level (*sequence co-evolution*) by correlating evolutionary rates (Clark et al., 2011), or at the gene family level (*gene family evolution*) by correlating occurrence vectors (Kensche et al., 2008). An occurrence vector, or phylogenetic profile (phyletic pattern) (Tatusov et al., 1997), encodes the presence or absence of a protein (or its homologue) within a given set of species of interest (Kensche et al., 2008). In general, the methods for correlating protein evolution have been successfully applied to predict a physical or functional interaction between proteins (Clark et al., 2011; Kensche et al., 2008): sequence co-evolution is powerful in predicting physical interaction, whereas phylogenetic profiling is a good indicator of functional interplay between proteins in a broader sense. Large-scale co-evolutionary maps have also been constructed and analysed to better understand the evolution of a species and its link to protein interactions (see e.g. Cordero et al., 2008; Tillier & Charlebois, 2009; Tuller et al., 2009). All these works suggest that the topology of PPIs should reflect the evolutionary processes behind the proteins that form such a network.
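As an illustration of correlating occurrence vectors, the sketch below computes the Pearson correlation between binary phylogenetic profiles. The profiles, the protein names and the choice of Pearson correlation as the similarity measure are assumptions made for this example. The cited studies use a variety of measures.

```python
import numpy as np

def profile_correlation(p, q):
    """Pearson correlation between two binary phylogenetic profiles."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # A profile that is constant across species carries no co-occurrence
    # signal; returning 0.0 in that case is a convention assumed here.
    if p.std() == 0 or q.std() == 0:
        return 0.0
    return float(np.corrcoef(p, q)[0, 1])

# Hypothetical profiles over eight species (1 = homologue present, 0 = absent).
prot_a = [1, 1, 0, 0, 1, 0, 1, 1]
prot_b = [1, 1, 0, 0, 1, 0, 1, 0]  # differs from prot_a in one species
prot_c = [0, 0, 1, 1, 0, 1, 0, 0]  # exact complement of prot_a

similar = profile_correlation(prot_a, prot_b)
opposite = profile_correlation(prot_a, prot_c)
```

Here `similar` is high because the two proteins are gained and lost together in almost all species, whereas `opposite` is strongly negative. In practice a threshold on such a score (or a more refined, phylogeny-aware measure) would be used to propose functional links.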

The first systematic study of linked genes and their evolutionary rates was done by Williams & Hurst (2000), who showed that the rates of linked genes are more similar than the rates of random pairs of genes. Pazos & Valencia (2001) performed the first successful large-scale prediction of physical PPIs based on sequence co-evolution by correlating phylogenetic trees. Another large-scale study by Kim et al. (2004) on domain structural data of interacting protein families revealed their high co-evolution, but also showed a high diversity in the correlation of rates across family pairs. Specifically, protein families with a greater number of domains were shown to be more likely to co-evolve. However, Hakes et al. (2007) argued that this correlation of evolutionary rates is not responsible for the covariation between functional residues of interacting proteins. Nevertheless, other studies have been able to predict interacting domains from co-evolving residues between domains or proteins (see e.g. Jothi et al., 2006; Yeang & Haussler, 2007), indicating that different organisms use the same 'building blocks' for PPIs and that the functionality of many domain pairs in mediating protein interactions is maintained in evolution (Itzhaki et al., 2006).

Another perspective on co-evolution of interacting partners was given by Mintseris & Weng (2005), who distinguished between transient and obligate interactions. The authors concluded that obligate complexes are likely to co-evolve with their interacting partners, while transient interactions with an increased evolutionary rate show only little evidence for a correlated evolution of the interacting interfaces. This observation was later corroborated by Brown & Jurisica (2007), who analysed the presence of protein interactions across multiple species via orthology mapping and found that the greater the conservation of a protein interaction, the higher the enrichment for stable complexes. Beltrao et al. (2009) also observed, by studying the evolution of interactions involved in phosphoregulation, that stable interactions are more conserved than transient ones. Finally, Zinman et al. (2011) extracted protein modules from an integrated yeast protein interaction network using various sources of PPI evidence, and showed that interactions within modules were much more likely to be conserved than interactions between proteins in different modules.

The preference of conserved protein interactions to be placed in modular parts of a network was also observed by Wuchty et al. (2006), who extended the paradigm of a protein's connectivity and its evolutionary conservation to the connectivity of a protein-protein interaction. Specifically, they used the hypergeometric clustering coefficient to estimate the interaction cohesiveness of a PPI's neighbourhood, and orthologous excess retention to assess the evolutionary conservation of PPIs. They used the same clustering coefficient as that given by the presence of orthologs of interacting proteins in another organism and showed that PPIs with a highly clustered environment were accompanied by an elevated propensity for the corresponding proteins to be evolutionarily conserved as well as preferably co-expressed (Wuchty et al., 2006). These findings are all the more significant because they were shown to be stable under perturbations. This propensity of interacting proteins to be more conserved and prevalent among taxa was later confirmed by Tillier & Charlebois (2009), who used evolutionary distances to estimate a protein's conservation. Yet another perspective on the conservation of PPIs was given by Kim & Marcotte (2008), who classified proteins into four groups (from oldest to youngest) according to their age and found a distinctive interaction density pattern between different protein age groups: interactions tend to be dense within the same age group and sparse between different age groups.
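The text does not spell out the hypergeometric clustering coefficient, so the following sketch assumes one common formulation: the negative logarithm of the hypergeometric probability that two proteins share at least as many neighbours as observed. The function name, the exclusion of the two endpoints from each other's neighbourhoods and the toy network are assumptions for illustration, not necessarily the exact procedure of Wuchty et al. (2006).

```python
from math import comb, log

def hypergeometric_cc(adj, v, w):
    """-log P(at least the observed number of shared neighbours) for the pair (v, w).

    adj maps each protein to the set of its interaction partners.
    """
    n_total = len(adj)
    nv = adj[v] - {w}  # neighbours of v, not counting w itself (assumption)
    nw = adj[w] - {v}
    kv, kw = len(nv), len(nw)
    shared = len(nv & nw)
    # Upper tail of the hypergeometric distribution.
    tail = sum(comb(kv, i) * comb(n_total - kv, kw - i)
               for i in range(shared, min(kv, kw) + 1))
    return -log(tail / comb(n_total, kw))

# Toy network (hypothetical): a 4-clique a-b-c-d attached to a chain d-e-f-g-h.
adj = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"}, "e": {"d", "f"}, "f": {"e", "g"},
    "g": {"f", "h"}, "h": {"g"},
}
dense_edge = hypergeometric_cc(adj, "a", "b")   # endpoints share neighbours c and d
sparse_edge = hypergeometric_cc(adj, "e", "f")  # endpoints share no neighbour
```

An interaction embedded in a cohesive neighbourhood, such as `a`-`b`, receives a clearly positive score, while an interaction whose endpoints share no neighbours, such as `e`-`f`, scores zero.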

### **2.2.2 Evolution and modularity of PPI networks**

All the evidence above, namely that PPIs whose proteins are evolutionarily correlated tend to form stable complexes and to be embedded in cohesive areas of a network topology, supports the premise that the modularity of PPI networks is maintained by evolutionary pressure (Vespignani, 2003). Indeed, when examining networks built solely from sequence co-evolution, gene context analysis or gene family evolution of completely sequenced genomes, one may observe that these networks exhibit high modularity with clusters corresponding to known functional modules, thus revealing the structure of cellular organization (Cordero et al., 2008; Tuller et al., 2009; von Mering et al., 2003).

Regarding the networks of physically interacting proteins, to the best of our knowledge the first direct evidence that evolution drives the modularity of PPI networks was provided by Wuchty et al. (2003). They looked beyond a single protein pair and studied more complex patterns of interacting proteins, called topological motifs. In general, they found that, as the number of nodes in a motif and the number of links among its constituents increase, a stronger conservation of the proteins can be observed. This was corroborated by Vergassola et al. (2005), who focused on specific instances of motifs known as cliques. Cliques are topological patterns where all protein constituents interact with each other. Vergassola et al. (2005) provided evidence for co-operative co-evolution within cliques of interacting proteins. Later, Lee et al. (2006) investigated motifs at a higher resolution level, by defining for each motif different motif modes based on functional attributes of the interacting proteins; again, their findings indicated that motif modes may very well represent the evolutionarily conserved topological units of PPI networks. More recently, Liu et al. (2011) studied network motifs according to the age of their proteins and discovered that the proteins within motifs whose constituents are of the same age class tend to be densely interconnected, to co-evolve and to share the same biological functions. Moreover, these motifs tend to lie within protein complexes.
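The notion of a clique used above can be made concrete with a short sketch. The brute-force enumeration below is only meant to illustrate the definition on a toy network; the function names and the example graph are hypothetical, and realistic PPI networks require dedicated algorithms because this approach scales exponentially with clique size.

```python
from itertools import combinations

def is_clique(adj, nodes):
    """True if every pair of the given proteins interacts."""
    return all(b in adj[a] for a, b in combinations(nodes, 2))

def cliques_of_size(adj, k):
    """All k-node cliques, found by exhaustively testing node subsets."""
    return [set(c) for c in combinations(sorted(adj), k) if is_clique(adj, c)]

# Toy network (hypothetical): a 4-clique a-b-c-d attached to a chain d-e-f.
adj = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"}, "e": {"d", "f"}, "f": {"e"},
}
triangles = cliques_of_size(adj, 3)
four_cliques = cliques_of_size(adj, 4)
```

In this example the four fully connected proteins yield one 4-clique and four triangles, whereas the chain `d`-`e`-`f` contributes none, since `d` and `f` do not interact.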

The finding that the modularity of PPI networks is constrained by evolution and that conserved interactions are enriched in dense motifs and regions of a PPI network also suggests that protein complexes present in such cohesive areas should be evolutionarily driven (Jancura et al., 2012). As putative protein complexes can be extracted from a PPI network by means of clustering techniques, Jancura et al. (2012) detected such protein complexes in the PPI network consisting only of yeast proteins having an ortholog in another organism, and compared them with protein complexes derived either by using the global topology of a yeast PPI network or by using a network induced by randomly selected proteins. The in-depth examination of enriched functions in these three types of protein complexes revealed that evolutionarily driven complexes are functionally well differentiated from the other two types of protein complexes found in the same interaction data. As a consequence, new complexes and protein function predictions could be unravelled from PPI data by using a standard clustering approach with the inclusion of evolutionary information. In addition, evolutionarily driven complexes were found to be differentially conserved: some complexes were detected for all distinct sets of orthologs as determined by comparison with different species, some exhibited only a subset of proteins identifiable in a complex across all species, and some were found only for one specific set of orthologs. This suggests that the presence of evolution in the modularity of PPI networks is more versatile and flexible, with different degrees of conservation.

The findings of Jancura et al. (2012) seem to agree with related studies that focused on the evolutionary cohesiveness of protein functional modules in order to investigate whether a group of functionally interacting proteins co-evolves more cohesively than a random group of proteins. These studies analysed either known protein complexes and pathways (Fokkens & Snel, 2009; Seidl & Schultz, 2009; Snel & Huynen, 2004) or putative protein modules, usually derived from integrated networks of functional link evidence (Campillos et al., 2006; Zhao et al., 2007; Zinman et al., 2011). A different strategy was employed by Yamada et al. (2006), who first detected evolutionary modules and afterwards compared them with enzyme connectivity in a metabolic network.

Although the co-evolution of modules is assessed by the presence or absence of the modules' constituents across a set of species, there is no standard method to measure the degree to which a module evolves cohesively (Fokkens & Snel, 2009). For instance, Snel & Huynen (2004) used the deviation of the number of a module's orthologs per species from the average number of the module's orthologs per species, whereas Campillos et al. (2006) measured the fraction of joint evolutionary events given the reconstructed, most parsimonious evolutionary scenario of the genes in a module over their phylogenetic profiles.

Despite the diversity of these measures, the common conclusion is that the majority of modules evolve flexibly (Campillos et al., 2006; Fokkens & Snel, 2009; Seidl & Schultz, 2009; Snel & Huynen, 2004; Yamada et al., 2006). Also, it appears that curated modules evolve more cohesively than modules derived from high-throughput interaction data (Fokkens & Snel, 2009; Seidl & Schultz, 2009; Snel & Huynen, 2004). Moreover, different functions are enriched to different degrees among co-evolving modules. For example, biochemical pathways, certain metabolic and signalling processes, as well as core functions like transcription and translation, tend to have a higher degree of evolutionary cohesiveness (Campillos et al., 2006; Fokkens & Snel, 2009; Zhao et al., 2007). This is also supported by methods which cluster phylogenetic profiles in order to detect biochemical pathways or to predict functional links, thus exploiting the predictive power of phylogenetic methods (Glazko & Mushegian, 2004; Li et al., 2009; Watanabe et al., 2008). These methods perform relatively well in characterizing biochemical pathways but seem to have limited coverage for physically interacting proteins (Watanabe et al., 2008). Conflicting results were reported on the inter-connectivity of cohesive and flexible modules. Specifically, Fokkens & Snel (2009) demonstrated that components of cohesive modules are less likely to interact with each other than components of flexible modules, while two other studies (Campillos et al., 2006; Zinman et al., 2011) suggest that cohesive modules are more highly connected.

It is possible that the above studies underestimated the actual degree of evolutionary cohesiveness present in the modularity of protein interaction networks, owing to their conservative approach, the limitations of ortholog detection, and cohesiveness measures that are restricted to phylogenetic profiles. Nevertheless, they show that, as evolution is a complex process, its presence in the modularity of protein interaction networks is also very complex in nature, and our understanding of it is far from complete. Indeed, evolution itself can be expected to be asynchronous and heterotachous along the tree of life.

In general, the interim evidence shows different evolutionary pressure for different types of protein interactions and data. In particular, slowly evolving interacting partners are enriched in stable, permanent complexes, and functional modules such as biochemical pathways and curated complexes exhibit higher evolutionary cohesiveness than high-throughput complexes. It seems that the co-evolutionary degree of modules within PPI networks increases with greater integration of various sources of evidence that proteins functionally interact (Zinman et al., 2011). Also, not all protein complexes and functional modules need to be co-evolutionary modules (Fokkens & Snel, 2009). There is a continuum from extremely conserved to rapidly changing modules, where those modules found to be co-evolving appear to be enriched in certain specific functional categories (Campillos et al., 2006). In addition, the degree of conservation and co-evolution of functional modules within interaction networks seems to reflect cellular organization and its spatio-temporal characteristics. For instance, cohesive modules can be classified according to their evolutionary age as ancestral, intermediate and young: ancestral modules are highly conserved and perform essential, core processes such as information storage and metabolism of amino acids, while young modules are less conserved and responsible for communication with the environment (Campillos et al., 2006). Therefore, one might expect ancestral modules to contain static, obligate interactions as the proteins of essential functions tend to involve multiple domains with slow evolutionary

Several algorithmic enhancements of the interolog-based approach have been introduced since the first proposal of a systematic use of interolog inference (Matthews et al., 2001). For instance, Yu et al. (2004) strengthened the definition of ortholog by using a reciprocal best-hit approach and compared it to the original one-way best-hit approach implemented by Matthews et al. (2001). In addition, they required a minimum level of joint similarity of orthologous sequences in order to perform interolog mapping. Their method yielded a 54%


Other approaches exploited the knowledge of a higher conservation rate of PPIs in dense network motifs. For instance, Huang et al. (2007) scored interologs according to the density of the topological pattern containing the respective PPI of the interolog in a model species, as determined by the extraction of maximal quasi-cliques from the PPI network of the model species. This score was integrated with scores of various other features used for PPI prediction, such as tissue specificity, sub-cellular localization, interacting domains and cell-cycle stage. The use of multiple types of features was shown to yield more accurate predictions of PPIs in comparison with other interolog-based methods used to build interactome databases. More recently, Jaeger et al. (2010) proposed another interesting method based on two steps. First, a set of all candidate interologs is built across the considered species. Next, interologs are assembled into maximal conserved and connected patterns by detecting frequent sub-graphs appearing in the interolog network of the candidate set. Only

The interolog concept was also modified and used in other ways and application domains. In particular, Tirosh & Barkai (2005) proposed a method to assess and increase the confidence of a predicted PPI by examining the co-expression of proteins of its potential interolog in other species. Chen et al. (2007) extended interolog mapping for homologous inference of interacting 3D-domains and they built a database of so-called 3D-interologs (Lo, Chen & Yang, 2010). Chen et al. (2009) used interologs to transfer conserved domain-domain interactions. Recently, Lo, Lin & Yang (2010) combined this interolog domain transfer with the former 3D-interolog detection technique and implemented an integrated tool for searching homologous protein complexes. Finally, Lee et al. (2008) exploited interologs to predict

Despite the successful use of interolog inference, a gap was observed between the actual, observed number of conserved interactions and the expected theoretical coverage (Gandhi et al., 2006; Lee et al., 2008). In order to test the reliability of interolog transfer, Mika & Rost (2006) performed a comprehensive validation of the method on several datasets. Their findings suggested that interolog transfers are only accurate at very high levels of sequence identity. In addition, they also compared the interolog transfer within species and across species. In the case of within-species interolog inference a PPI is transferred onto proteins which are sequence similar to the proteins of the considered PPI in the same species. Surprisingly, such paralogous interolog transfers of protein-protein interactions were shown to be significantly more reliable than the orthologous ones. This result was later substantiated by Saeed & Deane (2008), indicating that homology-based interaction prediction methods may yield better results when within-species interolog inference is also considered. In addition, Brown & Jurisica (2007) argued that one also needs to take into account whether all interactions have equal probability of being transferred between organisms. For example, the dynamic components of the interactomes are less likely to be accurately mapped from

accuracy in contrast to a 30% of the previous method by Matthews et al. (2001).

functionally coherent patterns were used for interolog inference.

inter-species interactions.

rates, whereas young modules can be enriched with dynamic, transient interactions with less but fast evolving protein domains to allow adaptation to the environment.

## **3. Using evolutionary information for knowledge discovery in PPI networks**

The tendency of functionally linked or physically interacting proteins and densely interacting motifs to exhibit correlated evolution and/or to be conserved across species is at the core of methods for inferring relevant biological information using PPI networks. Although such biological information can be limited and biased towards specific types of known interactions and protein functions, it allows one to infer new, unknown functions of proteins, to improve the understanding of biological systems, and to guide the discovery of drug-target interactions. In its basic form, the knowledge discovery process is based on the transfer of information involving a single interaction between two organisms, while in its most complex form it involves the identification and transfer of protein complexes across multiple species. In the following we summarize concepts and techniques used to achieve these goals, in particular the notions of "interologs" and of multiple PPI network alignment.

### **3.1 Predicting protein interaction: Interologs**

If two proteins physically interact in one species and they have orthologous counterparts in another species, it is likely that their orthologs interact in that species too. If such conserved interactions exist, they are called *interologs*. This simple method of protein interaction inference was first introduced and tested by Walhout et al. (2000) on proteins involved in vulval development of the nematode worm, where potential interactions between these proteins were identified based on interactions of their orthologs in other species. Later, Matthews et al. (2001) performed a large-scale analysis of this inference technique using the yeast PPI network as a model and the proteins of worm as a target. Although the success rate of detection of inferred interactions by Y2H analysis was only between 16% and 31%, it represented a 600-1100-fold increase compared to a conventional approach at that time (Matthews et al., 2001).

The interologs-based protein interaction prediction has become one of the standard methods for *in silico* PPI prediction. The method can be easily extended to PPI data from multiple species. In particular, given two groups of orthologs, where each ortholog group contains proteins from the same *N* species, and an observed interaction between proteins of these orthologous groups in (*N* − 1) species, the interaction between the proteins of the *N*-th species present in the ortholog groups can be predicted. This multidimensional character of interolog inference has been extensively used to predict and build databases of the whole interactome for various species, either as a stand-alone approach or in combination with other *in silico* methods, which often integrate multiple data types including gene co-expression, co-localization, functional category, the occurrence of orthologs and other genomic context methods. In this way researchers could provide, for instance, the first sketch of the human interactome (Lehner & Fraser, 2004), build the interactome of plants (Geisler-Lee et al., 2007; Gu et al., 2011), and improve the understanding of processes in a malarial parasite (Pavithra et al., 2007) or in cancer (Jonsson & Bates, 2006). Three up-to-date tools have also been recently implemented and made available to perform this inference task (Gallone et al., 2011; Michaut et al., 2008; Pedamallu & Posfai, 2010).
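The multi-species rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the procedure of any of the cited tools: the data layout (ortholog groups as dictionaries from species to protein, interactions as sets of unordered protein pairs), the function name `predict_interologs` and the optional `min_support` parameter are assumptions made for the example.

```python
from itertools import combinations


def predict_interologs(groups, ppis, target, min_support=None):
    """Predict PPIs in `target` from interactions of orthologs in other species.

    groups: list of dicts, each mapping species -> protein (one ortholog group).
    ppis:   dict mapping species -> set of frozenset({a, b}) interactions.
    min_support: number of other species that must show the interaction;
                 defaults to all of them (the N - 1 rule).
    """
    predictions = set()
    for g1, g2 in combinations(groups, 2):
        if target not in g1 or target not in g2:
            continue
        # Species (other than the target) represented in both groups.
        others = [s for s in g1 if s != target and s in g2]
        if not others:
            continue
        needed = len(others) if min_support is None else min_support
        support = sum(
            frozenset((g1[s], g2[s])) in ppis.get(s, set()) for s in others
        )
        if support >= needed:
            predictions.add(frozenset((g1[target], g2[target])))
    return predictions


# Hypothetical toy data: three ortholog groups over three species.
groups = [
    {"yeast": "Y1", "worm": "W1", "fly": "F1"},
    {"yeast": "Y2", "worm": "W2", "fly": "F2"},
    {"yeast": "Y3", "worm": "W3", "fly": "F3"},
]
ppis = {
    "yeast": {frozenset(("Y1", "Y2")), frozenset(("Y1", "Y3"))},
    "fly": {frozenset(("F1", "F2"))},
}
predicted = predict_interologs(groups, ppis, target="worm")
```

With the strict rule only the W1-W2 interaction is predicted, because it is supported in both yeast and fly; relaxing `min_support` to 1 would also admit W1-W3, which is supported in yeast alone.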


Several algorithmic enhancements of the interologs-based approach have been introduced since the first proposal of a systematic use of interolog inference (Matthews et al., 2001). For instance, Yu et al. (2004) strengthened the definition of ortholog by using a reciprocal best-hit approach and compared it to the original one-way best-hit approach implemented by Matthews et al. (2001). In addition, they required a minimum level of joint similarity of the orthologous sequences in order to perform interolog mapping. Their method yielded 54% accuracy, in contrast to 30% for the previous method by Matthews et al. (2001).
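The difference between one-way and reciprocal best hits can be illustrated with a small sketch. The similarity scores, protein names and function names below are hypothetical, and the joint-similarity threshold required by Yu et al. (2004) is not modelled here.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring hit.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep only pairs (a, b) where a's best hit is b and b's best hit is a."""
    fwd = best_hits(a_vs_b)
    rev = best_hits(b_vs_a)
    return {(a, b) for a, b in fwd.items() if rev.get(b) == a}


# Hypothetical similarity scores between proteins of species A and B.
a_vs_b = {("A1", "B1"): 90, ("A1", "B2"): 40, ("A2", "B1"): 70, ("A2", "B2"): 30}
b_vs_a = {("B1", "A1"): 90, ("B1", "A2"): 70, ("B2", "A1"): 40, ("B2", "A2"): 30}

one_way = best_hits(a_vs_b)
rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy example the one-way approach assigns B1 as the ortholog of both A1 and A2, whereas the reciprocal criterion keeps only the pair (A1, B1).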

Other approaches exploited the knowledge that PPIs in dense network motifs have a higher conservation rate. For instance, Huang et al. (2007) scored interologs according to the density of the topological pattern containing the respective PPI of the interolog in a model species, as determined by the extraction of maximal quasi-cliques from the PPI network of the model species. This score was integrated with scores of various other features used for PPI prediction, such as tissue specificity, sub-cellular localization, interacting domains and cell-cycle stage. The use of multiple types of features was shown to yield more accurate predictions of PPIs in comparison with other interolog-based methods used to build interactome databases. More recently, Jaeger et al. (2010) proposed another interesting method based on two steps. First, a set of all candidate interologs is built across the considered species. Next, interologs are assembled into maximal conserved and connected patterns by detecting frequent sub-graphs appearing in the interolog network of the candidate set. Only functionally coherent patterns were used for interolog inference.

The interolog concept has also been modified and used in other ways and application domains. In particular, Tirosh & Barkai (2005) proposed a method to assess and increase the confidence of a predicted PPI by examining the co-expression of the proteins of its potential interolog in other species. Chen et al. (2007) extended interolog mapping to the homologous inference of interacting 3D-domains and they built a database of so-called 3D-interologs (Lo, Chen & Yang, 2010). Chen et al. (2009) used interologs to transfer conserved domain-domain interactions. Recently, Lo, Lin & Yang (2010) combined this interolog domain transfer with the earlier 3D-interolog detection technique and implemented an integrated tool for searching homologous protein complexes. Finally, Lee et al. (2008) exploited interologs to predict inter-species interactions.

Despite the successful use of interolog inference, a gap was observed between the observed number of conserved interactions and the expected theoretical coverage (Gandhi et al., 2006; Lee et al., 2008). To test the reliability of interolog transfer, Mika & Rost (2006) performed a comprehensive validation of the method on several datasets. Their findings suggested that interolog transfers are only accurate at very high levels of sequence identity. They also compared interolog transfer within species and across species. In within-species interolog inference, a PPI is transferred onto proteins of the same species that are sequence-similar to the proteins of the considered PPI. Surprisingly, such paralogous interolog transfers of protein-protein interactions were shown to be significantly more reliable than the orthologous ones. This result was later substantiated by Saeed & Deane (2008), indicating that homology-based interaction prediction methods may yield better results when within-species interolog inference is also considered. In addition, Brown & Jurisica (2007) argued that one also needs to take into account whether all interactions have equal probability of being transferred between organisms. For example, the dynamic components of the interactomes are less likely to be accurately mapped from distantly related organisms. Moreover, there is an apparent bias of interologs to be enriched in stable, permanent complexes (Brown & Jurisica, 2007), which is in accordance with findings on the different evolution of transient and permanent interactions. On the other hand, it is likely that the performance of interolog inference is underestimated, since its accuracy is assessed using experimental tests based on Y2H techniques or high-throughput datasets with a high abundance of Y2H interactions, which were found to be highly enriched in transient and inter-complex connections (Yu et al., 2008).

### **3.2 Pairwise protein network alignment**

Detection and transfer of an interolog between species have motivated the study and exploration of interspecies conservation of protein interactions on a global scale. In particular, instead of focusing on a single conserved interaction, one can compare and align whole interactome maps of distinct species, which mimics the idea behind sequence alignment methods. This gave rise to the so-called *network alignment* approach (Sharan & Ideker, 2006).

Using protein network alignment, one can either search for conserved functional network structures such as protein complexes and pathways, or identify functional orthologs across species. As a result, this approach should provide greater evidence and support for protein function and protein interaction prediction for as yet uncharacterized or unknown biological processes. Protein network alignment methods can be classified into two main groups: *local network alignments* and *global network alignments*.

As most of the research attention has focused on comparing PPI networks of two different species, here we discuss the successive development of methods for so-called *pairwise network alignment*. In the following we survey local pairwise alignments for detecting evolutionarily conserved pathways, local pairwise alignments for detecting conserved protein complexes, and global pairwise network alignment techniques.

### **3.2.1 Local pairwise network alignment for pathway detection and query tasks**

The main goal of local protein network alignment is to detect conserved pathways and protein complexes across species, by searching for local regions of the input networks having both high topological similarity between the regions and high sequence similarity between the proteins of these regions. The standard approach to this task consists of two main phases: *an alignment phase* and *a searching phase*. In the first phase a merged network representation of the compared PPI networks is constructed, called the *alignment graph* or *orthology graph*. The second phase performs a search for the structures of interest in the orthology graph. Each output result corresponds to a pair or multiplet of complexes or pathways which are evolutionarily conserved across the two or more (PPI networks of the) species, respectively.

The first alignment method of whole PPI networks of two species using protein sequence similarity was introduced by Kelley et al. (2003). In this method, called PathBLAST, first a many-to-many mapping between proteins of the two species is determined by considering each pair of proteins with a sequence similarity higher than a given threshold as putative orthologs. Next, every orthologous pair is encoded as one alignment node of the new alignment graph, and three types of edges (direct, gap and mismatch) are identified between these alignment nodes as follows. The direct edge corresponds to the case when a PPI between proteins of two orthologous pairs exists in the PPI networks of both species. The gap edge represents the case when, in one species, the respective proteins of the alignment nodes are connected indirectly through a common neighbour. Finally, the mismatch edge between alignment nodes is formed if such an indirect connection is found between the corresponding proteins in the PPI networks of both species. Gap and mismatch edges are used to describe possible evolutionary variations or to account for experimental errors in data (Kelley et al., 2003). In the search phase, the alignment graph is turned into acyclic sub-graphs by random removal of alignment edges, which allows high-scoring paths to be extracted in linear time by a dynamic programming approach. The score of a path is computed as the sum of log probabilities of true orthology encoded in the alignment nodes of the path and of true conserved interactions encoded by the alignment edges contained in the path. Interestingly, the method was also applied to align a PPI network with its own copy, which made it possible to identify conserved (paralogous) pathways within one species.
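The three edge types of the alignment graph can be made concrete with a short sketch. It is an illustration of the edge definitions given above and not the PathBLAST implementation: the adjacency-dictionary representation, the helper names, and the reading of a gap edge as "direct in one species, indirect in the other" are assumptions of the example, and no probabilistic scoring is included.

```python
from itertools import combinations


def link_type(adj, u, v):
    """Return 1 for a direct interaction, 2 for a shared neighbour, else None."""
    if v in adj.get(u, set()):
        return 1
    if adj.get(u, set()) & adj.get(v, set()):
        return 2
    return None


def alignment_edges(adj1, adj2, ortholog_pairs):
    """Classify edges between alignment nodes (a, b) as direct, gap or mismatch."""
    edges = {}
    for (a1, b1), (a2, b2) in combinations(ortholog_pairs, 2):
        if a1 == a2 or b1 == b2:
            continue  # assumption: the two nodes must use distinct proteins
        d1, d2 = link_type(adj1, a1, a2), link_type(adj2, b1, b2)
        if d1 is None or d2 is None:
            continue  # not connected closely enough in at least one species
        if d1 == 1 and d2 == 1:
            kind = "direct"
        elif d1 == 2 and d2 == 2:
            kind = "mismatch"
        else:
            kind = "gap"
        edges[((a1, b1), (a2, b2))] = kind
    return edges


def undirected(pairs):
    """Build a symmetric adjacency dictionary from a list of interactions."""
    adj = {}
    for u, v in pairs:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Hypothetical toy networks and putative ortholog pairs.
net1 = undirected([("a1", "a2"), ("a2", "a3"), ("a3", "a4")])
net2 = undirected([("b1", "b2"), ("b2", "b3"), ("b3", "y"), ("y", "b4")])
orthologs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
edges = alignment_edges(net1, net2, orthologs)
```

In this toy example the alignment nodes (a3, b3) and (a4, b4) are joined by a gap edge, because a3 and a4 interact directly while b3 and b4 are only connected through the extra protein y.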

The work of Kelley et al. (2003) was followed by other alignment techniques for discovering conserved pathways based on evolutionary conservation. The main drawbacks of PathBLAST are that it detects only conserved linear pathways in protein interaction data, which is represented as an undirected graph, and that its efficiency worsens exponentially with the length of the pathway to be detected. To circumvent these limitations, Pinter et al. (2005) proposed an alignment technique designed explicitly for metabolic networks with directed links between enzymes. The method also handles more complex structures than a simple path, because the scoring of the alignment is based on sub-tree homeomorphism, which can be solved by an efficient deterministic approximation. Another enhancement for the pathway alignment problem was proposed by Wernicke & Rasche (2007), who designed a method that does not impose topological restrictions upon pathways and exploits the biological and local properties of pathways within the network. Another effective approach to metabolic network alignment was developed by Li et al. (2008), which uses an integrative score on compound and enzyme similarities. Pathway alignment has been further extensively investigated and various other techniques have been proposed (see e.g. Cheng et al., 2008; Koyutürk, Kim, Subramaniam, Szpankowski & Grama, 2006; Li et al., 2007).

The evolutionary mapping of PathBLAST can also be used to query a known pathway of one species against the PPI network of another species. However, due to the limitations and algorithmic constraints of PathBLAST, many other methods have been developed that focus on orthologous querying of biological functional complexes, and tools and web services are available for querying general pathways and other types of protein functional modules across species (see e.g. Bruckner et al., 2009; Dost et al., 2008; Qian et al., 2009; Yang & Sze, 2007).

### **3.2.2 Local pairwise network alignment for protein complex detection**

Another group of methods which followed PathBLAST focuses on the detection of conserved protein complexes across (PPI networks of two or more) species. As these methods compare networks of physical interactions, the identified complexes can be used for interolog prediction as well as for protein function prediction of as yet uncharacterized proteins. The detected conserved complexes are either (putative) entire physical complexes or conserved parts of them.

their gene co-expression characteristics and coherence of functional annotations. Thus, the method can be seen as detecting functional modules shared across species rather than strictly evolutionary modules. Finally, Berg & Lässig (2006) developed a generalized alignment

A Survey on Evolutionary Analysis in PPI Networks 443

Despite various pairwise alignment techniques have been introduced, only a few of them embody an evolutionary model of PPI networks in the scoring scheme of an alignment. Notably, Koyutürk, Kim, Topkara, Subramaniam, Grama & Szpankowski (2006) were the first to introduce a method that builds the orthology graph following the duplication/divergence model based on gene duplications. Another interesting method was proposed by Hirsh & Sharan (2007) who extended the probabilistic score of NetworkBLAST to asses the likelihood that two complexes originated from an ancestral complex in the common ancestor of the two species being compared under the evolutionary pressure of duplication and link dynamics

In contrast to local network alignment, which uses many-to-many homologous mapping between proteins of distinct species to detect local conserved regions of a high topological similarity in the respective PPI networks, global protein network alignment uses this mapping to define an unique, globally optimal mapping across whole topologies of PPI networks (Singh et al., 2007), even if it were locally suboptimal in some regions of the networks. In the most strict form of this unique mapping each node in one input network is either matched to one node in the other input network or has no match in the other network. Thus the goal of global protein network alignment is to define functional orthologs across species, as the solution offers a way to resolve the ambiguity of orthology detection with the use of species interactome map. Naturally, as a by-product the global alignment can also identify conserved

To the best of our knowledge, the first method performing explicitly global alignment on pair of networks, called IsoRank, was introduced by Singh et al. (2007). Similarly to the local network alignment problem, the global network alignment problem is in general computationally intractable. As a consequence, IsoRank employs an approximation using an

Several advancements have naturally followed the introduction of IsoRank. For instance, Evans et al. (2008) proposed an asymmetric network matching algorithm based on a network simulation method called quantitative simulation, where a similarity score of a protein pair is iteratively updated by the similarity scores of their neighbours and vice versa until a unique global optimum is found. Other researchers focused more on formulating global alignment as combinatorial optimization problems. For instance Zaslavskiy et al. (2009) redefined the problem of global alignment as a standard graph matching problem and investigated methods using ideas and approaches from state-of-the-art graph matching techniques. Klau (2009) formalized global network alignment as an integer linear programming problem, where a near-optimal solution with a quality guarantee is found by solving a Lagrangian relaxation of the original optimization formulation. Recently, Chindelevitch et al. (2010) proposed a method where the global alignment is encoded as bipartite matching and applied a very efficient local

eigenvalue framework in a manner analogous to Google's PageRank algorithm.

optimization heuristic used for the well-known Travelling Salesman Problem.

Bayesian method applicable to different biological networks.

**3.2.3 Global pairwise network alignment**

complexes or pathways.

events.

To the best of our knowledge, the first method for detecting conserved complexes using pairwise comparison of PPI networks was introduced by Sharan, Ideker, Kelley, Shamir & Karp (2005) and called NetworkBLAST. It can be viewed as a direct extension of PathBLAST for the task of complex detection across species. The method employs a comprehensive probabilistic model for conservation of protein complexes and searches for heavy induced sub-graphs in the weighted orthology graph. As the maximal induced sub-graph problem is computationally intractable, NetworkBLAST employs a bottom-up greedy heuristic for this task.

Many network alignment techniques that followed NetworkBLAST are motivated by the computational intractability of finding a maximal common or induced sub-graph in an orthology graph, and are based on different heuristics. For instance, Koyutürk, Kim, Topkara, Subramaniam, Grama & Szpankowski (2006) partition the alignment graph into smaller clusters by performing an approximate balanced ratio-cut. In another method, by Koyutürk, Kim, Subramaniam, Szpankowski & Grama (2006), the most frequent interaction motifs are extracted from an orthology-contracted graph. Liang et al. (2006) transform the problem of finding a maximal common sub-graph into the problem of finding all maximal cliques in a graph. Recently, Tian & Samatova (2009) introduced an algorithm based on the detection of connected components of the orthology graph, which can be solved very efficiently.
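To make the notion of an alignment (orthology) graph and its connected components concrete, the following sketch builds such a graph from two toy PPI networks and a list of putative ortholog pairs. It is a simplified illustration under stated assumptions, not the algorithm of any of the cited authors: nodes are ortholog pairs, two nodes are linked only when the corresponding proteins interact in both species, and all names and data (`alignment_graph`, `yeast_edges`, `fly_edges`, `orthologs`) are hypothetical.

```python
from collections import deque


def alignment_graph(edges_a, edges_b, ortholog_pairs):
    """Nodes are ortholog pairs (a, b); two nodes are linked when the
    proteins interact in both species (a conserved interaction)."""
    set_a = {frozenset(e) for e in edges_a}
    set_b = {frozenset(e) for e in edges_b}
    pairs = list(ortholog_pairs)
    graph = {pair: set() for pair in pairs}
    for i, (a1, b1) in enumerate(pairs):
        for a2, b2 in pairs[i + 1:]:
            if frozenset((a1, a2)) in set_a and frozenset((b1, b2)) in set_b:
                graph[(a1, b1)].add((a2, b2))
                graph[(a2, b2)].add((a1, b1))
    return graph


def connected_components(graph):
    """Breadth-first search over the alignment graph."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), {start}
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


# Toy data: interactions in two species and putative ortholog pairs.
yeast_edges = [("y1", "y2"), ("y2", "y3"), ("y4", "y5")]
fly_edges = [("f1", "f2"), ("f2", "f3"), ("f4", "f6")]
orthologs = [("y1", "f1"), ("y2", "f2"), ("y3", "f3"), ("y4", "f4"), ("y5", "f5")]

graph = alignment_graph(yeast_edges, fly_edges, orthologs)
components = connected_components(graph)
# Components with more than one node are candidate conserved modules.
conserved = [c for c in components if len(c) > 1]
```

Here the interaction `y4`-`y5` has no conserved counterpart in the second network, so the pairs `(y4, f4)` and `(y5, f5)` remain isolated and only the three-node chain is reported as a candidate conserved module.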

Other researchers propose to restrict the search space, rather than rely on heavy heuristics, in order to cope with the intractability of the search phase. For example, Li et al. (2007) pre-cluster one PPI network in order to detect candidate complexes, which are afterwards aligned to the target species network with an exact integer programming algorithm. Jancura & Marchiori (2010) proposed a pre-processing algorithm based on the detection of network hubs for dividing PPI networks, prior to their alignment, into smaller sub-networks containing potential conserved modules. Each possible pair of sub-networks can later be aligned with a state-of-the-art alignment method where the search phase can be performed by means of an exact algorithm, allowing one to perform network comparison in a fully modular fashion and possibly to parallelize the computation. An interesting modular approach was introduced by Narayanan & Karp (2007), where an orthology graph is not constructed; instead, the networks are compared and split consecutively in several recursive steps until all possible solutions, that is, conserved sub-graphs, are found. Similarly, Gerke et al. (2007) only compare, but do not merge, local hub-centred regions of PPI networks as identified by clustering coefficients and node degrees. The method by Ali & Deane (2009) is another example of an approach where an alignment graph is not explicitly constructed; there, interspecies protein similarities are treated as new edges, so that the species PPI networks and the similarity edges between them are encoded into a single global meta-graph which can be searched by standard clustering techniques.

There are also alignment methods which incorporate types of information other than sequence similarity and interaction conservation. For instance, Guo & Hartemink (2009) exploited the findings on co-evolving interacting domains which mediate PPIs and, instead of using putatively homologous proteins for alignment, compared PPI networks across species according to conserved domains of protein-protein interactions. Ali & Deane (2009) propose a functionally guided alignment of PPI networks, where a scoring function incorporates not only sequence and topological similarity of aligned proteins but also their gene co-expression characteristics and coherence of functional annotations. Thus, the method can be seen as detecting functional modules shared across species rather than strictly evolutionary modules. Finally, Berg & Lässig (2006) developed a generalized Bayesian alignment method applicable to different biological networks.

Although various pairwise alignment techniques have been introduced, only a few of them embody an evolutionary model of PPI networks in the scoring scheme of an alignment. Notably, Koyutürk, Kim, Topkara, Subramaniam, Grama & Szpankowski (2006) were the first to introduce a method that builds the orthology graph following the duplication/divergence model based on gene duplications. Another interesting method was proposed by Hirsh & Sharan (2007), who extended the probabilistic score of NetworkBLAST to assess the likelihood that two complexes originated from an ancestral complex in the common ancestor of the two species being compared, under the evolutionary pressure of duplication and link dynamics events.

### **3.2.3 Global pairwise network alignment**

In contrast to local network alignment, which uses a many-to-many homologous mapping between proteins of distinct species to detect local conserved regions of high topological similarity in the respective PPI networks, global protein network alignment uses this mapping to define a unique, globally optimal mapping across the whole topologies of the PPI networks (Singh et al., 2007), even if it is locally suboptimal in some regions of the networks. In the strictest form of this unique mapping, each node in one input network is either matched to exactly one node in the other input network or has no match there. Thus the goal of global protein network alignment is to define functional orthologs across species, as the solution offers a way to resolve the ambiguity of orthology detection with the use of species interactome maps. Naturally, as a by-product the global alignment can also identify conserved complexes or pathways.

To the best of our knowledge, the first method explicitly performing global alignment on a pair of networks, called IsoRank, was introduced by Singh et al. (2007). Similarly to the local network alignment problem, the global network alignment problem is in general computationally intractable. As a consequence, IsoRank employs an approximation using an eigenvalue framework in a manner analogous to Google's PageRank algorithm.
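The PageRank-like idea can be sketched as a power iteration over pairs of nodes: a pair (i, j) receives a high score when the neighbours of i and j also form high-scoring pairs, blended with a prior similarity such as a normalized sequence score. The sketch below is written in the spirit of that formulation and is not the published IsoRank code; the blending weight `alpha`, the fixed number of iterations, the toy prior and the greedy extraction of a one-to-one mapping are all assumptions made for illustration.

```python
def isorank_like_scores(adj1, adj2, prior, alpha=0.8, iterations=50):
    """Power iteration on node-pair scores.

    adj1, adj2: dict node -> set of neighbours (one per network).
    prior: dict (i, j) -> non-negative prior similarity (e.g. sequence based).
    """
    pairs = [(i, j) for i in adj1 for j in adj2]
    total_prior = sum(prior.get(p, 0.0) for p in pairs)
    e = {p: prior.get(p, 0.0) / total_prior for p in pairs}
    r = {p: 1.0 / len(pairs) for p in pairs}
    for _ in range(iterations):
        new_r = {}
        for i, j in pairs:
            # Each neighbouring pair (u, v) spreads its score evenly
            # over the |N(u)| * |N(v)| pairs it supports.
            support = sum(
                r[(u, v)] / (len(adj1[u]) * len(adj2[v]))
                for u in adj1[i]
                for v in adj2[j]
            )
            new_r[(i, j)] = alpha * support + (1 - alpha) * e[(i, j)]
        norm = sum(new_r.values())
        r = {p: s / norm for p, s in new_r.items()}
    return r


def greedy_one_to_one(scores):
    """Pick the highest-scoring pairs so that each node is matched at most once."""
    used1, used2, mapping = set(), set(), {}
    for (i, j), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if i not in used1 and j not in used2:
            mapping[i] = j
            used1.add(i)
            used2.add(j)
    return mapping


# Two toy path networks a-b-c and x-y-z with an ambiguous prior for protein a.
net1 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
net2 = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
prior = {("a", "x"): 1.0, ("a", "z"): 0.9, ("b", "y"): 1.0, ("c", "z"): 1.0}

scores = isorank_like_scores(net1, net2, prior)
mapping = greedy_one_to_one(scores)
```

In the toy example protein `a` has two candidate partners with similar prior scores; the combined score resolves the ambiguity in favour of `x`, giving the one-to-one mapping a to x, b to y and c to z. A maximum-weight bipartite matching could replace the greedy step when an optimal assignment is required.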

Several advancements have naturally followed the introduction of IsoRank. For instance, Evans et al. (2008) proposed an asymmetric network matching algorithm based on a network simulation method called quantitative simulation, where a similarity score of a protein pair is iteratively updated by the similarity scores of their neighbours and vice versa until a unique global optimum is found. Other researchers focused more on formulating global alignment as combinatorial optimization problems. For instance Zaslavskiy et al. (2009) redefined the problem of global alignment as a standard graph matching problem and investigated methods using ideas and approaches from state-of-the-art graph matching techniques. Klau (2009) formalized global network alignment as an integer linear programming problem, where a near-optimal solution with a quality guarantee is found by solving a Lagrangian relaxation of the original optimization formulation. Recently, Chindelevitch et al. (2010) proposed a method where the global alignment is encoded as bipartite matching and applied a very efficient local optimization heuristic used for the well-known Travelling Salesman Problem.

### **3.3 Multiple protein network alignment**

The network alignment methods discussed so far align two PPI networks of distinct species. The next natural extension is aligning more than two PPI networks, that is, multiple network alignment. A first attempt to perform multiple local network alignment using three species was made by Sharan, Suthram, Kelley, Kuhn, McCuine, Uetz, Sittler, Karp & Ideker (2005), who exploited the scoring model of NetworkBLAST. However, the method scales exponentially with the number of input species and consequently it is ineffective for large-scale comparisons.

Apart from the scalability problem, there are also other issues related to aligning more than two species. For instance, the putative orthologous mapping of certain proteins does not need to span all species, meaning that proteins may be conserved only for a particular subset of species. This "orthology decay" is more evident when a large number of increasingly distant species are considered in the alignment. As a result, functional modules, such as pathways and complexes, can have different degrees of conservation, with some modules being strictly conserved across all species and others being conserved only for a particular clade. Thus, a good alignment method should allow one to search for conserved modules at different degrees of conservation. However, such a requirement also increases the complexity of the search and consequently one may need to prune the number of all possible species combinations in the alignment.

To the best of our knowledge, the first method capable of an efficient comparison of multiple PPI networks, called Graemlin, was introduced by Flannick et al. (2006). The alignment model of the method allows one to perform local as well as global alignment and is also applicable to querying particular biological modules of interest across PPI networks. It employs a rather involved scoring scheme which allows one to search for conserved pathways as well as for conserved complexes. It also outputs modules with different degrees of conservation. Graemlin progressively aligns the closest pair of PPI networks according to the species distance measured using a phylogenetic tree, until the last pair at the root of the tree is compared, corresponding to the most conserved parts of the aligned networks. The main disadvantage of this approach is that it requires the estimation of many parameters. Recently, a supervised, automated parameter learner was proposed to lessen the burden of parameter tuning (Flannick et al., 2009).

Another phylogeny-guided local network alignment was proposed by Kalaev et al. (2008). Although the method uses the same probabilistic scoring for conserved complexes as NetworkBLAST, it avoids its exponential scaling by redefining the alignment model such that it does not construct the merged representation of the aligned networks but represents them as separate layers interconnected via the orthologous mapping. Then a seed, that is, a group of putatively orthologous proteins spanning all species, is selected using the species phylogeny and greedily expanded by adding other proteins that are orthologous to each other in all respective species, in order to maximize the alignment conservation score. The proposed method, however, identifies only protein complexes conserved across all species and does not detect complexes conserved only for a certain subset of species.

Notably, the functionally guided network alignment method of Ali & Deane (2009), previously mentioned as one of the methods for pairwise alignment, was also shown to efficiently perform local alignment of multiple networks.

All these multiple local network alignments do not reconstruct a plausible evolutionary history of PPI networks based on a model of evolution, although they might be phylogeny-aware. Motivated by this observation, Dutkowski & Tiuryn (2007) introduced a new multiple local network alignment method, called CAPPI, which from the given PPI networks of distinct species aims to reconstruct an ancient PPI network of the common ancestor. The method uses a Bayesian inference framework based on a duplication and divergence model of network evolution which mimics the processes by which most protein interactions are formed. After the reconstruction step, the ancestral network is decomposed into connected components which correspond to the ancestral modules of protein interactions and are projected back to the original networks to obtain the actual conserved network residues. Although the demonstrated application of the method was restricted to orthologous groups spanning across all species (Dutkowski & Tiuryn, 2007), to the best of our knowledge CAPPI is the only model-based approach for large-scale ancestral network reconstruction.
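The duplication and divergence model underlying such approaches can be illustrated by a small forward simulation. The sketch below is not CAPPI and performs no Bayesian inference or ancestral reconstruction; it only generates a network by repeatedly duplicating a randomly chosen protein and letting the copy keep each parental interaction with a fixed probability. The function name, the parameter `retain_prob`, the naming scheme of the duplicates and the seed network are all hypothetical.

```python
import random


def duplication_divergence(initial_edges, steps, retain_prob, seed=0):
    """Grow a PPI network by gene duplication followed by interaction loss.

    At each step a random protein is duplicated; the copy inherits each of
    the parent's interactions independently with probability `retain_prob`.
    """
    rng = random.Random(seed)
    adj = {}
    for u, v in initial_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    for step in range(steps):
        parent = rng.choice(sorted(adj))
        child = f"dup{step}"  # assumes no initial protein uses this name
        # Divergence: some of the inherited interactions are lost.
        kept = {n for n in sorted(adj[parent]) if rng.random() < retain_prob}
        adj[child] = kept
        for n in kept:
            adj[n].add(child)
    return adj


# Seed network of two interacting proteins, grown by 20 duplication events.
network = duplication_divergence([("p1", "p2")], steps=20, retain_prob=0.6)
```

A model-based alignment method works in the opposite direction: given the present-day networks, it asks which ancestral network and which sequence of such events most plausibly produced them.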

Among the multiple alignment methods mentioned above, only Graemlin was shown to perform a global multiple network alignment, yet it relies on an involved parameter estimation step and a phylogeny-guided approximation. Recently, Liao et al. (2009) developed another global alignment technique which is fully unsupervised and phylogeny-free. The method, called IsoRankN, is built on the IsoRank algorithm mentioned above (Singh et al., 2007) and its extension to multiple global network alignment (Singh et al., 2008a). First, IsoRankN uses IsoRank to score the topological and sequence similarity between putatively orthologous proteins of each pair of input networks. Then, a maximum k-partite graph matching problem is formulated on the induced graph of pairwise alignment scores (Singh et al., 2008a) and the exact solution is approximated by a spectral graph partitioning algorithm. IsoRankN also effectively identifies one-to-one orthologous mappings for all subsets of species and appears to outperform Graemlin in terms of coverage and quality of functional enrichments.

### **3.4 Applications and future developments**

Local and global alignment methods have been successfully applied to study evolution of species and to discover relevant biological knowledge. For example, Suthram et al. (2005) applied the network alignment of Sharan, Suthram, Kelley, Kuhn, McCuine, Uetz, Sittler, Karp & Ideker (2005) to examine the degree of conservation between the Plasmodium protein network and other model organisms, such as yeast, nematode worm, fruit fly and the bacterial pathogen Helicobacter pylori. They investigated whether the divergence of Plasmodium at the sequence level is reflected in the configuration of its protein network. Indeed, the alignments showed very little conservation suggesting that the patterns of protein interaction in Plasmodium, like its genome sequence, set it apart from other species (Suthram et al., 2005).

Another application of local network alignment was performed by Tan et al. (2007), who combined transcriptional regulatory interactions with protein-protein interactions and identified co-regulated complexes between yeast and fly, revealing different conservation of their regulators. This finding suggests that PPI networks may evolve more slowly than transcriptional interaction networks. In addition, Schwartz et al. (2009) and Dutkowski & Tiuryn (2009) used conserved complexes detected by network alignments for protein interaction prediction in a manner similar to the interologs transfer approach and demonstrated their usefulness. In particular, Schwartz et al. (2009) provided a comprehensive experimental design which includes PPI prediction using network alignment, and demonstrated how effectively it reduces the cost of interactome mapping.

Furthermore, Bandyopadhyay et al. (2006) presented the first systematic identification of functional orthologs based on protein network comparison. They used the pairwise local alignment model of Kelley et al. (2003) to construct the orthology graph and then resolved the ambiguity of the orthology mapping by fitting a logistic function previously trained on a known set of functional orthologs. In contrast, Singh et al. (2008b) predicted functional orthologs in an unsupervised manner by explicitly using a global multiple network alignment method.

Finally, Kolar et al. (2008) performed a cross-species analysis of two herpes-viruses using the generalized Bayesian network alignment of Berg & Lässig (2006). Interestingly, the performed alignment employs evolutionary rates of sequences in its probabilistic scoring system and thus goes beyond the narrow use of orthologous mapping made by all other alignment techniques. The method predicted meaningful functional associations that could not be obtained from sequence or interaction data alone.

Despite the recent progress and the increasing number of network alignment tools, their further development remains an ongoing research issue, in particular for multiple network comparison. Only a few methods score an alignment according to evolutionary models, and only one of them fully reconstructs the evolutionary history of the network. This is in clear contrast with the numerous techniques for the reconstruction of the evolutionary history of gene families. Also, current alignment methods do not distinguish among diverse types of interactions, specifically between transient and permanent interactions. For example, prior knowledge of the different evolutionary behaviour of these types of physical interactions could be incorporated into the scoring scheme used for alignment construction.

In addition, all but one of the network comparison methods rely solely on the straightforward use of putative orthologous mapping as identified by sequence comparison or available in orthology databases, but they do not employ evolutionary measures, such as evolutionary distances or retentions, which can be derived from the corresponding sequence alignments. These measures assess the level of evolutionary conservation and they could potentially improve the performance of network alignments.

Almost all current applications of network alignment have worked with physical interactome networks. However, the power of network alignment for functional annotation and other systems biology applications could be explored by comparing more general, functional interaction networks. One may expect that such an alignment could reveal a higher number of conserved modules, as the interspecies conservation of modularity across protein networks increases with combined, integrated evidence for a pair of proteins to be functionally linked. Finally, all available methods considered here focus on the conservation of modules but not on the more general concept of module evolutionary cohesiveness or co-evolution. Evolutionary cohesiveness can be assessed especially in the case of multiple alignments. Indeed, all conserved modules are inherently very cohesive; however, not all evolutionary modules need to exhibit correlated conservation at the level expected by current multiple network alignments. Protein functional modules differ in the degree of conservation and also in the degree of cohesiveness.

### **4. References**

Agarwal, S., Deane, C. M., Porter, M. A. & Jones, N. S. (2010). Revisiting date and party hubs: Novel approaches to role assignment in protein interaction networks, *PLoS Comput Biol* 6(6): e1000817.

Ali, W. & Deane, C. M. (2009). Functionally guided alignment of protein interaction networks for module detection, *Bioinformatics* 25(23): 3166–3173.

Aragues, R., Sali, A., Bonet, J., Marti-Renom, M. A. & Oliva, B. (2007). Characterization of protein hubs by inferring interacting motifs from protein interactions, *PLoS Comput Biol* 3(9): e178.

Bandyopadhyay, S., Sharan, R. & Ideker, T. (2006). Systematic identification of functional orthologs based on protein network comparison, *Genome Research* 16(3): 428–435.

Batada, N. N., Hurst, L. D. & Tyers, M. (2006). Evolutionary and physiological importance of hub proteins, *PLoS Comput Biol* 2(7): e88.

Batada, N. N., Reguly, T., Breitkreutz, A., Boucher, L., Breitkreutz, B.-J., Hurst, L. D. & Tyers, M. (2006). Stratus not altocumulus: A new view of the yeast protein interaction network, *PLoS Biol* 4(10): e317.

Batada, N. N., Reguly, T., Breitkreutz, A., Boucher, L., Breitkreutz, B.-J., Hurst, L. D. & Tyers, M. (2007). Still stratus not altocumulus: Further evidence against the date/party hub distinction, *PLoS Biol* 5(6): e154.

Beltrao, P., Trinidad, J. C., Fiedler, D., Roguev, A., Lim, W. A., Shokat, K. M., Burlingame, A. L. & Krogan, N. J. (2009). Evolution of phosphoregulation: Comparison of phosphorylation patterns across yeast species, *PLoS Biol* 7(6): e1000134.

Berg, J. & Lässig, M. (2006). Cross-species analysis of biological networks by Bayesian alignment, *Proceedings of the National Academy of Sciences* 103(29): 10967–10972.

Bertin, N., Simonis, N., Dupuy, D., Cusick, M. E., Han, J.-D. J., Fraser, H. B., Roth, F. P. & Vidal, M. (2007). Confirmation of organized modularity in the yeast interactome, *PLoS Biol* 5(6): e153.

Bloom, J. & Adami, C. (2003). Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein-protein interactions data sets, *BMC Evolutionary Biology* 3(1): 21.

Brown, K. & Jurisica, I. (2007). Unequal evolutionary conservation of human protein interactions in interologous networks, *Genome Biology* 8(5): R95.

Bruckner, S., Hüffner, F., Karp, R. M., Shamir, R. & Sharan, R. (2009). Torque: topology-free querying of protein interaction networks, *Nucleic Acids Research* 37(Web Server issue): W106–108.

Campillos, M., von Mering, C., Jensen, L. J. & Bork, P. (2006). Identification and analysis of evolutionarily cohesive functional modules in protein networks, *Genome Research* 16(3): 374–382.

Chen, C.-C., Lin, C.-Y., Lo, Y.-S. & Yang, J.-M. (2009). Ppisearch: a web server for searching homologous protein-protein interactions across multiple species, *Nucleic Acids Research* 37(suppl 2): W369–W375.

Chen, Y.-C., Lo, Y.-S., Hsu, W.-C. & Yang, J.-M. (2007). 3d-partner: a web server to infer interacting partners and binding models, *Nucleic Acids Research* 35(suppl 2): W561–W567.

Chen, Y. & Xu, D. (2005). Understanding protein dispensability through machine-learning analysis of high-throughput data, *Bioinformatics* 21(5): 575–581.

Mostly all current applications of network alignments have worked with networks of physical interactome. However, the power of network alignment for functional annotation and other system biology applications could be explored when one performs comparison of more general, functional interaction networks. One may expect that such alignment could reveal a higher number of conserved modules as the interspecies conservation of modularity across protein networks increases with combined, integrated evidence for a pair of proteins to be functionally linked. Finally, all available methods here considered focused on conservation of modules but not on the more general concept of module evolutionary cohesiveness or co-evolution. The evolutionary cohesiveness can be assessed especially for the case of multiple alignments. Indeed, all conserved modules are inherently very cohesive, however not all evolutionary modules need to exhibit the correlated conservation at a level as expected by actual multiple network alignments. Protein functional modules differ in the degree of

interactions could be incorporated into a scoring scheme of alignment construction.

and demonstrated how effectively it reduces the cost of interactome mapping.

obtained from sequence or interaction data alone.

improve the performance of network alignments.

conservation and also in the degree of cohesiveness.


Cheng, Q., Berman, P., Harrison, R. & Zelikovsky, A. (2008). Fast alignments of metabolic networks, *BIBM '08: Proceedings of the 2008 IEEE International Conference on Bioinformatics and Biomedicine*, IEEE Computer Society, Washington, DC, USA, pp. 147–152.

Chindelevitch, L., Liao, C.-S. & Berger, B. (2010). Local optimization for global alignment of protein interaction networks, *Pacific Symposium on Biocomputing* 15: 123–132.

Clark, G. W., Dar, V.-u.-N., Bezginov, A., Yang, J. M., Charlebois, R. L. & Tillier, E. R. M. (2011). Using coevolution to predict protein-protein interactions, *in* G. Cagney, A. Emili & J. M. Walker (eds), *Network Biology*, Vol. 781 of *Methods in Molecular Biology*, Humana Press, pp. 237–256.

Cordero, O. X., Snel, B. & Hogeweg, P. (2008). Coevolution of gene families in prokaryotes, *Genome Research* 18(3): 462–468.

Coulomb, S., Bauer, M., Bernard, D. & Marsolier-Kergoat, M.-C. (2005). Gene essentiality and the topology of protein interaction networks, *Proceedings of the Royal Society B: Biological Sciences* 272(1573): 1721–1725.

Deng, J., Deng, L., Su, S., Zhang, M., Lin, X., Wei, L., Minai, A. A., Hassett, D. J. & Lu, L. J. (2011). Investigating the predictability of essential genes across distantly related organisms using an integrative approach, *Nucleic Acids Research* 39(3): 795–807.

Dost, B., Shlomi, T., Gupta, N., Ruppin, E., Bafna, V. & Sharan, R. (2008). Qnet: A tool for querying protein interaction networks, *Journal of Computational Biology* 15(7): 913–925.

Doyle, M., Gasser, R., Woodcroft, B., Hall, R. & Ralph, S. (2010). Drug target prediction and prioritization: using orthology to predict essentiality in parasite genomes, *BMC Genomics* 11(1): 222.

Drummond, D. A., Raval, A. & Wilke, C. O. (2006). A single determinant dominates the rate of yeast protein evolution, *Molecular Biology and Evolution* 23(2): 327–337.

Dutkowski, J. & Tiuryn, J. (2007). Identification of functional modules from conserved ancestral protein-protein interactions, *Bioinformatics* 23(13): i149–158.

Dutkowski, J. & Tiuryn, J. (2009). Phylogeny-guided interaction mapping in seven eukaryotes, *BMC Bioinformatics* 10(1): 393.

Ekman, D., Light, S., Björklund, A. K. & Elofsson, A. (2006). What properties characterize the hub proteins of the protein-protein interaction network of saccharomyces cerevisiae?, *Genome Biology* 7(6): R45.

Evans, P., Sandler, T. & Ungar, L. (2008). Protein-protein interaction network alignment by quantitative simulation, *BIBM '08: Proceedings of the 2008 IEEE International Conference on Bioinformatics and Biomedicine*, IEEE Computer Society, Washington, DC, USA, pp. 325–328.

Fang, G., Rocha, E. & Danchin, A. (2005). How essential are nonessential genes?, *Molecular Biology and Evolution* 22(11): 2147–2156.

Flannick, J., Novak, A., Do, C. B., Srinivasan, B. S. & Batzoglou, S. (2009). Automatic parameter learning for multiple local network alignment, *Journal of Computational Biology* 16(8): 1001–1022.

Flannick, J., Novak, A., Srinivasan, B. S., McAdams, H. H. & Batzoglou, S. (2006). Graemlin: General and robust alignment of multiple large interaction networks, *Genome Res.* 16(9): 1169–1181.

Fokkens, L. & Snel, B. (2009). Cohesive versus flexible evolution of functional modules in eukaryotes, *PLoS Comput Biol* 5(1): e1000276.

Fraser, H. B. (2005). Modularity and evolutionary constraint on proteins, *Nat Genet* 37(4): 351–352.

Fraser, H. B., Hirsh, A. E., Steinmetz, L. M., Scharfe, C. & Feldman, M. W. (2002). Evolutionary rate in the protein interaction network, *Science* 296(5568): 750–752.

Fraser, H. & Hirsh, A. (2004). Evolutionary rate depends on number of protein-protein interactions independently of gene expression level, *BMC Evolutionary Biology* 4(1): 13.

Fraser, H., Wall, D. & Hirsh, A. (2003). A simple dependence between protein evolution rate and the number of protein-protein interactions, *BMC Evolutionary Biology* 3(1): 11.

Gallone, G., Simpson, T. I., Armstrong, J. D. & Jarman, A. (2011). Bio::homology::interologwalk - a perl module to build putative protein-protein interaction networks through interolog mapping, *BMC Bioinformatics* 12(1): 289.

Galperin, M. Y. & Koonin, E. V. (2000). Who's your neighbor? new computational approaches for functional genomics, *Nat Biotech* 18(6): 609–613.

Gandhi, T. K. B., Zhong, J., Mathivanan, S., Karthick, L., Chandrika, K. N., Mohan, S. S., Sharma, S., Pinkert, S., Nagaraju, S., Periaswamy, B., Mishra, G., Nandakumar, K., Shen, B., Deshpande, N., Nayak, R., Sarker, M., Boeke, J. D., Parmigiani, G., Schultz, J., Bader, J. S. & Pandey, A. (2006). Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets, *Nat Genet* 38(3): 285–293.

Geisler-Lee, J., O'Toole, N., Ammar, R., Provart, N. J., Millar, A. H. & Geisler, M. (2007). A predicted interactome for arabidopsis, *Plant Physiology* 145(2): 317–329.

Gerke, M., Bornberg-Bauer, E., Jiang, X. & Fuellen, G. (2007). Finding common protein interaction patterns across organisms, *Evolutionary bioinformatics online* 2: 45–52.

Giaever, G., Chu, A. M., Ni, L., Connelly, C., Riles, L., Veronneau, S., Dow, S., Lucau-Danila, A., Anderson, K., Andre, B., Arkin, A. P., Astromoff, A., El Bakkoury, M., Bangham, R., Benito, R., Brachat, S., Campanaro, S., Curtiss, M., Davis, K., Deutschbauer, A., Entian, K.-D., Flaherty, P., Foury, F., Garfinkel, D. J., Gerstein, M., Gotte, D., Guldener, U., Hegemann, J. H., Hempel, S., Herman, Z., Jaramillo, D. F., Kelly, D. E., Kelly, S. L., Kotter, P., LaBonte, D., Lamb, D. C., Lan, N., Liang, H., Liao, H., Liu, L., Luo, C., Lussier, M., Mao, R., Menard, P., Ooi, S. L., Revuelta, J. L., Roberts, C. J., Rose, M., Ross-Macdonald, P., Scherens, B., Schimmack, G., Shafer, B., Shoemaker, D. D., Sookhai-Mahadeo, S., Storms, R. K., Strathern, J. N., Valle, G., Voet, M., Volckaert, G., Wang, C.-y., Ward, T. R., Wilhelmy, J., Winzeler, E. A., Yang, Y., Yen, G., Youngman, E., Yu, K., Bussey, H., Boeke, J. D., Snyder, M., Philippsen, P., Davis, R. W. & Johnston, M. (2002). Functional profiling of the saccharomyces cerevisiae genome, *Nature* 418: 387–391.

Glazko, G. & Mushegian, A. (2004). Detection of evolutionarily stable fragments of cellular pathways by hierarchical clustering of phyletic patterns, *Genome Biology* 5(5): R32.

Gu, H., Zhu, P., Jiao, Y., Meng, Y. & Chen, M. (2011). Prin: a predicted rice interactome network, *BMC Bioinformatics* 12(1): 161.

Guo, X. & Hartemink, A. J. (2009). Domain-oriented edge-based alignment of protein interaction networks, *Bioinformatics* 25(12): i240–i246.

Gustafson, A., Snitkin, E., Parker, S., DeLisi, C. & Kasif, S. (2006). Towards the identification of essential genes using targeted genome sequencing and comparative analysis, *BMC Genomics* 7(1): 265.


Hahn, M. W. & Kern, A. D. (2005). Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks, *Molecular Biology and Evolution* 22(4): 803–806.

Hakes, L., Lovell, S. C., Oliver, S. G. & Robertson, D. L. (2007). Specificity in protein interactions and its relationship with sequence diversity and coevolution, *Proceedings of the National Academy of Sciences* 104(19): 7999–8004.

Han, J.-D. J., Bertin, N., Hao, T., Goldberg, D. S., Berriz, G. F., Zhang, L. V., Dupuy, D., Walhout, A. J. M., Cusick, M. E., Roth, F. P. & Vidal, M. (2004). Evidence for dynamically organized modularity in the yeast protein-protein interaction network, *Nature* 430: 88–93.

He, X. & Zhang, J. (2006). Why do hubs tend to be essential in protein networks?, *PLoS Genet* 2(6): e88.

Hirsh, A. E. & Fraser, H. B. (2001). Protein dispensability and rate of evolution, *Nature* 411: 1046–1049.

Hirsh, A. E. & Fraser, H. B. (2003). Genomic function (communication arising): Rate of evolution and gene dispensability, *Nature* 421(6922): 497–498.

Hirsh, E. & Sharan, R. (2007). Identification of conserved protein complexes based on a model of protein network evolution, *Bioinformatics* 23(2): e170–176.

Huang, T.-W., Lin, C.-Y. & Kao, C.-Y. (2007). Reconstruction of human protein interolog network using evolutionary conserved network, *BMC Bioinformatics* 8(1): 152.

Hurst, L. D. & Smith, N. G. (1999). Do essential genes evolve slowly?, *Current biology* 9: 747–750.

Itzhaki, Z., Akiva, E., Altuvia, Y. & Margalit, H. (2006). Evolutionary conservation of domain-domain interactions, *Genome Biology* 7(12): R125.

Jaeger, S., Sers, C. & Leser, U. (2010). Combining modularity, conservation, and interactions of proteins significantly increases precision and coverage of protein function prediction, *BMC Genomics* 11(1): 717.

Jancura, P. & Marchiori, E. (2010). Dividing protein interaction networks for modular network comparative analysis, *Pattern Recognition Letters* 31(14): 2083–2096.

Jancura, P., Mavridou, E., Carrillo-De Santa Pau, E. & Marchiori, E. (2012). A methodology for detecting the orthology signal in a ppi network at a functional complex level, *BMC Bioinformatics* 13(Suppl 1). In press.

Jeong, H., Mason, S. P., Barabasi, A.-L. & Oltvai, Z. N. (2001). Lethality and centrality in protein networks, *Nature* 411: 41–42.

Jonsson, P. F. & Bates, P. A. (2006). Global topological features of cancer proteins in the human interactome, *Bioinformatics* 22(18): 2291–2297.

Jordan, I. K., Rogozin, I. B., Wolf, Y. I. & Koonin, E. V. (2002). Essential genes are more evolutionarily conserved than are nonessential genes in bacteria, *Genome Research* 12(6): 962–968.

Jordan, I. K., Wolf, Y. & Koonin, E. (2003a). Correction: No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly, *BMC Evolutionary Biology* 3(1): 5.

Jordan, I. K., Wolf, Y. & Koonin, E. (2003b). No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly, *BMC Evolutionary Biology* 3(1): 1.

Jothi, R., Cherukuri, P. F., Tasneem, A. & Przytycka, T. M. (2006). Co-evolutionary analysis of domains in interacting proteins reveals insights into domain-domain interactions mediating protein-protein interactions, *Journal of Molecular Biology* 362(4): 861–875.

Kafri, R., Dahan, O., Levy, J. & Pilpel, Y. (2008). Preferential protection of protein interaction network hubs in yeast: Evolved functionality of genetic redundancy, *Proceedings of the National Academy of Sciences* 105(4): 1243–1248.

Kahali, B., Ahmad, S. & Ghosh, T. C. (2009). Exploring the evolutionary rate differences of party hub and date hub proteins in saccharomyces cerevisiae protein-protein interaction network, *Gene* 429(1-2): 18–22.

Kalaev, M., Bafna, V. & Sharan, R. (2008). Fast and accurate alignment of multiple protein networks, *Research in Computational Molecular Biology*, pp. 246–256.

Kelley, B. P., Sharan, R., Karp, R. M., Sittler, T., Root, D. E., Stockwell, B. R. & Ideker, T. (2003). Conserved pathways within bacteria and yeast as revealed by global protein network alignment, *Proceedings of the National Academy of Sciences* 100: 11394–11399.

Kensche, P. R., van Noort, V., Dutilh, B. E. & Huynen, M. A. (2008). Practical and theoretical advances in predicting the function of a protein by its phylogenetic distribution, *Journal of The Royal Society Interface* 5(19): 151–170.

Kim, P. M., Korbel, J. O. & Gerstein, M. B. (2007). Positive selection at the protein network periphery: Evaluation in terms of structural constraints and cellular context, *Proceedings of the National Academy of Sciences* 104(51): 20274–20279.

Kim, P. M., Lu, L. J., Xia, Y. & Gerstein, M. B. (2006). Relating three-dimensional structures to protein networks provides evolutionary insights, *Science* 314(5807): 1938–1941.

Kim, W. K., Bolser, D. M. & Park, J. H. (2004). Large-scale co-evolution analysis of protein structural interlogues using the global protein structural interactome map (psimap), *Bioinformatics* 20(7): 1138–1150.

Kim, W. K. & Marcotte, E. M. (2008). Age-dependent evolution of the yeast protein interaction network suggests a limited role of gene duplication and divergence, *PLoS Comput Biol* 4(11): e1000232.

Kimura, M. (1983). *The Neutral Theory of Molecular Evolution*, Cambridge University Press.

Klau, G. (2009). A new graph-based method for pairwise global network alignment, *BMC Bioinformatics* 10(Suppl 1): S59.

Kolar, M., Lassig, M. & Berg, J. (2008). From protein interactions to functional annotation: graph alignment in herpes, *BMC Systems Biology* 2(1): 90.

Koyutürk, M., Kim, Y., Subramaniam, S., Szpankowski, W. & Grama, A. (2006). Detecting conserved interaction patterns in biological networks, *Journal of Computational Biology* 13(7): 1299–1322.

Koyutürk, M., Kim, Y., Topkara, U., Subramaniam, S., Grama, A. & Szpankowski, W. (2006). Pairwise alignment of protein interaction networks, *Journal of Computational Biology* 13(2): 182–199.

Krylov, D. M., Wolf, Y. I., Rogozin, I. B. & Koonin, E. V. (2003). Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution, *Genome Research* 13(10): 2229–2235.

Kunin, V., Pereira-Leal, J. B. & Ouzounis, C. A. (2004). Functional evolution of the yeast protein interaction network, *Molecular Biology and Evolution* 21(7): 1171–1176.


Lee, S.-A., Chan, C.-h., Tsai, C.-H., Lai, J.-M., Wang, F.-S., Kao, C.-Y. & Huang, C.-Y. (2008). Ortholog-based protein-protein interaction prediction and its application to inter-species interactions, *BMC Bioinformatics* 9(Suppl 12): S11.

Lee, W.-P., Jeng, B.-C., Pai, T.-W., Tsai, C.-P., Yu, C.-Y. & Tzou, W.-S. (2006). Differential evolutionary conservation of motif modes in the yeast protein interaction network, *BMC Genomics* 7(1): 89.

Lehner, B. & Fraser, A. (2004). A first-draft human protein-interaction map, *Genome Biology* 5(9): R63.

Lemos, B., Bettencourt, B. R., Meiklejohn, C. D. & Hartl, D. L. (2005). Evolution of proteins and gene expression levels are coupled in drosophila and are independently associated with mrna abundance, protein length, and number of protein-protein interactions, *Molecular Biology and Evolution* 22(5): 1345–1354.

Li, H., Kristensen, D. M., Coleman, M. K. & Mushegian, A. (2009). Detection of biochemical pathways by probabilistic matching of phyletic vectors, *PLoS ONE* 4(4): e5326.

Li, Y., de Ridder, D., de Groot, M. & Reinders, M. (2008). Metabolic pathway alignment between species using a comprehensive and flexible similarity measure, *BMC Systems Biology* 2(1): 111.

Li, Z., Zhang, S., Wang, Y., Zhang, X.-S. & Chen, L. (2007). Alignment of molecular networks by integer quadratic programming, *Bioinformatics* 23(13): 1631–1639.

Liang, Z., Xu, M., Teng, M. & Niu, L. (2006). Comparison of protein interaction networks reveals species conservation and divergence, *BMC Bioinformatics* 7(1): 457.

Liao, C.-S., Lu, K., Baym, M., Singh, R. & Berger, B. (2009). IsoRankN: spectral methods for global alignment of multiple protein networks, *Bioinformatics* 25(12): i253–258.

Lin, Y.-S., Hwang, J.-K. & Li, W.-H. (2007). Protein complexity, gene duplicability and gene dispensability in the yeast genome, *Gene* 387(1-2): 109–117.

Liu, Z., Liu, Q., Sun, H., Hou, L., Guo, H., Zhu, Y., Li, D. & He, F. (2011). Evidence for the additions of clustered interacting nodes during the evolution of protein interaction networks from network motifs, *BMC Evolutionary Biology* 11(1): 133.

Lo, Y.-S., Chen, Y.-C. & Yang, J.-M. (2010). 3d-interologs: an evolution database of physical protein-protein interactions across multiple genomes, *BMC Genomics* 11(Suppl 3): S7.

Lo, Y.-S., Lin, C.-Y. & Yang, J.-M. (2010). Pcfamily: a web server for searching homologous protein complexes, *Nucleic Acids Research* 38(suppl 2): W516–W522.

Makino, T. & Gojobori, T. (2006). The evolutionary rate of a protein is influenced by features of the interacting partners, *Molecular Biology and Evolution* 23(4): 784–789.

Matthews, L. R., Vaglio, P., Reboul, J., Ge, H., Davis, B. P., Garrels, J., Vincent, S. & Vidal, M. (2001). Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs", *Genome Research* 11(12): 2120–2126.

McDermott, J. E., Taylor, R. C., Yoon, H. & Heffron, F. (2009). Bottlenecks and hubs in inferred networks are important for virulence in salmonella typhimurium, *Journal of Computational Biology* 16: 169–180.

Michaut, M., Kerrien, S., Montecchi-Palazzi, L., Chauvat, F., Cassier-Chauvat, C., Aude, J.-C., Legrain, P. & Hermjakob, H. (2008). Interoporc: automated inference of highly conserved protein interaction networks, *Bioinformatics* 24(14): 1625–1631.

Mika, S. & Rost, B. (2006). Protein-protein interactions more conserved within species than across species, *PLoS Comput Biol* 2(7): e79.

Mintseris, J. & Weng, Z. (2005). Structure, function, and evolution of transient and obligate protein-protein interactions, *Proceedings of the National Academy of Sciences* 102(31): 10930–10935.

Mirzarezaee, M., Araabi, B. & Sadeghi, M. (2010). Features analysis for identification of date and party hubs in protein interaction network of saccharomyces cerevisiae, *BMC Systems Biology* 4(1): 172.

Moyle, W. R., Campbell, R. K., Myers, R. V., Bernard, M. P., Han, Y. & Wang, X. (1994). Co-evolution of ligand-receptor pairs, *Nature* 368(6468): 251–255.

Narayanan, M. & Karp, R. M. (2007). Comparing protein interaction networks via a graph match-and-split algorithm, *Journal of Computational Biology* 14(7): 892–907.

Nooren, I. M. & Thornton, J. M. (2003). Diversity of protein-protein interactions, *EMBO J* 22(14): 3486–3492.

Pal, C., Papp, B. & Hurst, L. D. (2003). Genomic function (communication arising): Rate of evolution and gene dispensability, *Nature* 421(6922): 496–497.

Pal, C., Papp, B. & Lercher, M. J. (2006). An integrated view of protein evolution, *Nat Rev Genet* 7: 337–348.

Pang, K., Cheng, C., Xuan, Z., Sheng, H. & Ma, X. (2010). Understanding protein evolutionary rate by integrating gene co-expression with protein interactions, *BMC Systems Biology* 4(1): 179.

Pang, K., Sheng, H. & Ma, X. (2010). Understanding gene essentiality by finely characterizing hubs in the yeast protein interaction network, *Biochemical and Biophysical Research Communications* 401(1): 112–116.

Park, K. & Kim, D. (2009). Localized network centrality and essentiality in the yeast-protein interaction network, *PROTEOMICS* 9(22): 5143–5154.

Pavithra, S. R., Kumar, R. & Tatu, U. (2007). Systems analysis of chaperone networks in the malarial parasite plasmodium falciparum, *PLoS Comput Biol* 3(9): e168.

Pazos, F. & Valencia, A. (2001). Similarity of phylogenetic trees as indicator of protein-protein interaction, *Protein Engineering* 14(9): 609–614.

Pazos, F. & Valencia, A. (2008). Protein co-evolution, co-adaptation and interactions, *EMBO J* 27(20): 2648–2655.

Pedamallu, C. S. & Posfai, J. (2010). Open source tool for prediction of genome wide protein-protein interaction network based on ortholog information, *Source Code for Biology and Medicine* 5(1): 8.

Pinter, R. Y., Rokhlenko, O., Yeger-Lotem, E. & Ziv-Ukelson, M. (2005). Alignment of metabolic pathways, *Bioinformatics* 21(16): 3401–3408.

Plotkin, J. B. & Fraser, H. B. (2007). Assessing the determinants of evolutionary rates in the presence of noise, *Molecular Biology and Evolution* 24(5): 1113–1121.

Qian, W., He, X., Chan, E., Xu, H. & Zhang, J. (2011). Measuring the evolutionary rate of protein-protein interaction, *Proceedings of the National Academy of Sciences* 108(21): 8725–8730.

Qian, X., Sze, S.-H. & Yoon, B.-J. (2009). Querying Pathways in Protein Interaction Networks Based on Hidden Markov Models, *Journal of Computational Biology* 16(2): 145–157.

Rocha, E. P. C. & Danchin, A. (2004). An analysis of determinants of amino acids substitution rates in bacterial proteins, *Molecular Biology and Evolution* 21(1): 108–116.

Saeed, R. & Deane, C. (2006). Protein protein interactions, evolutionary rate, abundance and age, *BMC Bioinformatics* 7(1): 128.


Ulitsky, I. & Shamir, R. (2007). Pathway redundancy and protein essentiality revealed in the

A Survey on Evolutionary Analysis in PPI Networks 455

Vergassola, M., Vespignani, A. & Dujon, B. (2005). Cooperative evolution in protein complexes

von Mering, C., Zdobnov, E. M., Tsoka, S., Ciccarelli, F. D., Pereira-Leal, J. B., Ouzounis, C. A.

modules, *Proceedings of the National Academy of Sciences* 100(26): 15428–15433. Walhout, A. J. M., Sordella, R., Lu, X., Hartley, J. L., Temple, G. F., Brasch, M. A., Thierry-Mieg,

Wall, D. P., Hirsh, A. E., Fraser, H. B., Kumm, J., Giaever, G., Eisen, M. B. & Feldman, M. W.

Wang, Z. & Zhang, J. (2009). Why is the correlation between gene importance and gene

Watanabe, R., Morett, E. & Vallejo, E. (2008). Inferring modules of functionally interacting proteins using the bond energy algorithm, *BMC Bioinformatics* 9(1): 285. Waterhouse, R. M., Zdobnov, E. M. & Kriventseva, E. V. (2011). Correlating traits of

Wernicke, S. & Rasche, F. (2007). Simple and fast alignment of metabolic pathways by

Williams, E. J. B. & Hurst, L. D. (2000). The proteins of linked genes evolve at similar rates,

Wolf, Y. I., Carmel, L. & Koonin, E. V. (2006). Unifying measures of gene function and evolution, *Proceedings of the Royal Society B: Biological Sciences* 273(1593): 1507–1515. Wuchty, S. (2004). Evolution and topology in the yeast protein interaction network, *Genome*

Wuchty, S., Barabasi, A.-L. & Ferdig, M. (2006). Stable evolutionary signal in a yeast protein

Wuchty, S., Oltvai, Z. N. & Barabási, A.-L. (2003). Evolutionary conservation of motif

Yamada, T., Kanehisa, M. & Goto, S. (2006). Extraction of phylogenetic network modules from

Yang, Q. & Sze, S.-H. (2007). Path matching and graph matching in biological networks,

Yang, Z. & Nielsen, R. (2000). Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models, *Molecular Biology and Evolution* 17(1): 32–43. Yeang, C.-H. & Haussler, D. (2007). Detecting coevolution in and among protein domains,

Yu, H., Braun, P., Yildirim, M. A., Lemmens, I., Venkatesan, K., Sahalie, J.,

Hirozane-Kishikawa, T., Gebreab, F., Li, N., Simonis, N., Hao, T., Rual, J.-F., Dricot, A., Vazquez, A., Murray, R. R., Simon, C., Tardivo, L., Tam, S., Svrzikapa, N., Fan, C., de Smet, A.-S., Motyl, A., Hudson, M. E., Park, J., Xin, X., Cusick, M. E., Moore, T.,

constituents in the yeast protein interaction network, *Nature Genetics* 35(2): 176–179.

of yeast from comparative analyses of its interaction network, *PROTEOMICS*

& Bork, P. (2003). Genome evolution reveals biochemical networks and functional

N. & Vidal, M. (2000). Protein interaction mapping in c. elegans using proteins

(2005). Functional genomic analysis of the rates of protein evolution, *Proceedings of*

gene retention, sequence divergence, duplicability and essentiality in vertebrates,

saccharomyces cerevisiae interaction networks, *Mol Syst Biol* 3: 1–7.

Vespignani, A. (2003). Evolution thinks modular, *Nature Genetics* 35(2): 118–119.

involved in vulval development, *Science* 287(5450): 116–122.

*the National Academy of Sciences* 102(15): 5483–5488.

evolutionary rate so weak?, *PLoS Genet* 5(1): e1000329.

arthropods, and fungi, *Genome Biology and Evolution* 3: 75–86.

exploiting local diversity, *Bioinformatics* 23(15): 1978–1985.

interaction network, *BMC Evolutionary Biology* 6(1): 8.

the metabolic network, *BMC Bioinformatics* 7(1): 130.

*Journal of Computational Biology* 14(1): 56–67.

5(12): 3116–3119.

*Nature* 407(6806): 900–903.

*Research* 14(7): 1310–1314.

*PLoS Comput Biol* 3(11): e211.


28 Will-be-set-by-IN-TECH

Saeed, R. & Deane, C. (2008). An assessment of the uses of homologous interactions,

Schuster-Bockler, B. & Bateman, A. (2007). Reuse of structural domain-domain interactions in

Schwartz, A. S., Yu, J., Gardenour, K. R., Finley Jr, R. L. & Ideker, T. (2009). Cost-effective

Seidl, M. & Schultz, J. (2009). Evolutionary flexibility of protein complexes, *BMC Evolutionary*

Sharan, R. & Ideker, T. (2006). Modeling cellular machinery through biological network

Sharan, R., Ideker, T., Kelley, B. P., Shamir, R. & Karp, R. M. (2005). Identification of protein

Sharan, R., Suthram, S., Kelley, R. M., Kuhn, T., McCuine, S., Uetz, P., Sittler, T., Karp, R. M.

Singh, R., Xu, J. & Berger, B. (2008a). Global alignment of multiple protein interaction

Singh, R., Xu, J. & Berger, B. (2008b). Global alignment of multiple protein interaction

Snel, B. & Huynen, M. A. (2004). Quantifying modularity in the evolution of biomolecular

Suthram, S., Sittler, T. & Ideker, T. (2005). The plasmodium protein network diverges from

Tan, K., Shlomi, T., Feizi, H., Ideker, T. & Sharan, R. (2007). Transcriptional regulation of

Tatusov, R. L., Koonin, E. V. & Lipman, D. J. (1997). A genomic perspective on protein families,

Theis, F. J., Latif, N., Wong, P. & Frishman, D. (2011). Complex principal component and

Tian, W. & Samatova, N. F. (2009). Pairwise alignment of interaction networks by fast

Tillier, E. R. & Charlebois, R. L. (2009). The human protein coevolution network, *Genome*

Tirosh, I. & Barkai, N. (2005). Computational verification of protein-protein interactions by

Tuller, T., Kupiec, M. & Ruppin, E. (2009). Co-evolutionary networks of genes and cellular

networks, *Pacific Symposium on Biocomputing* 13: 303–314.

*National Academy of Sciences* 105(35): 12763–12768.

those of other eukaryotes, *Nature* 438(7064): 108–112.

orthologous co-expression, *BMC Bioinformatics* 6(1): 40.

processes across fungal species, *Genome Biology* 10(5): R48.

systems, *Genome Research* 14(3): 391–397.

*Sciences* 104(4): 1283–1288.

*Science* 278(5338): 631–637.

*Research* 19(10): 1861–1871.

28(9): 2501–2512.

14: 99–110.

complexes by comparative analysis of yeast and bacterial protein interaction data,

& Ideker, T. (2005). From the Cover: Conserved patterns of protein interaction in multiple species, *Proceedings of the National Academy of Sciences* 102(6): 1974–1979. Singh, R., Xu, J. & Berger, B. (2007). Pairwise global alignment of protein interaction networks

by matching neighborhood topology, *Research in Computational Molecular Biology*

networks with application to functional orthology detection, *Proceedings of the*

protein complexes within and across species, *Proceedings of the National Academy of*

correlation structure of 16 yeast genomic variables, *Molecular Biology and Evolution*

identification of maximal conserved patterns, *Pacific Symposium on Biocomputing*

strategies for completing the interactome, *Nat Meth* 6(1): 55–61.

*Bioinformatics* 24(5): 689–695.

*Biology* 9(1): 155.

pp. 16–31.

protein networks, *BMC Bioinformatics* 8(1): 259.

comparison, *Nature Biotechnology* 24(4): 427–433.

*Journal of Computional Biology* 12(6): 835–846.


**22** 



## **Scalable, Integrative Analysis and Visualization of Protein Interactions**

David Otasek¹, Chiara Pastrello²,¹ and Igor Jurisica¹,³

*¹Ontario Cancer Institute and the Campbell Family Institute for Cancer Research, University Health Network, Toronto, Ontario, Canada; ²CRO Aviano, National Cancer Institute, Aviano, Italy; ³Department of Computer Science and Medical Biophysics, University of Toronto, Toronto, Canada*

## **1. Introduction**


Biology offers a diversity of problems, leading to many computational biology workflows, including tasks where network visualization is helpful to interpret and analyse data. High-throughput screening techniques generate large amounts of data useful for understanding the biological mechanisms underlying different diseases. The need for agile tools to handle such data and analyse it correctly has become increasingly evident.

Individual network visualization systems differ greatly in terms of the features and standards they support, and consequently the analyses they enable. Importantly, users have a broad range of skills and expectations, ranging from biology to computational biology. As a result, network visualization tools must satisfy diverse requirements and thus offer different user interfaces and features. In this role, they are also fundamental in helping scientists in different fields integrate their knowledge and their data in an interdisciplinary approach to research.

The number of '-omics' disciplines that use high-throughput techniques and that can benefit from a network approach is increasing. The diverse data that can be represented as a graph include physical protein-protein interactions (PPIs), metabolic networks (Swainston et al., 2011), genetic co-expression (Helaers et al., 2011), gene regulatory networks (Longabaugh, 2012), microRNA-target associations (Shirdel et al., 2011) and drug-target associations (Morrow et al., 2010). In this chapter we focus on physical PPIs.

Proteins are key players in virtually all biological events that take place within and between cells and often accomplish their function as part of large molecular machines, whose action is coordinated through intricate regulatory networks of transient PPIs. The understanding of the interrelationships between molecules is the basis for an understanding of the behaviour of biological systems (Stein et al., 2011).


The analysis of the full proteome is possible with techniques such as mass spectrometry and protein microarrays, which can be integrated with targeted approaches such as yeast two-hybrid screens, immunoprecipitation and affinity purification. So far, PPI discovery methods are not accurate enough to be used alone, but the combination of different techniques can help to build an accurate interactome map (Remmerie et al., 2011). Still, this kind of analysis can only indicate that two proteins interact; it does not reveal the molecular details or the mechanism of binding captured in high-resolution three-dimensional (3D) structures, in which individual residue contacts are resolved and the interaction interfaces characterized. Moreover, these methods do not capture transient interactions and post-translational modifications (PTMs), which can be addressed by techniques such as immobilized metal affinity chromatography (IMAC) mass spectrometry for protein phosphorylation analysis.

It becomes evident that the analysis of protein interactions is already a huge field with a plethora of data coming from different sources that can be improved by computational techniques and integrative network visualization and analysis. It is even more interesting to integrate PPI data with protein-target interaction data to have a wider view of the environmental context that influences network operations.

In this context, a pathway-centric analysis can help to elucidate the role and the importance of proteins in the context of the cell environment, specifically when the pathways can be related to the process or disease being studied. However, it is essential to be aware of the limits of this analysis, which arise from the cross-talk among pathways: a single protein can be associated or interact with multiple pathways, so no pathway can be considered a single actor but rather a piece of a bigger puzzle (Kreeger & Lauffenburger, 2010).

Another intriguing aspect that protein-target interactions can describe is the relationship between proteins and exogenous molecules like drugs or toxins (Yu, 2011). The analysis of networks generated from drug-target and protein-target interactions can highlight different molecules that may be responsible for the response or resistance to a certain drug, as well as alternative drugs that can target disease-specific proteins.

## **2. Network visualization tools**

There are dozens of applications available for the visualization of biological networks, each with its own focus, workflow and tools (Pavlopoulos et al., 2008; Gehlenborg et al., 2010). We will describe some of the most common features and workflows involved in using these applications, with a brief discussion of NAViGaTOR (Brown et al., 2009; McGuffin and Jurisica, 2009; Djebbari et al., 2011), Cytoscape (Smoot et al., 2011), VANTED (Björn et al., 2006) and VisANT (Mellor et al., 2004), four popular multi-platform biological network visualization applications.

## **2.1 Biological networks as annotated graphs**

The most basic mathematical structure common to all of these applications is the graph, a collection of objects connected by links, referred to as nodes and edges. These objects are abstractions of real-world biological entities, where nodes could represent proteins, genes, molecules, drugs, etc. and edges could represent physical protein-protein, metabolic, or genetic interactions, microRNA to target associations, correlation, similarity relationships, etc. Edges can be directed or undirected, weighted or not. In a case like gene regulation, Gene A may regulate Gene B, but the relationship may not be symmetric, meaning Gene B does not necessarily regulate Gene A. These models of biological networks have differing levels of support across various applications. An application may only support a small subset of node and edge types in order to specialize in one particular model, such as VisANT, which integrates many specialized tools for tasks such as Gene Ontology (GO) annotation, name resolution and online searches. Other applications may be more open-ended to provide support for as many models as possible, such as NAViGaTOR, Cytoscape and VANTED. The advantage of the open-ended approach is versatility, but it comes at the cost of having to manually define the nature of each node or edge via annotation.
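The annotated graph model described above can be sketched in a few lines of Python. This is only an illustration of the data structure, not the internal representation of any of the applications discussed; the class names (`Graph`, `Edge`), the method `connects` and the annotation keys (`kind`, `interaction`) are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    source: str
    target: str
    directed: bool = False   # False: the relationship holds in both directions
    weight: float = 1.0      # e.g. a confidence score
    annotations: dict = field(default_factory=dict)


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node id -> annotation dict
    edges: list = field(default_factory=list)

    def add_node(self, node_id, **annotations):
        self.nodes.setdefault(node_id, {}).update(annotations)

    def add_edge(self, source, target, directed=False, weight=1.0, **annotations):
        # Nodes mentioned by an edge are created on demand.
        self.add_node(source)
        self.add_node(target)
        self.edges.append(Edge(source, target, directed, weight, annotations))

    def connects(self, a, b):
        """True if an edge leads from a to b; undirected edges count both ways."""
        for e in self.edges:
            if e.source == a and e.target == b:
                return True
            if not e.directed and e.source == b and e.target == a:
                return True
        return False


g = Graph()
g.add_node("GeneA", kind="gene")
g.add_node("GeneB", kind="gene")
# Directed edge: Gene A regulates Gene B, but not the reverse.
g.add_edge("GeneA", "GeneB", directed=True, interaction="regulates")
# Undirected, weighted edge: a physical interaction between two proteins.
g.add_edge("ProtX", "ProtY", weight=0.9, interaction="physical PPI")
```

Because the nature of each node and edge lives in free-form annotations, the same structure can hold regulatory, metabolic or physical interaction data, which is the versatility (and the annotation burden) noted above.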

To populate a graph within an application, the application must support one or more input formats. Often, the most basic level of input is either plain text or spreadsheet files such as Excel XLS format. For more graph-specific data, such as layout, GML can be used. To support more complex and structured biological data, several community standards exist: PSI-MI, BioPAX, and SBML.

Adding new nodes and edges to an existing graph can generally be done manually or by importing further interactions from a supported database or file format. Some applications may have a workspace that supports multiple concurrent graphs, which can then be combined or compared in various ways. Cytoscape and NAViGaTOR both support this type of workspace.
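Combining and comparing graphs can be thought of as set operations on their edges. The following minimal sketch assumes undirected interactions identified only by their two endpoint names; the helper `undirected_edge_set` and the protein identifiers are hypothetical, and real applications would also need to merge node and edge annotations.

```python
def undirected_edge_set(pairs):
    """Store each interaction as a frozenset so that (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in pairs}


graph_a = undirected_edge_set([("P1", "P2"), ("P2", "P3")])
graph_b = undirected_edge_set([("P2", "P1"), ("P3", "P4")])

combined = graph_a | graph_b   # union: every interaction seen in either graph
shared = graph_a & graph_b     # intersection: interactions supported by both
only_a = graph_a - graph_b     # difference: interactions unique to graph_a
```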

Once the graph is loaded within an application, a researcher may wish to add additional annotations, such as gene or protein expression, experimental confidence measures or Gene Ontology terms (The Gene Ontology Consortium, 2000), to their graph objects. Data from in-house sources must generally conform to the application used; typically, this takes the form of spreadsheets or text data with varying degrees of format flexibility. The researcher can also call upon more specialized data from public databases, such as UniProt, Entrez, KEGG or GenBank, either through the import of files or through direct access to the database via the application or a plug-in.

The number of biological networks available to the researcher is ever expanding, and the size of the networks involved in many types of analysis is on the order of thousands of nodes and edges. For example, the yeast interactome comprises 23,918 interactions according to DIP and 152,877 known and predicted interactions in I2D, the Interologous Interaction Database (http://ophid.utoronto.ca/i2d), an integrated database of PPIs from curated databases, experimental sources and predicted interactions (Niu et al., 2010; Brown and Jurisica, 2007; Brown and Jurisica, 2005). While the researcher may only be interested in a small portion of the network in question, the scalability of an individual application and its analysis methods to networks of such size can be a considerable advantage.

## **2.2 Network visualization**

Part of the challenge of visualizing a network is laying out the graph in a comprehensible manner. For smaller graphs, manual editing of node positions may be sufficient. For graphs on the order of thousands of interactions, such as those mentioned above, more robust tools are available with which to lay out a graph. Automated graph layout algorithms, such as force-directed and hierarchical layouts, make the process easier, but often produce messy, uninterpretable graphs. Manual control over the placement of nodes and specialized tools for doing so are often necessary, from simple movement of single nodes to alignments in circles and lines to manipulate groups of nodes.

Algorithms for graph analysis are generally included in each application. Here, the number and type of analyses available vary widely. Algorithms can be used to find important graph properties, such as node degree, centrality, shortest paths, cliques and clusters. In addition, diverse biology-specific algorithms exist, such as GeneMANIA (Montojo et al., 2010). Some applications may be designed specifically for one type of analysis, while others contain a variety of analysis methods and in some cases allow for the addition of third-party methods through plug-ins (NAViGaTOR, Cytoscape) or scripting languages (VisANT).
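Two of the graph properties mentioned above, node degree and shortest paths, are simple enough to sketch directly. The code below is an illustration on a small hypothetical undirected network (the protein names `P1` to `P5` are made up), using breadth-first search for unweighted shortest paths; it does not reproduce the implementation of any particular application.

```python
from collections import deque


def build_adjacency(edges):
    """Turn a list of undirected (a, b) pairs into {node: set of neighbours}."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def node_degrees(adjacency):
    """Degree = number of distinct interaction partners of each node."""
    return {node: len(neighbours) for node, neighbours in adjacency.items()}


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    if start not in adjacency or goal not in adjacency:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(adjacency[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"),
         ("P1", "P5"), ("P5", "P4"), ("P2", "P5")]
adjacency = build_adjacency(edges)
degrees = node_degrees(adjacency)
path = shortest_path(adjacency, "P1", "P4")
```

In this toy network `P2` and `P5` have the highest degree (three partners each), and the shortest route from `P1` to `P4` passes through a single intermediate node.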

How an application chooses to visualize a graph is also variable. Nodes can be represented as anything from basic geometric shapes with variable size, color and transparency to application-specific or user-supplied bit-mapped images (Cytoscape, VANTED) or even other data visualizations such as bar charts (VANTED, VisANT). Edges can be straight, curved, displayed with various dot or dash schemes and can have variable widths, colors and transparencies. To make certain attributes readily visible, it is also possible in some instances to map an attribute to a visual property, such as color or size. All four of our example applications have different implementations of such mapping; the utility of a specific implementation is dependent upon the needs and competencies of the individual researcher.
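Mapping an attribute to a visual property usually amounts to rescaling a numeric value into a display range. A minimal sketch of such a linear mapping follows; the function name `linear_map` and the chosen ranges (edge widths of 1 to 5 units, opacity of 0.2 to 1.0) are assumptions for illustration and do not correspond to the settings of any specific tool.

```python
def linear_map(value, in_min, in_max, out_min, out_max):
    """Rescale value from [in_min, in_max] to [out_min, out_max], clamping outliers."""
    if in_max == in_min:
        return out_min
    clamped = min(max(value, in_min), in_max)
    fraction = (clamped - in_min) / (in_max - in_min)
    return out_min + fraction * (out_max - out_min)


# Hypothetical interaction confidences in the range 0 to 1.
confidences = {("PRO1", "P001"): 0.95, ("PRO1", "P002"): 0.50}

edge_styles = {
    edge: {
        "width": linear_map(conf, 0.0, 1.0, 1.0, 5.0),
        "opacity": linear_map(conf, 0.0, 1.0, 0.2, 1.0),
    }
    for edge, conf in confidences.items()
}
```

With this mapping, higher-confidence interactions are drawn as thicker, more opaque edges, so the most reliable parts of the network stand out visually.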

Once the graph satisfies the requirements envisioned by the researcher, its state must be stored or exported. Proprietary formats are generally the norm for most programs, as visualization and data are often application-specific and must be stored for later editing. Export formats often take the form of community standards (PSI-MI, BioPAX) and graphical exports. Graphical export is generally the final stage before publication. Usually, this can be done in bitmap (JPEG, TIFF, PNG, etc.) or vector (SVG, PDF, etc.) formats, the latter being preferable for publication, as vector graphics can be resized and manipulated without loss of quality.

## **2.3 NAViGaTOR**

NAViGaTOR (Network Analysis, Visualization and Graphing Toronto; http://ophid.utoronto.ca/navigator) is a network and graph visualization application with an emphasis on large graphs with integrated data (Brown et al., 2009). Data can be imported using diverse formats, ranging from community standards such as PSI-MI XML (Kerrien et al., 2007), BioPAX (Demir et al., 2010) or GML (Himsolt, 1996), to user-defined text files. Though the application is geared towards protein-protein interactions, the graph implementation within NAViGaTOR is not PPI-specific, and can be used to model many types of real-world or theoretical objects. Nodes and edges can have data associated with them, from simple numeric or text data to structured XML. Once imported, graphs can be combined from within a multi-graph workspace using combinations of cut, copy and paste operations. Additional data for the annotation of existing graphs can be imported using compatible files or online resources, such as I2D, cPath, or one of the many online databases implementing the PSICQUIC web service.

Graphs generated by the above methods can quickly increase in size to thousands of nodes and edges. NAViGaTOR was designed with networks of this size in mind. While graphs this size do create a demand for both memory and processing power to render, lay out and navigate, the conservation of important paths and data is important to end-user analysis, particularly since most graphs of interest are subsets of much larger interaction networks. NAViGaTOR approaches the problem of limited computing resources through the combination of a powerful OpenGL rendering engine through JOGL, and a suite of efficient layout, search and analysis tools. The JOGL rendering system gives the application access to the graphic processing power of the OpenGL-compliant hardware of most graphics cards, allowing the application to use the CPU for more intensive graph operations.

NAViGaTOR supports several layout algorithms tailored for large graphs, including GRIP (Graph Drawing with Intelligent Placement) and several variants of the force-directed algorithm. These algorithms come in both single and multi-threaded modes to take advantage of computers with multi-core CPUs.
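To give a sense of what a force-directed layout does, the following is a minimal single-threaded sketch in the style of the classic Fruchterman-Reingold scheme: all nodes repel each other, connected nodes attract, and a cooling "temperature" limits movement so the layout settles. It is not NAViGaTOR's implementation and is not tuned for large graphs (the repulsion step is quadratic in the number of nodes); the function name, parameters and defaults are assumptions for this example.

```python
import math
import random


def force_directed_layout(nodes, edges, iterations=200, size=1.0, seed=0):
    """Return {node: (x, y)} for a list of nodes and a list of (a, b) edges."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(0, size), rng.uniform(0, size)] for n in nodes}
    k = size / math.sqrt(max(len(nodes), 1))  # preferred distance between nodes
    temperature = size / 10                   # maximum movement per iteration

    def delta(a, b):
        """Unit vector from b to a, plus the distance between them."""
        dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
        dist = math.hypot(dx, dy) or 1e-9     # avoid division by zero
        return dx / dist, dy / dist, dist

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Every pair of nodes repels, more strongly at short range.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                ux, uy, dist = delta(a, b)
                force = k * k / dist
                disp[a][0] += ux * force
                disp[a][1] += uy * force
                disp[b][0] -= ux * force
                disp[b][1] -= uy * force
        # Connected nodes attract, more strongly at long range.
        for a, b in edges:
            ux, uy, dist = delta(a, b)
            force = dist * dist / k
            disp[a][0] -= ux * force
            disp[a][1] -= uy * force
            disp[b][0] += ux * force
            disp[b][1] += uy * force
        # Move each node along its net force, capped by the temperature.
        for n in nodes:
            length = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            step = min(length, temperature)
            pos[n][0] += disp[n][0] / length * step
            pos[n][1] += disp[n][1] / length * step
        temperature *= 0.95                   # cool down so the layout settles
    return {n: (x, y) for n, (x, y) in pos.items()}


positions = force_directed_layout(
    ["A", "B", "C", "D"], [("A", "B"), ("B", "C"), ("C", "D")]
)
```

The all-pairs repulsion loop is the expensive part, which is one reason tools aimed at large graphs rely on more scalable or multi-threaded variants.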

When the structure and data contained within a graph are sufficient, the user can then interact with the graph, identifying significant nodes, edges or subsets of the graph using a variety of searches, spreadsheet tables and algorithms. Online or file supported databases can also be used to indicate known pathways and complexes within the data.

Users can highlight interesting structures within a graph with a variety of methods. Nodes and edges can be assigned visual properties to differentiate them from each other. Nodes can be given different colors, sizes, and highlighting styles. Edges can be given different colors, widths and styles and have the option to be rendered as user-adjustable curves. Transparency can be used on both nodes and edges to either increase or decrease the visibility of graph objects.

The user can save the file in native NAViGaTOR format, GML, PSI-MI or delimited plain text. In addition, for presentation or publication purposes, the graph can be exported to one of several graphical formats, including JPEG, PNG, TIFF, SVG and PDF.

## **3. Iterative expansion of a protein interaction network**

The increasing amount of data that can be collected from high-throughput analyses is accelerating research in the field of molecular biology; however, data of this type is also challenging to work with because of its size. It can be used either for knowledge-based targeted analyses, which aim to improve the understanding of the role of an important, well-known player in a specific field of interest (for example, BRCA1 in breast cancer), or for unbiased analyses, which aim to understand the processes involved in a specific behaviour without a priori knowledge (for example, which genes or proteins are responsible for the poor survival of patients with pancreatic cancer?).

For our example, we have a list of potential interactors for a hypothetical protein of interest, PRO1, generated by computational PPI prediction. Also at our disposal are two meta-analysis efforts, specifying the number of ovarian or prostate cancer-related studies in which the gene and its interactors were found to be significantly deregulated. All other data will be collected from publicly available resources, including a PPI database and a catalogue of drugs and their gene targets.

For our example, we will start with our experimental data in a tabular format. Data such as this can be obtained from any number of sources, from high-throughput experiments to computational predictions. In our case, we have 21,302 predicted PPIs. Our analysis has produced a confidence metric associated with each interaction, ranging from 0 to 1.0. This confidence metric can be used to reduce the number of interactions we are dealing with to a more manageable size by removing lower confidence interactions. Our cut-off for high confidence will be 0.892, a value determined by cross validation. This leaves us with only 39 interactions, a far more manageable number for the next analysis steps. More complex filtering can be done through a simple spreadsheet application, such as Excel, or with a mathematical application such as R or Matlab.

At this point, we translate this data into a pair-wise table of PPIs, and import this table into NAViGaTOR. While NAViGaTOR supports several formats for loading interactions, we have chosen the tab-delimited format to facilitate easy translation from our original data. Other interaction data sets can be imported using community standard file formats, such as BioPAX, GML, PSI-MI XML and PSI-MITAB. Though these formats are harder to construct, they can contain more structured data, and facilitate easier data interchange among diverse programs and databases.
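The filtering and translation steps described above can be sketched in a few lines of Python. The column names, the example rows and the choice to keep interactions scoring exactly at the cut-off are assumptions made for illustration; only the cut-off value of 0.892 comes from the text.

```python
import csv
import io

CUTOFF = 0.892  # high-confidence threshold quoted in the text


def filter_by_confidence(rows, cutoff=CUTOFF):
    """Keep (protein_a, protein_b, confidence) rows scoring at or above the cutoff."""
    return [(a, b, c) for a, b, c in rows if c >= cutoff]


def to_tab_delimited(rows):
    """Serialise interactions as a pair-wise, tab-delimited table with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["protein_a", "protein_b", "confidence"])  # hypothetical headers
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical predictions standing in for the 21,302 predicted PPIs.
predicted = [
    ("PRO1", "P_A", 0.95),
    ("PRO1", "P_B", 0.40),
    ("PRO1", "P_C", 0.892),
    ("PRO1", "P_D", 0.891),
]
high_confidence = filter_by_confidence(predicted)
table_text = to_tab_delimited(high_confidence)
print(table_text)
```

Keeping the confidence value as a third column means it remains available after import for mapping to visual attributes.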

Fig. 1. Example graph containing hypothetical protein PRO1, with interactors loaded from experimental data. Tabular view of the data is available as supplemental material (http://www.cs.utoronto.ca/~juris/data/intech12/).

Loading our pair-wise data, we get a very basic view (Figure 1). The visualization of this network at this stage is a spoke diagram with PRO1 in the center, and offers little information to the researcher that could not have been seen in a simple spreadsheet. We already have data regarding the 39 interactions in the form of the confidence metric imported from our initial study. This can be mapped to one or more visual attributes using NAViGaTOR's filter framework. In this case, we can make the highest confidence interactions more visible by applying a filter that maps confidence to both edge width and transparency (Figure 2).
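Conceptually, such a filter is a function from a data value to a visual attribute. The sketch below uses a clamped linear mapping; the output ranges (widths of 1 to 6 and opacities of 0.3 to 1.0) and the function names are assumptions for illustration and do not describe NAViGaTOR's filter framework itself.

```python
def linear_map(value, in_min, in_max, out_min, out_max):
    """Linearly rescale value from [in_min, in_max] to [out_min, out_max], clamping at the ends."""
    if in_max == in_min:
        return out_max
    t = (value - in_min) / (in_max - in_min)
    t = max(0.0, min(1.0, t))
    return out_min + t * (out_max - out_min)


def edge_style(confidence, cutoff=0.892):
    """Map a confidence in [cutoff, 1.0] to a hypothetical edge width and opacity."""
    width = linear_map(confidence, cutoff, 1.0, 1.0, 6.0)
    opacity = linear_map(confidence, cutoff, 1.0, 0.3, 1.0)
    return {"width": round(width, 2), "opacity": round(opacity, 2)}


weakest = edge_style(0.892)   # thinnest, most transparent edge
strongest = edge_style(1.0)   # widest, fully opaque edge
```

Because all 39 retained interactions score between the cut-off and 1.0, rescaling over that interval rather than over the full 0 to 1.0 range spreads the visual differences more evenly.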

Fig. 2. Example graph with experimental interaction confidence mapped to edge width and transparency.

Fig. 3. Example graph enriched with interactions loaded from I2D, and laid out using the GRID algorithm.

This is better, but still not that much more informative. One way of enriching our isolated data is by viewing it in the context of known and predicted interactions. I2D, the Interologous Interaction Database (http://ophid.utoronto.ca/i2d; Brown et al., 2005; Brown et al., 2007), will be our source for these interactions. NAViGaTOR offers an I2D plug-in, which enables the researcher to easily add interactions to the existing graph. NAViGaTOR also has the PSICQUIC search plug-in, which supports the searching of databases that implement the PSICQUIC interface (Aranda et al., 2011). To further support the openness and versatility of PPI integration, NAViGaTOR can import additional interactions from the same file formats listed above. If a database does not support any of these formats, finding or building a representation of the database in tab-delimited format may be an option as well. Our interaction search returns 1,367 nodes and 3,192 edges (Figure 3).

At this point, the graph has become more complex, and the force-directed layout is not helpful in interpreting it. Several options exist at this point for manually laying out objects in the graph. The user can select 'fix' nodes within the graph and either move them manually (which would be very labor intensive and inflexible) or lay them out with an array of tools such as linear, circular, arc or radial layout. We will use the radial layout method, starting with PRO1 as our central node and extending to a depth of 2. This gives us a hierarchical arrangement of nodes starting with PRO1 in the centre, with its immediate interactors arranged circularly around it, and their interactors in turn arranged around them (Figure 4).

Fig. 4. Example graph laid out hierarchically, with PRO1 in a central position.

## **3.1 Ambiguity of protein names**

When combining data from different sources, the users' choice of protein nomenclature becomes extremely important. Although a researcher knows which genes or proteins they are referring to, queries to a database require additional levels of specificity to resolve ambiguities in entity names.

For example, DLC1 has the following SwissProt identifiers: Q96QB1, Q9Y238, P63167, Q7Z5R8, Q45XF9, Q86UC6. However, names in literature could be ambiguous and confusing, potentially resulting in incorrect interpretation and analyses:

• **DLC1** (ARHGAP7) (KIAA1723) (STARD12) [**Rho GTPase-activating protein 7** (Rho-type GTPase-activating protein 7) (Deleted in liver cancer 1 protein) (Dlc-1) (StAR-related lipid transfer protein 12) (START domain-containing protein 12) (StARD12) (HP protein)]

• **DLEC1** (DLC1) [**Deleted in lung and esophageal cancer protein 1** (Deleted in lung cancer protein 1) (DLC-1)]

• **DLC1** [**DLC1** protein]

• **DYNLL1** (DLC1) (DNCL1) (DNCLC1) (HDLC1) [**Dynein light chain 1, cytoplasmic** (Dynein light chain LC8-type 1) (8 kDa dynein light chain) (DLC8) (Protein inhibitor of neuronal nitric oxide synthase) (PIN)]

• **DLC1** [**Deleted in liver cancer 1** variant 2 (Fragment)]
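The practical consequence of the DLC1 example is that a name lookup should return every candidate rather than the first match. The sketch below builds a small alias index from three of the entries named in the text; the data structure and function names are assumptions for illustration, not part of any database API.

```python
from collections import defaultdict

# (primary symbol, aliases) pairs taken from the DLC1 example in the text.
ENTRIES = [
    ("DLC1", ["ARHGAP7", "KIAA1723", "STARD12"]),
    ("DLEC1", ["DLC1"]),
    ("DYNLL1", ["DLC1", "DNCL1", "DNCLC1", "HDLC1"]),
]


def build_alias_index(entries):
    """Map every primary symbol and alias (upper-cased) to the set of primary symbols using it."""
    index = defaultdict(set)
    for primary, aliases in entries:
        for name in [primary, *aliases]:
            index[name.upper()].add(primary)
    return index


def resolve(name, index):
    """Return the sorted candidate primary symbols for a name; more than one means ambiguous."""
    return sorted(index.get(name.upper(), ()))


index = build_alias_index(ENTRIES)
candidates = resolve("DLC1", index)
if len(candidates) > 1:
    print("Ambiguous name 'DLC1' could refer to:", ", ".join(candidates))
```

A query that returns more than one candidate is best resolved with a stable accession number before any data are merged.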


Similarly, many papers refer to SHC, but details about which variant and which species are frequently "hidden" in the supplemental information (http://www.cs.utoronto.ca/~juris/data/intech12/). Yet, there are at least four variants in mouse and human. Sometimes, a radical change in nomenclature is required, such as in the case of caspases: systematic analysis led to redefining various ICE, MACH and MCH genes as Caspase1-10 (Alnemri et al., 1986).

There are many different standards for referring to genes and proteins: UniProt (http://www.uniprot.org) (Jain et al., 2009), Ensembl (http://www.ensembl.org) (Flicek et al., 2012), EBI IPI (http://www.ebi.ac.uk) (Kersey et al., 2004), Gene Cards (http://www.genecards.org) (Safran et al., 2010) and NCBI Gene (http://www.ncbi.nlm.nih.gov) (Maglott et al., 2011) are just a few examples of databases that attempt to systematically characterize and describe genes and proteins. Each database has its own focus and strengths, and different interaction or annotation databases may choose any one of these standards to organize their data. In this example, and in many other use cases of NAViGaTOR, the user may have to import data from one or more databases that use different nomenclatures. To facilitate the use of multiple nomenclatures, NAViGaTOR can store multiple IDs per node as a text feature, allowing alternative keys for node identification. When combining data from two or more databases using different formats, the user must translate between these different nomenclatures. This must be done very carefully and methodically, as this additional translation step often affects the data returned. For example, UniProt stores mappings from its own accession IDs to Ensembl Gene IDs, and Ensembl stores mappings from its own IDs to UniProt. However, respectively, they return 55,639 unique UniProt accession IDs for 20,995 unique Ensembl gene IDs and 21,735 unique Ensembl gene IDs for 63,370 unique UniProt accession IDs. The mapping is clearly different depending on which method is used. There is no definitive mapping available in situations such as these: it is up to the individual researcher to choose and document the translations used to amalgamate their data in a fashion that is replicable. Bearing this in mind during the earlier stages of experiment design will make this process much easier and less prone to confusion or ambiguity.
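One way to document such a discrepancy is to express both mapping tables as sets of (UniProt, Ensembl) pairs and compare them directly. The identifiers below are placeholders, and the two dictionaries stand in for tables downloaded from the two resources; no real API is being called.

```python
def as_pairs(mapping, flip=False):
    """Flatten {key: {values}} into a set of (key, value) pairs, optionally reversed."""
    pairs = {(key, value) for key, values in mapping.items() for value in values}
    return {(value, key) for key, value in pairs} if flip else pairs


# Hypothetical mapping tables: one keyed by UniProt accession, one by Ensembl gene ID.
uniprot_to_ensembl = {"U1": {"E1"}, "U2": {"E1", "E2"}}
ensembl_to_uniprot = {"E1": {"U1"}, "E2": {"U2", "U3"}}

from_uniprot = as_pairs(uniprot_to_ensembl)             # (uniprot, ensembl)
from_ensembl = as_pairs(ensembl_to_uniprot, flip=True)  # (uniprot, ensembl)

agreed = from_uniprot & from_ensembl
only_from_uniprot = from_uniprot - from_ensembl
only_from_ensembl = from_ensembl - from_uniprot

print("pairs in both sources:", len(agreed))
print("pairs only in the UniProt-keyed table:", sorted(only_from_uniprot))
print("pairs only in the Ensembl-keyed table:", sorted(only_from_ensembl))
```

Recording which table was used, and how many pairs each direction contributed, is a simple way to keep the translation step replicable.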

## **3.2 Associating data with an existing graph**

Though better organized, we still have in excess of 1,000 nodes and 3,000 interactions, and to better identify nodes and edges that represent novel research material, we must associate more data with those objects.

Fig. 5. Example graph with numbers of referencing studies in ovarian and prostate cancer mapped to node width and height.

We can, for example, integrate PPIs with the gene expression results obtained from our literature studies. Each file contains several values associated with each gene, specifying the number of studies in which the gene was down-regulated, the number in which it was up-regulated, and a total representing both (Figure 5). We will also generate a third file representing the total number of studies in which the gene was found to be significantly deregulated, which simply sums the totals from the previous two files. As when opening the initial experiment, NAViGaTOR requires a unique identifier column to be specified. In this case, because we are only concerned with data to be associated with nodes, the program only requires a single Node ID column. This process is the same for the prostate, ovarian cancer and generated data sets. To visualize this data, we will add another filter, this time mapping the total number of significantly deregulated studies in ovarian cancer to node width, and the total number of significantly deregulated studies in prostate cancer to node height. It is immediately evident which nodes have already been reported as up- or down-regulated in one or both types of cancer. This can be useful for drawing parallels between what is already known about one cancer and the other. In addition, we can map the generated total of studies to node transparency, making genes with less disease evidence less obtrusive.
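The generated third data set and the resulting node attributes can be sketched as follows. The gene names, study counts, base size and scaling step are hypothetical, and the functions only illustrate the arithmetic behind the mapping rather than NAViGaTOR's own filters.

```python
def total_deregulated(ovarian, prostate):
    """Sum per-gene study totals from both cancers; a gene missing from one table counts as 0."""
    genes = set(ovarian) | set(prostate)
    return {gene: ovarian.get(gene, 0) + prostate.get(gene, 0) for gene in genes}


def node_style(gene, ovarian, prostate, combined, base=10.0, step=4.0):
    """Width tracks ovarian studies, height tracks prostate studies, opacity tracks the total."""
    most = max(combined.values(), default=0)
    opacity = 0.2 + 0.8 * (combined.get(gene, 0) / most) if most else 0.2
    return {
        "width": base + step * ovarian.get(gene, 0),
        "height": base + step * prostate.get(gene, 0),
        "opacity": round(opacity, 2),
    }


# Hypothetical per-gene totals of studies reporting significant deregulation.
ovarian_totals = {"PRO1": 3, "G1": 1}
prostate_totals = {"PRO1": 2, "G2": 5}

combined_totals = total_deregulated(ovarian_totals, prostate_totals)
pro1_style = node_style("PRO1", ovarian_totals, prostate_totals, combined_totals)
g1_style = node_style("G1", ovarian_totals, prostate_totals, combined_totals)
```

With this scheme a wide, short node points to ovarian-only evidence, a tall, narrow node to prostate-only evidence, and a faint node to little evidence in either.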

Fig. 6. Example graph with GO Annotation mapped to a color scheme.

We can also import structured data, in the form of GO attributes, retrieved from the I2D plug-in (Figure 6). We can view this data per individual node in the Node side panel, revealing the list of individual GO attributes and their descriptions. To get a graph-wide view of these attributes, we will add a filter to map the GO data to one of several categories, each with its own colour. The same result can be obtained by retrieving GO terms or other attributes, such as the pathways to which a node belongs, from other sources, applying them to the nodes as features, and editing the filter accordingly.
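A category filter of this kind amounts to a lookup from annotation to category to colour. In the sketch below, the category names, the colours, the grouping of GO terms and the rule that the first matching term wins are all assumptions made for illustration.

```python
# Hypothetical categories and colours; a real filter would use its own scheme.
CATEGORY_COLOURS = {"apoptosis": "red", "cell cycle": "blue", "transport": "green"}

# Illustrative grouping of GO terms into the categories above.
TERM_TO_CATEGORY = {
    "GO:0006915": "apoptosis",   # apoptotic process
    "GO:0007049": "cell cycle",  # cell cycle
    "GO:0006810": "transport",   # transport
}


def node_colour(go_terms, default="grey"):
    """Return the colour of the first GO term that falls in a known category."""
    for term in go_terms:
        category = TERM_TO_CATEGORY.get(term)
        if category is not None:
            return CATEGORY_COLOURS[category]
    return default


# Hypothetical annotations per node.
annotations = {
    "PRO1": ["GO:0007049", "GO:0006915"],
    "P_A": ["GO:0006810"],
    "P_B": [],
}
colours = {node: node_colour(terms) for node, terms in annotations.items()}
```

Because a protein often carries several GO terms, the order in which terms are tested determines the colour, so that priority is worth stating explicitly alongside the figure.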

## **3.3 Importing drug-protein interactions**

Finally, we will import a list of drugs and their gene targets as additional interactions. This expands our network to 2,707 nodes and 5,257 edges (Figure 7). Through a combination of manual layout and radial layout tools, we arrange the drugs in a circle around PRO1, its interactors, and their interactors from I2D. The edges connecting drugs to proteins are coloured blue to differentiate them from PPIs. To see the impact of individual drugs on this network, we map their degree to node size and transparency. Thus, large nodes represent drugs that target many of the proteins in the network. The top six of these drugs are labelled for convenience. Conversely, some proteins, such as ProX, have a high degree of blue edges that connect to small nodes. These drugs show strong specificity to ProX. The initial data will be available in ASCII tab-delimited format and the final figure in a NAViGaTOR 2 XML file at http://www.cs.utoronto.ca/~juris/data/intech12/.
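The degree-based sizing described here can be sketched with a simple count over the drug-target edge list. The drug names, the size range and the tie-breaking rule are hypothetical; only the idea of mapping degree to node size and labelling the top six drugs comes from the text.

```python
from collections import Counter


def drug_degrees(drug_target_edges):
    """Count the distinct protein targets of each drug within the network."""
    return Counter(drug for drug, _target in set(drug_target_edges))


def top_drugs(degrees, n=6):
    """Return the n drugs with the most targets, breaking ties alphabetically."""
    ranked = sorted(degrees.items(), key=lambda item: (-item[1], item[0]))
    return [drug for drug, _count in ranked[:n]]


def node_size(degree, max_degree, smallest=8.0, largest=40.0):
    """Scale node size linearly with degree, relative to the best-connected drug."""
    if max_degree <= 0:
        return smallest
    return smallest + (largest - smallest) * degree / max_degree


# Hypothetical (drug, protein) edges; the repeated edge is counted once.
edges = [
    ("DrugA", "PRO1"), ("DrugA", "P2"), ("DrugA", "P3"), ("DrugA", "P2"),
    ("DrugB", "ProX"), ("DrugC", "ProX"),
]
degrees = drug_degrees(edges)
labelled = top_drugs(degrees, n=2)
sizes = {drug: node_size(count, max(degrees.values())) for drug, count in degrees.items()}
```

In this toy data set DrugB and DrugC each have a single target, ProX, which corresponds to the small, specific drug nodes described above.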

Fig. 7. Final graph, with drug interactions included and the size of nodes representing drugs derived from number of interactions within the graph. NAViGaTOR 2 XML file for the final figure is available in supplemental material (http://www.cs.utoronto.ca/~juris/data/intech12/).

## **4. Conclusions**

Integrated databases and resources are only useful when they can be effectively accessed, navigated and analyzed. Several biological network visualization tools are currently available, providing a diverse range of approaches and algorithms. While many existing visualization tools are effective and widely used, there are several critical areas where these applications require improvement. Scalability is essential to visualize the tens of thousands of known PPIs, which is a challenge for current layout algorithms and software. Biological graph drawing software must also be able to handle richly annotated data, including genomic and proteomic profiles, pathways, Gene Ontology annotations and data in PSI-MI and BioPAX formats, in addition to the vast quantity of microarray and proteomic data that is available.

Individual tools need a good balance of performance and useful features. The features that are needed for each use are highly dependent on the available data and the workflow. As in any creative activity, a tool may enable new workflows by providing novel features, but the tool may also lack certain important features, or offer features that are not needed. There is no single solution that satisfies all of these requirements at the present time, and as data and workflows change over time, network visualization tools must also evolve.

As the data grow more complex, the performance of layout algorithms will need to improve, and new options for differentiating multiple attributes will be required. As certain workflows become more mainstream, they may be turned into *analysis patterns* and implemented as plug-ins. Standardizing file formats, APIs and plug-ins will further intertwine existing tools, enabling their easier integration and specialization.

With new data and advances in computational biology, user tasks change, and this must be reflected in the types of algorithms that support analyses and in the user interfaces that effectively enable them. New graph theory algorithms for faster and biologically meaningful network layouts, and algorithms for network structure analysis, will need to be integrated into network visualization tools. Importantly, none of these algorithms would make a broad difference unless a user interface appropriate for biologists is available (Viau et al., 2010).

## **5. Acknowledgment**

This research was funded in part by Ontario Research Fund (GL2-01-030), Canada Foundation for Innovation (CFI #12301 and CFI #203383), and the Ontario Ministry of Health and Long Term Care. The views expressed do not necessarily reflect those of the OMOHLTC. CP was funded in part by Friuli Exchange Program. IJ is supported in part by the Canada Research Chair Program.

The authors would like to thank Max Kotlyar, Dan Strumpf, Fiona Broackes-Carter and the entire Jurisica lab for useful comments and discussions.

## **6. References**

Alnemri E.S., Livingston D.J., Nicholson D.W., Salvesen G., Thornberry N.A., Wong W.W., Yuan J. (1986). Human ICE/CED-3 protease nomenclature. *Cell*, 87(2):171.

Gehlenborg N., O'Donoghue S.I., Baliga N.S., Goesmann A., Hibbs M.A., Kitano H., Kohlbacher O., Neuweger H., Schneider R., Tenenbaum D., Gavin A.C. (2003). Visualization of omics data for systems biology. *Nat Methods*, 7(3 Suppl):S56-68.

Helaers R, Bareke E, De Meulder B, Pierre M, Depiereux S, Habra N, Depiereux E. (2011). gViz, a novel tool for the visualization of co-expression networks. *BMC Res Notes*. 4(1):452.

Himsolt, M. (1996). GML: A portable Graph File Format. Syntax. Retrieved from http://www.fim.uni-passau.de/fileadmin/files/lehrstuhl/brandenburg/projekte/gml/gml-technical-report.pdf

Hu Z., Hung J.H., Wang Y., Chang Y.C., Huang C.L., Huyck M., DeLisi C. (2009). VisANT 3.5: multi-scale network visualization, analysis and inference based on the gene ontology. *Nucleic Acids Res*, 37, W115–W121.

Hu, Z., Mellor, J., Wu, J. and DeLisi, C. (2004). VisANT: an online visualization and analysis tool for biological interaction data. *BMC Bioinformatics*, 5, 17.

Jain E., Bairoch A., Duvaud S., Phan I., Redaschi N., Suzek B.E., Martin M.J., McGarvey P., Gasteiger E. (2009). Infrastructure for the life sciences: design and implementation of the UniProt website. *BMC Bioinformatics*, 10:136.

Junker B.H., Klukas C. & Schreiber F. (2006). VANTED: a system for advanced data analysis and visualization in the context of biological networks. *BMC Bioinformatics*, 7(109).

Kerrien S, Orchard S, Montecchi-Palazzi L, Aranda B, Quinn AF, Vinod N, Bader GD, Xenarios I, Wojcik J, Sherman D, Tyers M, Salama JJ, Moore S, Ceol A, Chatr-Aryamontri A, Oesterheld M, Stümpflen V, Salwinski L, Nerothin J, Cerami E, Cusick ME, Vidal M, Gilson M, Armstrong J, Woollard P, Hogue C, Eisenberg D, Cesareni G, Apweiler R, Hermjakob H (2007). Broadening the horizon--level 2.5 of the HUPO-PSI format for molecular interactions. *BMC Biol*. 5, 44.

Kersey P. J., Duarte J., Williams A., Karavidopoulou Y., Birney E., Apweiler R. (2004). The International Protein Index: An integrated database for proteomics experiments. *Proteomics* 4(7): 1985-1988.

Kreeger P.K., Lauffenburger D.A. (2010). Cancer systems biology: a network modeling perspective, *Carcinogenesis*, 31(1):2-8.

Longabaugh WJ. (2012). BioTapestry: a tool to visualize the dynamic properties of gene regulatory networks. *Methods Mol Biol.* 786:359-94.

Maglott D, Ostell J, Pruitt KD, Tatusova T. (2011). Entrez Gene: gene-centered information at NCBI. *Nucleic Acids Res*. 39(Database issue):D52-7.

McGuffin, M, and Jurisica, I. (2009). Interaction techniques for selecting and manipulating subgraphs in network visualizations. *IEEE Trans Vis Comput Graph*, 15 (6): 937-944.

Montojo J, Zuberi K, Rodriguez H, Kazi F, Wright G, Donaldson SL, Morris Q, Bader GD (2010). GeneMANIA Cytoscape plugin: fast gene function predictions on the desktop. *Bioinformatics*, 26: 22

Morrow JK, Tian L, Zhang S. (2010). Molecular networks in drug discovery. *Crit Rev Biomed Eng.* 38(2):143-56.


Aranda, B.,Blankenburg, H.,Kerrien, S.,Brinkman, F.S.L., Ceol, A., Chautard, E., Dana,

Accessing and scoring molecular interactions. *Nature Methods*, 8(7): 28-529. Björn H. Junker, Christian Klukas and Falk Schreiber (2006). VANTED: A system for

Brown, K.R., and Jurisica, I. (2005). Online Predicted Human Interaction Database.

Brown, K.R., and Jurisica, I. (2007). Unequal evolutionary conservation of human protein

Brown, K.R., Otasek D, Ali M, McGuffin, M.J., Xie W, Devani B, van Toch I.L., and Jurisica I.

Demir E, Cary MP, Paley S, Fukuda K, Lemer C, Vastrik I, Wu G, D'Eustachio P, Schaefer C,

Djebbari, A., Ali, M., Otasek, D., Kotlyar. M., Fortney, K., Wong, S., Hrvojic, A. and Jurisica,

Flicek P, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G,

SM. (2012). Ensembl. *Nucleic Acids Res*. [Epub ahead of print]

I. (2011). NAViGaTOR: Scalable and Interactive Navigation and Analysis of Large

Fairley S, Fitzgerald S, Gil L, Gordon L, Hendrix M, Hourlier T, Johnson N, Kähäri AK, Keefe D, Keenan S, Kinsella R, Komorowska M, Koscielny G, Kulesha E, Larsson P, Longden I, McLaren W, Muffato M, Overduin B, Pignatelli M, Pritchard B, Riat HS, Ritchie GR, Ruffier M, Schuster M, Sobral D, Tang YA, Taylor K, Trevanion S, Vandrovcova J, White S, Wilson M, Wilder SP, Aken BL, Birney E, Cunningham F, Dunham I, Durbin R, Fernández-Suarez XM, Harrow J, Herrero J, Hubbard TJ, Parker A, Proctor G, Spudich G, Vogel J, Yates A, Zadissa A, Searle

interactions in interologous networks. *Genome Biol*, 8(5):R95.

*BMC Bioinformatics*, 7:109

*Bioinformatics*, 21(9):2076-82.

*Bioinformatics,* 25(24): 3327-3329.

Graphs. *Internet Mathematics*, 7(4):314-347.

28(9):935-42.

J.M.,De Las Rivas, J., Dumousseau, M.,Galeota, E., Gaulton, A., Goll, J., Hancock, R.E.W., Isserlin, R., Jimenez, R.C., Kerssemakers, J., Khadake, J., Lynn, D.J., Michaut, M.,O'Kelly, G., Ono, K., Orchard, S., Prieto, C., Razick, S., Rigina, O., Salwinski, L., Simonovic, M., Velankar, S., Winter, A., Wu, G., Bader, G.D., Cesareni, G., Donaldson, I.M., Eisenberg, D., Kleywegt, G.J., Overington, J., Ricard-Blum, S., Tyers, M., Albrecht, M.,Hermjakob, H. (2011). PSICQUIC and PSISCORE:

advanced data analysis and visualization in the context of biological networks.

(2009). NAViGaTOR: Network Analysis, Visualization and Graphing Toronto,

Luciano J, Schacherer F, Martinez-Flores I, Hu Z, Jimenez-Jacinto V, Joshi-Tope G, Kandasamy K, Lopez-Fuentes AC, Mi H, Pichler E, Rodchenkov I, Splendiani A, Tkachev S, Zucker J, Gopinath G, Rajasimha H, Ramakrishnan R, Shah I, Syed M, Anwar N, Babur O, Blinov M, Brauner E, Corwin D, Donaldson S, Gibbons F, Goldberg R, Hornbeck P, Luna A, Murray-Rust P, Neumann E, Reubenacker O, Samwald M, van Iersel M, Wimalaratne S, Allen K, Braun B, Whirl-Carrillo M, Cheung KH, Dahlquist K, Finney A, Gillespie M, Glass E, Gong L, Haw R, Honig M, Hubaut O, Kane D, Krupa S, Kutmon M, Leonard J, Marks D, Merberg D, Petri V, Pico A, Ravenscroft D, Ren L, Shah N, Sunshine M, Tang R, Whaley R, Letovksy S, Buetow KH, Rzhetsky A, Schachter V, Sobral BS, Dogrusoz U, McWeeney S, Aladjem M, Birney E, Collado-Vides J, Goto S, Hucka M, Le Novère N, Maltsev N, Pandey A, Thomas P, Wingender E, Karp PD, Sander C, Bader GD. (2010). The BioPAX community standard for pathway data sharing. *Nature Biotechnology*.


Niu, Y., Otasek, D., Jurisica, I. (2011). Evaluation of linguistic features useful in extraction of

Pavlopoulos G.A., Wegener A.L., Schneider R. (2008). A survey of visualization tools for

Remmerie N., De Vijlder T., Laukens K., Dang T.H., Lemière F., Mertens I., Valkenborg D.,

Shannon P., Markiel A., Ozier O., Baliga N.S., Wang J.T., Ramage D., Amin N., Schwikowski

Shirdel EA, Xie W, Mak TW, Jurisica I. (2011) NAViGaTing the micronome--using multiple

Smoot ME, Ono K, Ruscheinski J, Wang PL, Ideker T. (2011). Cytoscape 2.8: new features for data integration and network visualization. *Bioinformatics*. 1;27(3):431-2. Stein A., Mosca R., Aloy P. (2011). Three-dimensional modeling of protein interactions and

Swainston N, Smallbone K, Mendes P, Kell D, Paton N. (2011). The SuBliMinaL Toolbox:

The Gene Ontology Consortium. (2000). Gene ontology: tool for the unification of biology.

Viau, C., McGuffin, M J., Chiricota, Y., and Jurisica, I. (2010). The FlowVizMenu and parallel

Yu L.R. (2011). Pharmacoproteomics and toxicoproteomics: The field of dreams*. J Proteomics*,

biomolecular interaction networks. *Genome Res*, 13:2498–2504.

complexes is going 'omics*. Curr Opin Struct Biol*, 21(2):200-8.

exploration. *IEEE Trans Vis Comput Graph*, 16(6):1100-8.

predicted interactions in I2D. *Bioinformatics*, 26(1): 111-9.

biological network analysis, *BioData Min,* 1(12).

10.1093/database/baq020

8(2):186.

*Nat. Genet*. 25(1):25-9.

74(12):2549-53.

microRNAs. *PLoS One*. 6(2):e17429.

interactions from PubMed; Application to annotating known, high-throughput and

Blust R., Witters E. (2011). Next generation functional proteomics in non-model plants: A survey on techniques and applications for the analysis of protein complexes and post-translational modifications*. Phytochemistry*, 72(10):1192-218. Safran M, Dalah I, Alexander J, Rosen N, Iny Stein T, Shmoish M, Nativ N, Bahir I, Doniger

T, Krug H, Sirota-Madi A, Olender T, Golan Y, Stelzer G, Harel A and Lancet D. (2010). GeneCards Version 3: the human gene integrator *Database*; doi:

B., Ideker T. (2003). Cytoscape: a software environment for integrated models of

microRNA prediction databases to identify signalling pathway-associated

automating steps in the reconstruction of metabolic networks. *J Integr Bioinform*.

scatterplot matrix: Hybrid multidimensional visualizations for network

## *Edited by Weibo Cai and Hao Hong*

Proteins are indispensable players in virtually all biological events. The functions of proteins are coordinated through intricate regulatory networks of transient protein-protein interactions (PPIs). To predict and/or study PPIs, a wide variety of techniques have been developed over the last several decades. Many in vitro and in vivo assays have been implemented to explore the mechanism of these ubiquitous interactions. However, despite significant advances in these experimental approaches, many limitations remain, such as false positives and false negatives, difficulty in obtaining crystal structures of proteins, and challenges in detecting transient PPIs, among others. To overcome these limitations, many computational approaches have been developed, and they are increasingly widely used to facilitate the investigation of PPIs. This book gathers contributions from experts in the field in 22 chapters, broadly categorized into Computational Approaches, Experimental Approaches, and Others.

Protein-Protein Interactions - Computational and Experimental Tools
