22 Bioinformatics

formats (e.g., between PSI MI and BIND XML, and between Genome Object Net and SBML). However, the complexity of such a pair-wise translation approach increases dramatically with a growing number of different pathway data formats. To address this issue, a standard pathway data exchange format is needed. While the Resource Description Framework (RDF) is an important first step towards the unification of XML formats in describing metadata (ontologies), it is not expressive enough to support formal knowledge representation [50]. To address this problem, more sophisticated XML-based ontological languages such as the Web Ontology Language (OWL) have been developed. An OWL-based pathway exchange standard, called BioPAX, has been released to the research community [51].

**4. Biological Pathway Data Integration**

An integration model may serve as a tool for the user for a specific type of pathway. An algorithm for integration is presented next.

**Metabolic Pathways:** Integrating pathways from different data sources for the same species starts by extracting the structures they have in common. This step integrates a given pathway vertically, within a species and across data sources (the database is the variable). It includes sorting a graph $G(V, E)$ for the $V$'s and $E$'s common to $G_i(V_i, E_i)$ and $G_j(V_j, E_j)$. In the discussion that follows, the integration of the TCA cycle as given by two data sources, KEGG ($D_{ij1}$) and BioCyc ($D_{ij2}$), for *E. coli K-12* is considered. For metabolic pathways, the details associated with each graph include the nodes and edges as given below. For protein-protein interactions, the nomenclature and associated fields for nodes and edges may change. However, it is possible to come up with a structure that can describe protein-protein interactions or signal transduction pathways.

*Node:* Biological Name, ID, Neighbor, Type, Context, Pathway, Data Source, PubList, SynList, Empirical Formula, Structure

*Edge:* EdgeID, EdgeSource, EdgeDest, Reaction Type (Reversible/Irreversible), Data Source, Enzyme, Genes

**Signal Transduction Pathways:** The information contained in signal transduction pathways differs from that in metabolic pathways. In signal transduction pathways, the interactions can be represented as a class hierarchy. Our aim is also to integrate a sample pathway, such as the insulin pathway, from sources such as KEGG and SPAD to assess the performance of our algorithm. Interestingly, SPAD assigns an evidence code to each edge (interaction), whereas KEGG assigns only one evidence code to the whole pathway (nodes and edges). The format of the table for integration is given above. Before integration, the information associated with every object (node) and edge (interaction) should be considered.

Before proceeding with the integration of a pathway across data sources, the following steps need to be carried out.

Step 1

	- If the column names are the same, continue; otherwise, look up the alternate tag for the column and match on that.
	- If the column order matches, continue; otherwise, reorder the columns as given in the output table.
	- If the number of columns is not the same, append the new columns to the table.

Step 2

Step 3
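The column checks listed under the steps above (matching names through alternate tags, reordering to the output layout, and appending missing columns) can be sketched as follows. This is only an illustration under stated assumptions: the function `align_table`, the `alt_tags` mapping, and the sample row are hypothetical and are not part of the chapter's implementation.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def align_table(rows: List[Row], output_columns: List[str],
                alt_tags: Dict[str, str]) -> List[Row]:
    """Return rows whose columns follow the layout of the output table."""
    aligned = []
    for row in rows:
        # Column names differ: map each one through its alternate tag, if any.
        renamed = {alt_tags.get(name, name): value for name, value in row.items()}
        # Column order differs: rebuild the row in output-table order.
        # Output columns that this source lacks are filled with None.
        ordered: Row = {col: renamed.get(col) for col in output_columns}
        # Column count differs: append columns that only this source has.
        for name, value in renamed.items():
            if name not in ordered:
                ordered[name] = value
        aligned.append(ordered)
    return aligned


# Illustrative row only; the column names are assumed, not taken from KEGG.
kegg_rows = [{"Source": "KEGG", "Name": "citrate", "Formula": "C6H8O7"}]
output_columns = ["Biological Name", "ID", "Data Source", "SynList"]
alt_tags = {"Name": "Biological Name", "Source": "Data Source"}

aligned = align_table(kegg_rows, output_columns, alt_tags)
print(list(aligned[0].keys()))
```

Keeping the alternate tags in a plain mapping means a new data source can be added by supplying one more dictionary, without touching the alignment logic.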


The notations used in our algorithm are presented next.

### **4.1. Notations**


Consider a tuple

$$(s_i, (p_{ij}, (D_{ijk}))) \qquad (3)$$

where $D_{ijk} = \{d_{ij1}, d_{ij2}, d_{ij3}, \dots, d_{ijk}\}$ is a set of $k$ data sources for $(s_i, p_{ij})$. (4)

For *E. coli* ($s_1$), the tuple $(v_{111n}, e_{111m})$ gives the (node, edge) in BioCyc for the TCA cycle, and the tuple $(v_{112p}, e_{112p})$ gives the (node, edge) in KEGG for the TCA cycle.

• $s_2 = \{(s_2, p_{2j}, (D_{2jk}))\} = \{(s_2, p_{2j}, d_{2j1}), (s_2, p_{2j}, d_{2j2}), \dots, (s_2, p_{2j}, d_{2jr})\}$ for $r$ databases.

For example, $s_2$: *Arabidopsis*; $p_{2j}$: TCA cycle; $d_{2j1}$ = BioCyc, $d_{2j2}$ = AraCyc.

Then, the tuple $(v_{221p}, e_{221p})$ gives the (node, edge) in AraCyc for the TCA cycle in *Arabidopsis*, and the tuple $(v_{222p}, e_{222p})$ gives the (node, edge) in KEGG for the TCA cycle in *Arabidopsis*.

Each pathway $p_{ij}$ for a $d_{ijk}$ is given by a graph $G(V_{ijk}, E_{ijk})$, where

$$P_{ijk} = G(V_{ijk}, E_{ijk}) \qquad (5)$$

represents pathway $j$ from the $k$th data source for species $i$, and

$$V_{ijk} = \{v_{ijk1}, v_{ijk2}, \dots, v_{ijkn}\} = \text{set of nodes in } d_{ijk} \qquad (6)$$

$$E_{ijk} = \{e_{ijk1}, e_{ijk2}, \dots, e_{ijkm}\} = \text{set of edges in } d_{ijk} \qquad (7)$$

- $\mathrm{SynList}\{\text{pathway name}\} = \mathrm{SynList}\{p_{ij}\}$
- $\mathrm{SynList}\{\text{entity name}\} = \mathrm{SynList}\{v_{1jkn}\}$
- $EV_{ijk} = \{EV_{ijk1}, \dots, EV_{ijkh}\}$: set of $h$ EV (evidence) codes for $\{s_i, p_{ij}, d_{ijk}\}$, for example:
	- $EV_{1j1}$ = {set of EV codes given by BioCyc for *E. coli* for the TCA cycle}
	- $EV_{1j2}$ = {set of EV codes given by KEGG for *E. coli* for the TCA cycle}
	- $EV_{2j3}$ = {set of EV codes given by AraCyc for *Arabidopsis* for the TCA cycle}
	- $EV_{2j2}$ = {set of EV codes given by KEGG for *Arabidopsis* for the TCA cycle}
- $RI_{ijk}$: reference index for a database $d_{ijk}$
- $RI_{ij}^{\mathrm{int}}$: reference index for the integrated pathway
- $CW_{ijk}$: confidence weight for a database $d_{ijk}$
- $CW_{ij}^{\mathrm{int}}$: confidence weight of the integrated pathway $p_{ij}$ within a species
- $V_{ij}^{\mathrm{int}}$: integrated node table for a species $s_i$, for a pathway $p_{ij}$
- $E_{ij}^{\mathrm{int}}$: integrated edge table for a species $s_i$, for a pathway $p_{ij}$
- $(v_{1jkn}, e_{ijkm})$ = (node $n$, edge $m$) in $d_{1jk}$ of $s_1$ for $p_{1j}$
- $\mathrm{ATT}\{(v_{1jkn}, (A))\} = \{v_{1jkn}, (A_1, A_2, A_3, A_4, \dots, A_s)\}$: set of attributes of the node $v_{1jkn}$
- $\mathrm{ATT}\{(e_{ijkm}, (B))\} = \{e_{ijkm}, (B_1, B_2, B_3, \dots, B_t)\}$: set of attributes of the edge $e_{ijkm}$
- $\mathrm{DATT}\{v_{1jkn}, (\delta A)\}$: set of derived attributes of the node $v_{1jkn}$ $(EV_i, CW_i, RI_i)$
- $\mathrm{DATT}\{e_{1jkn}, (\delta B)\}$: set of derived attributes of the edge $e_{1jkn}$ $(EV_i, CW_i, RI_i)$
- $\delta V_{ijk}$: set of derived node attributes for the integrated pathway $\{EV^{\mathrm{int}}, CW^{\mathrm{int}}, RI^{\mathrm{int}}\}$
- $\delta E_{ijk}$: set of derived edge attributes for the integrated pathway $\{EV^{\mathrm{int}}, CW^{\mathrm{int}}, RI^{\mathrm{int}}\}$
- $V_{ij}^{\mathrm{int}} = \{\Sigma\, V_{ijk}\}$ for $k = 1$ to $n$
- $E_{ij}^{\mathrm{int}} = \{\Sigma\, E_{ijk}\}$ for $k = 1$ to $n$
- $P_{ij}^{\mathrm{int}}$: integrated pathway from multiple data sources, $P_{ij}^{\mathrm{int}} = \{\Sigma\, P_{ijk}\}$ for $k = 1$ to $n$
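To make the notation concrete, the sketch below shows one possible in-memory layout for a pathway graph with its node and edge attributes and the derived attributes (EV, CW, RI). All class and field names are assumptions chosen for illustration, the numeric types for the confidence weight and reference index are guesses, and the node identifiers `n1` and `n2` are placeholders rather than real database IDs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Derived:
    """Derived attributes of a node or an edge: EV codes, CW and RI."""
    ev_codes: Set[str] = field(default_factory=set)
    confidence_weight: float = 0.0   # assumed numeric; the text does not fix a type
    reference_index: float = 0.0     # assumed numeric; the text does not fix a type


@dataclass
class Node:
    name: str                        # Biological Name
    node_id: str                     # ID in the originating data source
    data_source: str
    syn_list: Set[str] = field(default_factory=set)
    attributes: Dict[str, str] = field(default_factory=dict)  # Type, Context, ...
    derived: Derived = field(default_factory=Derived)


@dataclass
class Edge:
    edge_id: str
    source: str                      # EdgeSource (node name)
    dest: str                        # EdgeDest (node name)
    reversible: bool                 # Reaction Type
    data_source: str
    attributes: Dict[str, str] = field(default_factory=dict)  # Enzyme, Genes, ...
    derived: Derived = field(default_factory=Derived)


@dataclass
class Pathway:
    """One graph G(V, E): a pathway of one species in one data source."""
    species: str
    name: str
    data_source: str
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)


tca_biocyc = Pathway("E. coli", "TCA cycle", "BioCyc")
tca_biocyc.nodes.append(Node("citrate", "n1", "BioCyc", syn_list={"citric acid"}))
tca_biocyc.nodes.append(Node("cis-aconitate", "n2", "BioCyc"))
tca_biocyc.edges.append(Edge("e1", "citrate", "cis-aconitate", True, "BioCyc"))
print(len(tca_biocyc.nodes), len(tca_biocyc.edges))
```

Keeping the derived attributes in their own record separates what a data source states from what the integration step computes.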

Hierarchical Biological Pathway Data Integration and Mining 25

### **4.2. Biological Pathway Data Integration Algorithm**

The following selections and inputs are defined by the user.

- User-selected inputs: species, pathway, data sources/databases
- User-defined filters (**UDF**) for entities such as substrate nodes, H2O, CO2, etc., for the integrated pathway [$P_{ij}^{\mathrm{int}} = G(V^{\mathrm{int}}, E^{\mathrm{int}})$]
- User inputs: confidence assigned to each database

Step 1.

For each user-selected pathway $P_{ij}$ for a species $s_i$:

**List** $D_{ij}$ $(d_{1j1}, \dots, d_{njk})$ \*\*\*(KEGG, BioCyc, MetNetDB, etc.)\*\*\*

Step 2. **Define** rules to classify the interactions, for example:

 **Sort** $(d_{1j1}, \dots, d_{njk})$ according to species, $(s_i, d_{ij1})$, $(s_j, d_{jj1})$, etc.

 **Generate** a set of (nodes, edges) from all the input data sources, $\{(V_{ij}, E_{ij})\} = \{(V_{ij1}, E_{ij1}), (V_{ij2}, E_{ij2}), \dots, (V_{ijs}, E_{ijs})\}$,

 where $V_{ijk} = \{v_{ijk1}, v_{ijk2}, \dots, v_{ijkt}\}$ and $E_{ijk} = \{e_{ijk1}, e_{ijk2}, \dots, e_{ijku}\}$.

```
Step 3.
For k = 1, …, q (d_ij1, …, d_ijq):
    For s = 1, …, n and r = 1, …, m:
        List ATT{(v_ijks, (A))}
        List ATT{(e_ijkr, (B))}
    Select v_ijk,1 ∈ V_ijk ⊂ d_ijk
    For all p = 1 to n:
        Check for v_ijk,1 ∈ V_ijp    (node name match across data sources)
        If YES, then apply the EV integration algorithm
            Generate DATT{v_ijkn, (δA)}, DATT{e_ijkn, (δB)}
        Else, for p = 1 to n, for t = 1 to z:
            Check if v_ijk,1 ∈ SynList{v_ijp,t}    (node name (A) against SynList (B))
            If YES, then apply the EV integration algorithm
                Generate DATT{v_ijkn, (δA)}, DATT{e_ijkn, (δB)}
            Else,
                Check if SynList{v_ijk,1} has a match with v_ijp,t
                If YES, then apply the EV integration algorithm
                Else,
                    Check if SynList{v_ijk,1} has a match with SynList{v_ijp,t}
    If v_ijk,1 = v_ijp,l is TRUE, then
        Include v_ijk,1 with the matched node name v_ij(k-1),p ∈ V_ij(k-1)
        Compute (δV_ijk, δE_ijk)
```

\*\*\* This is the node name for the integrated database for the species. **Level 1** \*\*\*

**Generate** $\mathrm{SynList}^{\mathrm{int}} = \{\mathrm{SynList}(v_{ijk,1}) \cup \mathrm{SynList}(v_{ijp,l}) \cup \dots\}$ without duplication

**Associate DOI** (date of integration)

**Generate** $P_{ij}^{\mathrm{int}}$:

$$
\begin{aligned}
P_{ij}^{\mathrm{int}} &= \{\Sigma\, P_{ijk}\} \ \text{for } k = 1 \text{ to } n \\
&= \big[\{V_{ij}^{\mathrm{int}}, E_{ij}^{\mathrm{int}}\} + \{\Sigma\, \delta V_{ijk},\ \Sigma\, \delta E_{ijk}\} \ \text{for } k = 1 \text{ to } n\big] \ \text{at } t = t_1 \\
&= \Sigma\, \{\mathrm{ATT}[(v_{1jkn}, (A))],\ \mathrm{ATT}[(e_{ijkm}, (B))]\} + \Sigma\, \{\mathrm{DATT}\{v_{1jkn}, (\delta A)\},\ \mathrm{DATT}\{e_{1jkn}, (\delta B)\}\}
\end{aligned}
$$

where the sums run over all $n$, $m$, and the second sum gives $\{\delta V_{ijk}, \delta E_{ijk}\}$.

Step 4.

**Repeat** Steps 2-3 for $e_{ijk} \in E_{ijk}$ in $(d_{ij1}, \dots, d_{ijk})$, for $p_{ij}$.

**Include** the information associated with the edge, as given by 'edges', such as reaction, enzyme, by-products and substrates, along with attributes such as evidence, reference publications, context, etc.

\*\* Outputs the $E_{ij}^{\mathrm{int}}$ table for $s_i$ using $(d_{ij1}, \dots, d_{ijk})$, with $EV_{ij}^{\mathrm{int}}$, $CW_{ij}^{\mathrm{int}}$ and $RI_{ij}^{\mathrm{int}}$. **Level 1.** \*\*
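The four-tier node check of Step 3 (name against name, name against synonym list, synonym list against name, synonym list against synonym list) and the duplicate-free merged synonym list can be sketched for two data sources as below. This is a minimal sketch, not the chapter's implementation: the EV integration algorithm is not specified in this section, so a plain union of EV codes stands in for it, matching is exact and case-sensitive, and all names and sample EV codes are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Set


@dataclass
class Node:
    name: str
    source: str
    syn_list: Set[str] = field(default_factory=set)
    ev_codes: Set[str] = field(default_factory=set)


@dataclass
class IntegratedNode:
    name: str
    sources: Set[str]
    syn_list: Set[str]
    ev_codes: Set[str]
    integrated_on: date  # "DOI", the date of integration


def match_level(a: Node, b: Node) -> Optional[int]:
    """Return which of the four checks matched (1 to 4), or None."""
    if a.name == b.name:
        return 1  # node name match across data sources
    if a.name in b.syn_list:
        return 2  # node name of A found in the SynList of B
    if b.name in a.syn_list:
        return 3  # SynList of A matches the node name of B
    if a.syn_list & b.syn_list:
        return 4  # SynList of A matches the SynList of B
    return None


def integrate_nodes(src_a: List[Node], src_b: List[Node],
                    today: date) -> List[IntegratedNode]:
    merged: List[IntegratedNode] = []
    unmatched_b = list(src_b)
    for a in src_a:
        # Prefer the strongest match (lowest level) among the remaining B nodes.
        scored = [(match_level(a, b), i) for i, b in enumerate(unmatched_b)]
        scored = [(level, i) for level, i in scored if level is not None]
        syns, evs, sources = set(a.syn_list), set(a.ev_codes), {a.source}
        if scored:
            partner = unmatched_b.pop(min(scored)[1])
            # Merged synonym list: union of both lists, without duplication.
            syns |= partner.syn_list | {partner.name}
            # Stand-in for the EV integration algorithm: a plain union.
            evs |= partner.ev_codes
            sources.add(partner.source)
        syns.discard(a.name)  # the node's own name is not its synonym
        merged.append(IntegratedNode(a.name, sources, syns, evs, today))
    # Nodes found only in the second source are kept as they are.
    for b in unmatched_b:
        merged.append(IntegratedNode(b.name, {b.source}, set(b.syn_list),
                                     set(b.ev_codes), today))
    return merged


biocyc = [Node("citrate", "BioCyc", {"citric acid"}, {"EV-EXP"}),
          Node("cis-aconitate", "BioCyc")]
kegg = [Node("oxaloacetate", "KEGG"),
        Node("citric acid", "KEGG", {"2-hydroxy-1,2,3-propanetricarboxylic acid"},
             {"EV-COMP"})]

merged = integrate_nodes(biocyc, kegg, date.today())
for node in merged:
    print(node.name, sorted(node.sources))
```

Returning the match level rather than a boolean keeps a record of how strong each correspondence was, which could later feed a confidence weight if desired.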

Step 5.

**Generate** the integrated pathway by consolidating the outputs $G(V_{ij}^{\mathrm{int}}, E_{ij}^{\mathrm{int}})$ for $s_i$.

Step 6.

For $i = 1, \dots, n$:

**Repeat** Steps 2-4 to integrate $P_{ij}$ for all species $s_i$.

\*\*\* This generates the table $(V_{j}^{\mathrm{int}}, E_{j}^{\mathrm{int}}) = \{(V_{ij}^{\mathrm{int}}, E_{ij}^{\mathrm{int}}) \cup (V_{jj}^{\mathrm{int}}, E_{jj}^{\mathrm{int}}) \cup \dots\}$ for $s_i$, for all $i = 1, \dots, n$, for a $p_{ij}$. **Level 2** \*\*\*

Step 7.

For $j = 1, \dots, p$:

**Integrate for all** $P_{ij}$.

\*\*\* This generates the output table $(V^{\mathrm{int}}, E^{\mathrm{int}}) = (V_{j}^{\mathrm{int}}, E_{j}^{\mathrm{int}}) \cup (V_{k}^{\mathrm{int}}, E_{k}^{\mathrm{int}}) \cup \dots$ for all $j = 1, \dots, p$. **Level 3** \*\*\*

Step 8.

**Apply UDF** (user-defined filter).

[54] uses manually curated metabolic networks, orthologues and their related reactions to compare predicted gene-reaction associations.

Arrendondo [55] proposes to develop a process for the continuous improvement of the inference system used, which is applicable to any such data mining application. It involves the comparison of several classifiers, such as Support Vector Machines (SVMs), human-expert-generated fuzzy, and Genetic Algorithm (GA) generated fuzzy and neural networks, using various training data models. In his approach, all classifiers were trained and tested with four different data sets: three biological and a synthetically generated mixture data set. The obtained results showed a highly accurate prediction capability, with the mixture data set providing some of the best and most reliable results.

Biological database integration is a challenging task, as the databases are created all over the world and updated frequently. For biological data sources that may be derived from an earlier existing data source, it is also important to identify the evidence of the data source, represented by the evidence code, for it to be included as a candidate for integration. In most data integration algorithms the user does not participate, thus leading to an integrated data source without any effective utility towards analysis.

**6. Conclusion**

Large-scale integration of pathway databases promises to help biologists gain insight into the deep biological context of a pathway. In this chapter, we presented algorithms that let users select their choice of data sources and apply the evidence code algorithm to compute an integrated EV code and RI for the pathway data of interest. The ultimate goal is to generate a large-scale composite database containing the entire metabolic network for an organism. This qualitative approach includes aspects such as user confidence scores for databases, used for mapping EV codes and generating the RI for a given pathway. For the TCA pathway, the results show that generating such a mapping is helpful in visualizing the integrated database, highlighting the common entities as well as the specifics of each database. As the selection of database confidence weights is user specific, the integration yields different results for different users for the same database, which allows users to explore the effects of different hypotheses on the overall network. Once the integrated evidence code is generated, the data integration algorithm is applied to obtain the integrated pathway data. To best attempt integration of such data, it is imperative to include user participation, as the user mostly identifies the associations and behavior of the various compounds, reactions and genes in a given biological pathway, leading to significant diagnosis.

**Author details**

Shubhalaxmi Kher

*Electrical Engineering, Arkansas State University, USA*

Jianling Peng

*Samuel Roberts Noble Foundation, USA*
