**2.2. Types of pathways**

10 Bioinformatics

Other efforts towards designing new applications for data mining and integration at the K.U.Leuven Center for Computational Systems Biology include;

at the level of cytogenic bands.

of structure learning algorithms.

orthologous genes.

genome of patients with congenital anomalies.

regulators and corresponding motifs to a set of co-expressed genes

**Table 1.** Various Data integration Efforts

| Resource | Description |
|---|---|
| CSNDB | Cell signalling networks database |
| SPAD | Signalling pathway database |
| STKE | Pathway information |
| TransPath | Pathways involved in the regulation of transcription factors |
| GeneNet | Information on gene networks, groups of co-ordinately working genes |
| GeNet | Information on functional organization of regulatory gene networks |
| Blue Print | Biological interaction database |
| DIP | Database of interacting proteins |
| Proteome BioKnowledge | Biological information about proteins, comprising Incyte's Proteome BioKnowledge Library |
| CYGD | Protein-protein interaction map at Comprehensive Yeast Genome Database |
| PathCalling | Yeast Interaction Database at Curagen |
| GRID | The General Repository for Interaction Datasets |
| CytoScape | Visualization and analysis of biological network |
| GenMAPP | Gene Map Annotator and Pathway Profiler |
| Reactome | A knowledgebase of biological processes |

Biological networks are studied and modeled at different levels of description, which gives rise to different pathway types. For example, metabolic pathways describe the conversion of metabolites by enzyme-catalyzed chemical reactions given by their stoichiometric equations, such as the main pathways of energy metabolism, glycolysis and the pentose phosphate pathway. Another pathway type is the signal transduction pathway, also known as information metabolism, which explains how cells receive, process, and respond to information from the environment. A brief description of the various types of pathways is given below.

**A. Metabolic Pathways** describe the network of enzyme-catalyzed reactions that release energy by breaking down nutrients (catabolism) and build up the essential compounds necessary for growth (anabolism). Experimentally determined metabolic pathways have been established for a few model organisms, but most metabolic pathway databases contain pathway data that have been computationally inferred from genome annotations. Because most genome annotations are incomplete, metabolic pathway databases contain pathway holes, which can only be addressed by experiment or computational inference. A good test of a reconstructed metabolic network is to ask whether it can produce the set of essential compounds necessary for growth, given a known minimal nutrient set. To solve this problem, metabolism can be represented as a bipartite directed graph, where one set of nodes represents metabolites and the other set represents biochemical reactions, with labeled edges indicating the relationships between nodes (reaction X produces metabolite Y, or metabolite Y is consumed by reaction X).
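The producibility test described above can be sketched as a forward expansion over the bipartite graph. This is a minimal illustration, not the method of any particular database: the reaction identifiers `R1` to `R4`, the toy reactions, and the function names are assumptions made for the example, and stoichiometric amounts are ignored (a metabolite counts as either available or not).

```python
def producible_metabolites(reactions, nutrients):
    """Expand the set of available metabolites until no reaction adds anything new.

    `reactions` encodes the bipartite directed graph: each reaction node maps to
    the set of metabolites it consumes and the set of metabolites it produces.
    """
    available = set(nutrients)
    changed = True
    while changed:
        changed = False
        for consumed, produced in reactions.values():
            # A reaction can fire only when every consumed metabolite is available.
            if consumed <= available and not produced <= available:
                available |= produced
                changed = True
    return available


def missing_essentials(reactions, nutrients, essentials):
    """Return the essential compounds that the network cannot produce."""
    return set(essentials) - producible_metabolites(reactions, nutrients)


# Hypothetical toy network (not taken from any database).
reactions = {
    "R1": ({"glucose", "ATP"}, {"G6P", "ADP"}),
    "R2": ({"G6P"}, {"F6P"}),
    "R3": ({"F6P", "ATP"}, {"FBP", "ADP"}),
    "R4": ({"X"}, {"lysine"}),  # "X" is never produced: a pathway hole
}
nutrients = {"glucose", "ATP"}
essentials = {"FBP", "lysine"}

print(missing_essentials(reactions, nutrients, essentials))
```

In this toy case the only unreachable essential compound is `lysine`, because its precursor `X` is neither a nutrient nor the product of any reaction. A non-empty result of this kind is what points to a pathway hole.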

**B. Gene Regulatory Networks** describe the network of transcription factors that bind regulatory regions of specific genes and activate or repress their transcription. Gene regulatory networks, or transcription networks, have been found to contain recurring biochemical wiring patterns, termed network motifs, which carry out key functions. How does one find the most significant recurring network motif in a given transcriptional network? To answer this question, transcription networks can be described as directed graphs in which nodes are genes and edges represent transcription interactions, where a transcription factor encoded by one gene modulates the transcription rate of the second gene.
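As a small illustration of the directed-graph view, the sketch below enumerates one well-known three-node motif, the feed-forward loop (X regulates Y, Y regulates Z, and X also regulates Z). The gene labels and the function name are hypothetical. Counting occurrences is only the first half of the question posed above: judging whether a motif is significant additionally requires comparing its count against randomized networks, which this sketch does not attempt.

```python
from itertools import permutations


def feed_forward_loops(edges):
    """Return all (x, y, z) triples with edges x->y, y->z and x->z."""
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    return sorted(
        (x, y, z)
        for x, y, z in permutations(nodes, 3)
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set
    )


# Hypothetical transcription network: an edge (a, b) means that the
# transcription factor encoded by gene a modulates transcription of gene b.
edges = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W")]

print(feed_forward_loops(edges))
```

The brute-force enumeration over all ordered triples is cubic in the number of genes, which is acceptable for an illustration but not for genome-scale networks.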

Hierarchical Biological Pathway Data Integration and Mining 13


**C. Signaling Pathways** describe biochemical reactions for information transmission and processing. Unlike metabolic pathways, which consist of enzyme-catalyzed small-molecule reactions, signaling pathways involve the post-translational modification of proteins, leading to the downstream activation of transcription factors. They are often formed by cascades of activated/deactivated proteins or protein complexes. Such signal transduction cascades may be seen as molecular circuits that mediate the sensing and processing of stimuli. They detect, amplify and integrate diverse external signals to generate responses, such as changes in enzyme activity, gene expression, or ion channel activity. Integration of signaling pathways poses a greater challenge than integration of metabolic pathways because of the diversity of representation schemes for signaling. Some signaling databases, such as PATIKA [35] and INOH [36], use compound graphs to represent signaling pathways, while other object-oriented databases use inheritance to establish relationships between post-translational modifications of proteins.
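The inheritance-based representation mentioned above can be illustrated with a small class hierarchy. This is only a sketch under stated assumptions: the class names, the `state` method, and the example protein and site are hypothetical and do not reproduce the schema of any of the databases cited.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    """Base class: an unmodified protein identified by name."""
    name: str

    def state(self) -> str:
        return "unmodified"


@dataclass(frozen=True)
class ModifiedProtein(Protein):
    """A protein carrying some post-translational modification at a site."""
    site: str

    def state(self) -> str:
        return f"modified at {self.site}"


@dataclass(frozen=True)
class PhosphorylatedProtein(ModifiedProtein):
    """A more specific modification, inheriting the name and site fields."""

    def state(self) -> str:
        return f"phosphorylated at {self.site}"


# Illustrative instances: both describe the same underlying protein.
base = Protein("ERK")
active = PhosphorylatedProtein("ERK", "T202")

print(base.state(), "|", active.state())
print(isinstance(active, Protein), base.name == active.name)
```

Because every modified form is still a `Protein`, a query for all forms of a given protein can rely on the shared `name` field, whereas a compound-graph database would express the same relationship through nesting of nodes. This difference in representation is one reason why integrating signaling data is harder than integrating metabolic data.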

**D. Protein-Protein Interaction**: In proteomic analysis, the products of target genes are used as bait in immunoprecipitation to identify potential binding partners in cell lysate. Higher-level databases such as KEGG [3], TRANSPATH [37], Reactome, STKE [38], and MetaCyc [39] associate networks of interacting proteins with definite cellular processes, including metabolism, signal transduction and gene regulation. These resources typically represent biological information in the form of individual pathway diagrams summarizing experimental results collected during years of research on particular cellular functions. Currently, no single method is capable of predicting all possible protein interactions, so integrative resources such as STRING and Predictome combine multiple theoretical approaches to increase prediction accuracy and coverage. A problem with these networks is the high number of false positives.
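One simple way to combine several prediction methods is sketched below. It assumes that each method returns an independent confidence score between 0 and 1 for a candidate interaction and merges them with a "noisy-OR" rule. Both the rule and the threshold are assumptions chosen for illustration; the text above does not specify how the integrative resources it names combine their methods.

```python
def combine_scores(scores):
    """Noisy-OR combination: 1 - product of (1 - s) over all method scores."""
    remaining_doubt = 1.0
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        remaining_doubt *= 1.0 - score
    return 1.0 - remaining_doubt


def confident_interactions(predictions, threshold):
    """Keep protein pairs whose combined score reaches the threshold."""
    return {
        pair: combine_scores(scores)
        for pair, scores in predictions.items()
        if combine_scores(scores) >= threshold
    }


# Hypothetical scores from three prediction methods for two protein pairs.
predictions = {
    ("P1", "P2"): [0.5, 0.5, 0.2],
    ("P1", "P3"): [0.1, 0.0, 0.1],
}

print(confident_interactions(predictions, threshold=0.7))
```

Agreement between methods raises the combined score above what any single method reports, while a pair supported only weakly by every method stays below the threshold. Raising the threshold trades coverage for fewer false positives.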

**E. Ontology Vocabulary Mapping**: An ontology provides a formal written description of a specific set of concepts and their relationships in a particular domain. The Gene Ontology (GO) has three categories: molecular function, biological process and cellular component. Integration of signaling pathways poses a greater challenge than integration of metabolic pathways because of the diversity of representation schemes for signaling.

#### **2.3. Integration issues**

Biological plant pathway data integration is a multi-step process. It includes integration of various types of pathways, interactions, and gene expression. On another level, it includes various species and different databases. A hierarchical pathway data integration scheme is presented in Figures 1 and 2 below.

Each database also defines its own supporting evidence codes, which reflect its criteria for selecting data. These codes may not be explicitly documented and may not be comparable across sources. This heterogeneity in evidence codes and their representation needs consideration [40]. Since an evidence code may originate from experimentation or from published text, integrating plant pathway data across databases involves standardizing the evidence codes beforehand. The first step is therefore to integrate the evidence codes for a given pathway across databases. Biological databases hold the results of experiments carried out under different conditions and controls; they are mostly open source and employ a variety of formats [41]. Integrating such databases is a multi-step procedure that involves handling the complexities associated with heterogeneous data integration.
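Standardizing evidence codes before integration can be pictured as a lookup from each database's native code to a shared category. The sketch below is illustrative only: the code names follow the style of GO and BioCyc evidence codes, but the shared categories, the mapping itself, and the record layout are assumptions rather than the scheme used by the authors.

```python
from collections import defaultdict

# Hypothetical mapping from (database, native code) to a shared category.
COMMON_CATEGORY = {
    ("GO", "IDA"): "experimental",
    ("GO", "IEP"): "experimental",
    ("GO", "TAS"): "author_statement",
    ("GO", "IEA"): "computational",
    ("BioCyc", "EV-EXP"): "experimental",
    ("BioCyc", "EV-COMP"): "computational",
    ("BioCyc", "EV-AS"): "author_statement",
}


def standardize(database, code):
    """Translate a native evidence code; unknown codes are flagged, not guessed."""
    return COMMON_CATEGORY.get((database, code), "unmapped")


def integrate_evidence(records):
    """Collect the standardized evidence categories reported for each entity."""
    merged = defaultdict(set)
    for entity, database, code in records:
        merged[entity].add(standardize(database, code))
    return dict(merged)


# Hypothetical records of the form (entity, database, native evidence code).
records = [
    ("entity_1", "GO", "IDA"),
    ("entity_1", "BioCyc", "EV-EXP"),
    ("entity_1", "BioCyc", "EV-COMP"),
    ("entity_2", "GO", "XYZ"),
]

print(integrate_evidence(records))
```

Flagging unknown codes as `unmapped` keeps heterogeneity visible instead of silently forcing a match, which matters when the sources do not document their codes.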

**Figure 1.** Hierarchical Pathway data Integration Scheme

#### *A. Ontology Development*


Because isolated ontologies complicate data integration, concepts, relations, and axioms must be shared wherever possible in order to use ontologies to their full potential. Domain ontologies must also be anchored to an upper ontology to enable the sharing and reuse of knowledge.

#### *B. Synonym Integration*

While integrating information about a pathway from a database, some entities require an independent approach. One such entity is the synonym. Each database lists a set of synonyms, and these need to be integrated into a single pool without causing duplication. In the data integration platform that was developed, synonym integration raises issues such as avoiding duplication and accommodating the number of synonyms associated with one entity. Some pathways may include two compounds with different names but the same empirical formula. In such cases integration is challenging, as biologists may also be interested in reviewing the chemical structure along with the integrated output. However, almost all biological pathways are vertically extendable and can be associated with further details. The point here is to include all the salient features of the pathway from a biologist's standpoint. There is no rule of thumb for defining a biologist's interests.
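The two synonym issues described above, pooling without duplication and compounds that share an empirical formula, can be sketched as follows. The normalization rule (case-insensitive, whitespace-collapsed comparison) and the function names are assumptions for illustration, not the platform's actual implementation.

```python
from collections import defaultdict


def merge_synonyms(*synonym_lists):
    """Pool synonyms from several databases, keeping the first spelling seen."""
    pool = {}
    for synonyms in synonym_lists:
        for name in synonyms:
            cleaned = " ".join(name.split())
            # Compare case-insensitively so "D-Glucose" and "d-glucose" collapse.
            pool.setdefault(cleaned.casefold(), cleaned)
    return list(pool.values())


def same_formula_groups(formula_by_name):
    """Group compound names that share an empirical formula (possible isomers)."""
    groups = defaultdict(list)
    for name, formula in formula_by_name.items():
        groups[formula].append(name)
    return {f: names for f, names in groups.items() if len(names) > 1}


# Hypothetical synonym lists as two databases might report them.
source_a = ["D-Glucose", "Dextrose"]
source_b = ["d-glucose", "Grape sugar", "dextrose "]

print(merge_synonyms(source_a, source_b))
print(same_formula_groups({"glucose": "C6H12O6", "fructose": "C6H12O6", "water": "H2O"}))
```

Glucose and fructose share the formula C6H12O6 but are different compounds, so a formula match should be reported for review rather than merged automatically. This is the situation in which a biologist would want to inspect the chemical structure.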


#### *C. Evidence Codes and issues*

For defining an evidence code (EV) with an entity, granularity is another variable. Depending on the database, an EV may apply either to an entity within a pathway, such as a gene, a compound, a reaction or an enzyme, or to the pathway itself. In other words, many databases use the same evidence code for an entire pathway and map that code to each interaction in the pathway. Others assign different EV codes to each interaction and sometimes to each compound or gene.

The Gene Ontology (GO) defines a set of thirteen EVs that assign evidence to gene function. BioCyc defines a class hierarchy structure of four basic EVs with subclasses. MetNetDB incorporates four EVs [42]. KEGG defines only one EV. Ideally, the EVs also reflect on the individual nodes within a specific pathway. Figure 2 depicts the data integration platform, highlighting multiple data sources and integration based on user inputs.

**Figure 2.** Data Integration Platform

1. Many databases use the same evidence code for an entire pathway and map that code to each interaction in the pathway. Others assign codes to each interaction and sometimes to each compound or gene. In other words, the granularity at which an EV can be assigned may be either an entity, such as a gene, a compound, a reaction or an enzyme within the pathway, or the pathway itself. The Gene Ontology (GO) defines a set of thirteen EVs that assign evidence to gene function [43]. BioCyc defines a class hierarchy structure of four basic EVs with subclasses [17]. MetNetDB incorporates four EVs. KEGG defines only one EV. Ideally, the EVs also reflect on the individual nodes within a specific pathway.

2. Since pathway information cannot be assessed with any reliability, it is hard to assign a measure of correctness/authenticity to any one database. We propose that this assignment be user-selected to resolve the issue. To combine the information, a heuristic rule set computes the composite EVs for the integrated database. The unification can be done using any one EV code set as a key. Since each database follows its own standard, it is likely that EVs will not find a perfect match among the databases, or that there will be more than one likely match. To handle these situations, two matching sets, a *perfect match* and a *likely match*, are considered. For example, matching the GO codes *IEP* and *ND* in the EV set above against those in BioCyc results in more than one likely match, *{GO: IEP → BioCyc: EV1, BioCyc: EV2}*.

3. Integrated Evidence Code (EVint) for Perfect Matches: The EV codes encompass the quantitative information giving an insight into how the data was obtained. They define the conditions/constraints associated with obtaining the data.

4. Computing the Reference Index (RIint): For biological databases, the pathway information is mostly inferred by the curators based on experimental, computational, literature or other evidence. The references associated with the database are mostly taken as a measure of support for the data. We introduce a qualitative approach to associate the references supporting the pathway or organism (or compounds or reactions). The reference index *RIint* is computed using a heuristic:

   1. For *Rank* = *High*, ignore *VF*.
   2. For *Rank* = *Low*, use only *VF*.
   3. For all other combinations of *Rank* and *VF*, compute the average.

   Citations may be a robust way of supporting a claim in a database. However, some journals are ranked above others, and citations from those journals are valued more than citations from other sources. To accommodate this, we associate ranks with the journals. The *Rank* specifies the order of importance of a journal as designated by the user. Additionally, we classify citations based on both the journal *Rank* and the *value factor (VF)*. Finally, based on the *Rank* and *VF*, the *Reference index (RI)* is computed.

**3. Evidence codes integration algorithm**

The steps below list the mapping process.

*Given:* Set of *n* databases *{D1, D2, D3, D4, ……, Dn}* (for illustration, only three data sources, namely BioCyc, KEGG and MetNetDB, are considered)

*List:* *EV codes across the databases* (see Tables III(a) and III(b))

*User input:* *Confidence weight (CW)*

Step 1. For a given pathway/organism/entity,

*List: Evidence Codes (EVi)* for the object/entity *(Ei)* among the databases *(Di)*, for example; *D1/E1 {EV1}*, *D2/E1 {EV2}*,….
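The *Rank*/*VF* heuristic for the reference index can be sketched in code. The text does not give numeric scales, so everything numeric here is an assumption: journal ranks are mapped to scores between 0 and 1, *VF* is taken to be a number in the same range, and the index for an entity is taken to be the mean over its citations.

```python
from statistics import mean

# Assumed numeric scale for user-designated journal ranks (not from the source).
RANK_SCORE = {"High": 1.0, "Medium": 0.5, "Low": 0.0}


def citation_score(rank, vf):
    """Apply the three heuristic rules to a single citation."""
    if rank not in RANK_SCORE:
        raise ValueError(f"unknown rank: {rank}")
    if rank == "High":
        return RANK_SCORE["High"]       # rule 1: ignore VF
    if rank == "Low":
        return vf                       # rule 2: use only VF
    return (RANK_SCORE[rank] + vf) / 2  # rule 3: average of rank score and VF


def reference_index(citations):
    """Assumed aggregation: mean citation score over all supporting references."""
    return mean([citation_score(rank, vf) for rank, vf in citations])


# Hypothetical citations supporting one pathway, as (journal rank, value factor).
citations = [("High", 0.2), ("Low", 0.2), ("Medium", 0.7)]

print([citation_score(rank, vf) for rank, vf in citations])
print(reference_index(citations))
```

Because the user designates the ranks, a different `RANK_SCORE` table gives a different index for the same citations, which is consistent with the user-selective design described in the text.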
