**3. Evidence codes integration algorithm**

*Given:* Set of *n* databases *{D1, D2, …, Dn}*.

(For illustration, only three data sources namely, Bio-Cyc, KEGG and MetNetDB are considered)

*User input*: *Confidence weight (CW)*.

*List*: *Evidence codes (EVi)* for the object/entity *(Ei)* among the databases *(Di)*; for example, *D1/E1 {EV1}*, *D2/E1 {EV2}*, ….

The steps below list the mapping process.

Step 1. For a given pathway/organism/entity,

*List*: *EV codes across the databases* (see Tables III(a) and III(b)).

*Assign*: *Direct = 1.0; Indirect = 0.8; Computational = 0.6; Hypothetical = 0.5.*

Hierarchical Biological Pathway Data Integration and Mining 17





Step 2. EV Unification (Rule Set-I)

BioCyc is a collection of 371 pathway/genome databases. Each pathway/genome database in the BioCyc collection describes the genome and metabolic pathways of a single organism. It considers a class hierarchy with four main classes. Since BioCyc and MetNetDB use virtually the same number of EV codes, the mapping is framed around four major EV codes. KEGG uses only one EV code for pathways, namely '*manually entered from published materials*'. The KEGG EV code is mapped to *Direct* using rules such as:

*If Di = BioCyc/AraCyc/MetaCyc, and EV = EV-Exp, then Change EV = Direct* 

Unification of the EV codes across the databases is based on expert knowledge. EV code mapping is done with respect to a reference data source and unified according to the rule set above.
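Rule Set-I can be sketched as a lookup keyed by (database, native EV code). Only the BioCyc-family rule and the KEGG rule are stated in the text; the remaining entries below are illustrative assumptions:

```python
# Rule Set-I: unify database-specific evidence codes to the four major EV
# categories. Only the EV-Exp -> Direct and KEGG -> Direct rules come from
# the text; other entries are illustrative placeholders.
RULE_SET_I = {
    ("BioCyc", "EV-Exp"): "Direct",
    ("AraCyc", "EV-Exp"): "Direct",
    ("MetaCyc", "EV-Exp"): "Direct",
    ("BioCyc", "EV-Comp"): "Computational",  # assumed mapping
    ("KEGG", "manually entered from published materials"): "Direct",
}

def unify_ev(database: str, native_code: str) -> str:
    """Map a database-specific EV code to a unified category."""
    try:
        return RULE_SET_I[(database, native_code)]
    except KeyError:
        raise ValueError(f"no unification rule for {database}/{native_code}")
```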

Step 3. Confidence Weight (*CWi*) Assignment

Researchers typically have databases that they treat as favored sources for different types of information. Since there is no precise rule for deciding which database is more correct and up to date, a user-defined score, a *confidence weight (CW)*, is applied. The EV mapping process is interactive and provides flexibility in the choice of databases. Confidence is defined as:

*CWi = {Very Strong, Strong, Moderate, Poor, Very Poor}*. For example: *CW KEGG = Strong*, *CW BioCyc = Moderate*.
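Since Step 5 multiplies *CW* by the EV score, the qualitative levels need numeric values. The text does not specify them, so the weights below are illustrative assumptions:

```python
# The five qualitative confidence levels come from the text;
# the numeric weights assigned to them are assumptions.
CW_WEIGHT = {
    "Very Strong": 1.0,
    "Strong": 0.8,
    "Moderate": 0.6,
    "Poor": 0.4,
    "Very Poor": 0.2,
}
```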

Step 4. EVint (Rule Set-II)

Using heuristic rules, the integrated EV code (*EVint*) is calculated.

Step 5. Decode *EVint* value

The EV value from Step 4 is decoded using: *EVint = Σi (CWi × EVi) / |i| = x*
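Putting Steps 1-5 together, *EVint* is a confidence-weighted average of the unified EV scores across the databases reporting the entity. A minimal sketch, assuming numeric CW weights (which the text leaves unspecified):

```python
# EVint = sum(CW_i * EV_i) / |i| over the databases reporting the entity.
EV_SCORE = {"Direct": 1.0, "Indirect": 0.8, "Computational": 0.6, "Hypothetical": 0.5}
CW_WEIGHT = {"Very Strong": 1.0, "Strong": 0.8, "Moderate": 0.6,
             "Poor": 0.4, "Very Poor": 0.2}  # numeric values are assumptions

def ev_int(reports):
    """reports: list of (confidence_level, unified_ev_code) pairs, one per database."""
    total = sum(CW_WEIGHT[cw] * EV_SCORE[ev] for cw, ev in reports)
    return total / len(reports)

# KEGG reports Direct with Strong confidence, BioCyc Direct with Moderate:
x = ev_int([("Strong", "Direct"), ("Moderate", "Direct")])  # (0.8 + 0.6) / 2 = 0.7
```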

Step 6. Rank Index

If the publication is not in the list, then *Rank = low*; else, *Rank* = *as defined by the list*.
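The rank lookup can be sketched as follows, assuming the curated list is a simple name-to-rank mapping (its structure and contents are not specified in the text, so the entries below are invented):

```python
# Rank lookup against a curated publication list (entries are illustrative).
PUBLICATION_RANKS = {"Nature": "high", "PNAS": "high", "Plant Physiology": "medium"}

def rank_index(publication: str) -> str:
    """Rank = value from the list if present, otherwise 'low'."""
    return PUBLICATION_RANKS.get(publication, "low")
```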

Step 7. Value Factor *(VF)* 

The *VF* measures support for the entity using the publication evidence. This is a quantitative index with a temporal function.

For *t* = current year, compute *VF (t) = |P (t-2)| / |P|* where,

*|P (t-2)|* = number of publications in *Di* from year *(t-2)* onward, and *|P|* = total number of publications listed in *Di*.
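The VF computation can be sketched directly from the definition, counting publication years for the entity in *Di* (the input representation is an assumption):

```python
def value_factor(pub_years, t):
    """VF(t) = |P(t-2)| / |P|: fraction of the database's publications
    for the entity dated within the last two years (t-2 .. t)."""
    p_total = len(pub_years)
    p_recent = sum(1 for y in pub_years if y >= t - 2)
    return p_recent / p_total if p_total else 0.0

vf = value_factor([2005, 2008, 2009, 2010], t=2010)  # 3/4 = 0.75
```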

Step 8. Reference Index *(RIint)* 

16 Bioinformatics



**3.1. Integration models**

Data integration aims to work with repositories of data from a variety of sources. As such, two databases may not provide identical information, and integrating these two databases may yield a richer resource for analysis. The conditions under which data is collected and the supporting references play a crucial role in making the analysis more meaningful. So far, the integration approaches have focused on different types of pathways. The same pathway can have different representations in different databases.

For example, a known pathway like Glycolysis is represented in different ways in KEGG and BioCyc, as shown in Figure 3. A universal tool that integrates all types of pathways may therefore not be a practical goal. Additionally, different databases employ data representations that may not provide easy access or be user friendly. Figures 3(a) and 3(b) illustrate the representational differences between two data sources for the same pathway. Various data integration models are defined below.
- *Syntactic Networks*: Syntactic networks adhere to the syntax of a set of words as given by the representation of the data and do not interpret the meaning associated. Syntactic heterogeneity is a result of differences in the representation format of data.

- *Semantic Networks* (SN): Semantic heterogeneity is a result of differences in interpretation of the 'meaning' of data. Semantic models aim to achieve semantic interoperability: a dynamic computational capability to integrate and communicate both the explicit and implicit meanings of digital content without human intervention. Features of SN that make it particularly useful for integrating biological data include the ability to easily define an inheritance hierarchy between concepts in a network format, allow economic information storage and deductive reasoning, represent assertions and cause-effect through abstract relationships, cluster related information for fast retrieval, and adapt to new information by dynamic modification of network structures [44]. An important feature of SN is the ease and speed with which information concerning a particular concept can be retrieved. The use of semantic relationships ensures that related concepts cluster together in a network. For example, protein synonyms, functional descriptions, coding sequences, interactions, experimental data or even relevant research articles can all be represented by semantic agents, each of which is directly linked to the corresponding protein agent.



Biological information can be retrieved effectively through simple relationship traversal starting from a query agent in the semantic network. Two approaches primarily in practice for SNs are:

1. memory-mapped data structure and
2. indexing flat files.

**Figure 3.** (a) Pathway from KEGG- Glycolysis (b) BioCyc- Glycolysis

In the memory-mapped data structure approach, subsets of data from various sources are collected, normalized, and integrated in memory for quick access. While this approach performs actual data integration and addresses the problem of poor performance in the federated approach, it requires additional calls to traditional relational databases to integrate descriptive data. While data cleaning is performed on some of the data sources, it is not done across all sources or in the same place. This makes it difficult to quickly add new data sources. In the indexing flat files approach, flat text files are indexed and linked, thus supporting fast query performance.

- *Causal Models*: A causal model is an abstract model that uses cause-and-effect logic to describe the behaviour of a system. An example is expression quantitative trait loci (eQTL) analysis, which studies the relationship between genome and transcriptome. Gene expression QTLs that contain the gene encoding the mRNA are distinguished from other, trans-acting eQTLs. eQTL mapping tries to find genomic variation that explains expression traits. One difference between eQTL mapping and traditional QTL mapping is that a traditional mapping study focuses on one or a few traits, while in most eQTL studies thousands of expression traits are analyzed and thousands of QTLs are declared.

- *Context likelihood of relatedness* (*CLR*): CLR uses transcriptional profiles of an organism across a diverse set of conditions to systematically determine transcriptional regulatory interactions. *CLR* is an extension of the relevance network approach (http://gardnerlab.bu.edu/software&tools.html). The authors of [34] presented an architecture for context-based information integration to solve the semantic difference problem, defined novel modeling primitives of a translation ontology, and proposed an algorithm for translation.

- *Bayes Networks (BN)*: Probabilistic graphical models that represent a set of variables and their probabilistic independencies. For example, a BN could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Bayes networks focus on score-based structure inference. Available heuristic search strategies include simulated annealing and greedy hill-climbing, paired with evaluation of a single random local move or all local moves at each step. The approach of [45] is based on the well-studied statistical tool of Bayesian networks [46]. These networks represent the dependence structure between multiple interacting quantities (e.g., expression levels of different genes). The approach, probabilistic in nature, is capable of handling noise and estimating the confidence in the different features of the network.

- *Hidden Markov Models (HMM)*: An HMM is a statistical model that assumes the system being modeled is a Markov process with unknown parameters, and determines the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications. An HMM can be considered the simplest dynamic Bayesian network. HMMs have been applied to the analysis of biological sequences, in particular DNA, since 1998 [47].
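As an illustration of how an HMM relates hidden states to observable symbols, here is a minimal forward-algorithm sketch for a toy two-state DNA model; the states, emission and transition probabilities are all invented for illustration:

```python
# Forward algorithm for a toy 2-state HMM over DNA symbols.
# States might represent GC-rich vs AT-rich regions; all numbers are invented.
STATES = ("gc_rich", "at_rich")
START = {"gc_rich": 0.5, "at_rich": 0.5}
TRANS = {"gc_rich": {"gc_rich": 0.9, "at_rich": 0.1},
         "at_rich": {"gc_rich": 0.1, "at_rich": 0.9}}
EMIT = {"gc_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "at_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

def forward(seq):
    """Return P(seq | model), summing over all hidden state paths."""
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for sym in seq[1:]:
        alpha = {s: EMIT[s][sym] * sum(alpha[p] * TRANS[p][s] for p in STATES)
                 for s in STATES}
    return sum(alpha.values())

p = forward("GGCA")  # likelihood of observing the sequence under the model
```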


Specific data integration problems in the biological field include:

- *Data schema inconsistencies*: Schema matching is an error-prone task; mapping information must be systematically managed, with domain expert participation.
- *Data model inconsistencies*: Require complex data transformation coding.
- *Different query capabilities*: These affect the query optimization of the data integration system; some biological data sources do not provide an expressive query language.
- *Wrappers*: A traditional wrapper, or a derived wrapper that operates in two modes, e.g., as a virtual source that buffers the execution result of a local application.
- *Miscellaneous*: Network environment, security.

Along with the data schema inconsistencies, there may be data-level inconsistencies, such as data conflicts: each object has its own data type and may be represented in different formats.

**File formats**: For biological pathways, various data sources incorporate information about an entity/reaction/pathway to some level of detail and define their own data format. This includes information like the number of fields, column labels/tags, pathway name(s), etc. At the outset, common information across the tables may look limited and hard to extract, mainly because of the tags or synonyms (other names) of a pathway. One of the other important differences in the way these data sources are developed lies in the synonym representations: some data sources limit the synonyms to 10, while others may list over 40. In the data integration mechanism, if the names of the compounds do not match, then the search should be carried forward with the list of synonyms. In integrating different databases this will take different search times. Also, since the field names (compound names) did not match, the search must unify the field names and generate a new list of synonyms.

**Granularity of information:** Different pathway databases may model pathway data with different levels of detail. This primarily depends on the process definition. For example, one database might treat several processes together as a single process, while another database might treat them as separate processes. Also, one database might include specific steps as part of a process, while another database might not consider these steps. Additionally, the levels of detail associated with a certain database necessitate pathway data modeling with different levels of granularity. Different pathway data formats (e.g., SBML and BIND XML) have been used to represent data with different levels of detail. A semantic-net-based approach to data integration is proposed in [49].
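The synonym-based matching described under 'File formats' can be sketched as follows; the compound names and synonym lists are invented for illustration, not taken from any real database:

```python
# Match a compound across databases by falling back to synonym lists.
# All names below are illustrative.
SYNONYMS = {
    "KEGG": {"D-Glucose": ["Grape sugar", "Dextrose"]},
    "BioCyc": {"Glc": ["D-Glucose", "Dextrose"]},
}

def match_compound(name: str, database: str):
    """Return the database's primary name for `name`, searching synonyms
    when the primary field names do not match directly."""
    for primary, syns in SYNONYMS[database].items():
        if name == primary or name in syns:
            return primary
    return None

# "D-Glucose" is only a synonym in BioCyc, where the primary name is "Glc":
match = match_compound("D-Glucose", "BioCyc")
```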

**Heterogeneous formats**: As the eXtensible Markup Language (XML) has become the lingua franca for representing different types of biological data, there has been a proliferation of semantically overlapping XML formats that are used to represent diverse types of pathway data. Examples include the XML derivatives KGML, SBML, CellML, PSI MI, BIND XML, and Genome Object Net XML. Efforts have been underway to translate between these formats.
