hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications. An HMM can be considered the simplest dynamic Bayesian network. HMMs have been applied to the analysis of biological sequences, in particular DNA, since 1998 [47].

**3.2. Need to use open grid service architecture OGSA-DAI for data access and integration**

Apart from the ubiquitous call for more functionality, bioinformatics projects with commercial users/partners are very anxious about the security of their data. The issue is further complicated by the lack of coherent security models in the evolving WS-RF and WS-I specifications, which OGSA-DAI now supports. This issue needs to be resolved if bioinformatics projects with commercial users/partners are not to be deterred from adopting the product despite its utility. In contrast to the diversity of its data resources, a limited range of operations on these resources is typically required. For instance, one operation is to create a study data set by aggregating data from iterative searches of remote data collections using the same taxonomy object (representing a species or other group) as the search parameter [48].

**3.3. Handling the heterogeneity in data representation among databases**

For biological plant pathways, each database records information about an entity, reaction, or pathway at its own level of detail and defines its own data format. The formats differ in the number of fields, the column labels or tags, the pathway name(s), and so on. At the outset, the information common to the tables may look limited and hard to extract, mainly because of differing tags or synonyms (other names) of a pathway. Before proceeding with the integration of a pathway across data sources, the following steps need to be carried out. For biological pathway integration, the following needs to be considered.

	- What is the aim of integration?
		- To query autonomous and heterogeneous data sources through a common, uniform schema (TARGET SCHEMA).
		- To resolve the various conflicts between the source and target schemas.
		- To offer a common interface for accessing the integrated information.
		- To preserve the autonomy of the participating systems.
		- To integrate data sources easily, without major modification.
	- How will the integrated data be used?
	- Is it within a single data source or across sources?
	- Does it encompass the dynamic nature of the data?
	- Does it support web-based integration?
	- What are the data, source, and user models, and the assumptions underlying the design of the integration system?

Specific data integration problems in the biological field include:

	- Traditional wrapper

**File formats**: For biological pathways, each data source records information about an entity, reaction, or pathway at its own level of detail and defines its own data format, including the number of fields, the column labels or tags, and the pathway name(s). At the outset, the information common to the tables may look limited and hard to extract, mainly because of differing tags or synonyms (other names) of a pathway. Another important difference between data sources lies in how they represent synonyms. Some data sources limit the number of synonyms to 10, whereas others impose no such limit and may list over 40. During data integration, if the names of the compounds do not match, the search should be carried forward with the list of synonyms. The search time will therefore differ from one database to another. Also, since the field names (compound names) do not match, the search must unify the field names and generate a new list of synonyms.
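The synonym fallback described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`name`, `synonyms`), the helper names, and the normalization rule are all hypothetical and are not taken from any particular pathway database.

```python
from typing import Dict, Iterable, Optional, Set


def normalize(name: str) -> str:
    """Lower-case a name and drop spaces, hyphens, and underscores."""
    return "".join(ch for ch in name.lower() if ch not in " -_")


def build_synonym_index(records: Iterable[dict]) -> Dict[str, str]:
    """Map every normalized name or synonym to its record's primary name."""
    index: Dict[str, str] = {}
    for rec in records:
        for alias in [rec["name"], *rec.get("synonyms", [])]:
            index.setdefault(normalize(alias), rec["name"])
    return index


def match_compound(query: dict, index: Dict[str, str]) -> Optional[str]:
    """Try the primary name first, then fall back to the synonym list."""
    for alias in [query["name"], *query.get("synonyms", [])]:
        hit = index.get(normalize(alias))
        if hit is not None:
            return hit
    return None


def merged_synonyms(a: dict, b: dict) -> Set[str]:
    """Unified synonym list for two records judged to be the same compound."""
    return {a["name"], b["name"], *a.get("synonyms", []), *b.get("synonyms", [])}


# Hypothetical records; the field names and contents are illustrative only.
source_a = [{"name": "Citrate", "synonyms": ["Citric acid"]}]
source_b = {"name": "citric-acid", "synonyms": ["2-hydroxypropane-1,2,3-tricarboxylate"]}

index = build_synonym_index(source_a)
print(match_compound(source_b, index))  # prints "Citrate"
```

Because every alias is indexed once, a lookup costs roughly one dictionary access per synonym of the query, so a source that lists 40 synonyms per compound takes longer to match than one that lists 10.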

**Granularity of information:** Different pathway databases may model pathway data at different levels of detail. This depends primarily on how a process is defined. For example, one database might treat several processes together as a single process, while another might treat them as separate processes. Likewise, one database might include specific steps as part of a process, while another might not. The level of detail associated with a given database therefore requires pathway data to be modeled at different levels of granularity. Different pathway data formats (e.g., SBML and BIND XML) have been used to represent data at different levels of detail. A semantic-net-based approach to data integration is proposed in [49].
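One way to compare two sources that differ in granularity is to collapse the intermediate steps of the finer one before matching. The sketch below assumes pathways are stored as directed edge lists and that the intermediates to hide are already known; the function name `collapse` and the edge lists are hypothetical. The example uses the conversion of citrate to isocitrate, which can be written as one step or as two steps via cis-aconitate.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def collapse(edges: List[Edge], hidden: Set[str]) -> Set[Edge]:
    """Remove the `hidden` intermediate nodes, bridging each predecessor to each successor."""
    current = set(edges)
    for node in hidden:
        preds = {u for u, v in current if v == node and u != node}
        succs = {v for u, v in current if u == node and v != node}
        # Drop every edge touching the hidden node, then reconnect its neighbours.
        current = {(u, v) for u, v in current if node not in (u, v)}
        current |= {(u, v) for u in preds for v in succs if u != v}
    return current


# Illustrative edge lists, not taken from any specific database.
fine_grained = [("citrate", "cis-aconitate"), ("cis-aconitate", "isocitrate")]
coarse_grained = {("citrate", "isocitrate")}

print(collapse(fine_grained, {"cis-aconitate"}) == coarse_grained)  # prints True
```

Collapsing discards information, so it is better suited to deciding whether two records describe the same process than to producing the integrated pathway itself.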

**Heterogeneous formats**: As the eXtensible Markup Language (XML) has become the lingua franca for representing different types of biological data, there has been a proliferation of semantically overlapping XML formats that are used to represent diverse types of pathway data. Examples include the XML derivatives KGML, SBML, CellML, PSI MI, BIND XML, and Genome Object Net XML. Efforts have been underway to translate between these formats (e.g., between PSI MI and BIND XML, and between Genome Object Net and SBML). However, the complexity of such a pair-wise translation approach increases dramatically with a growing number of different pathway data formats. To address this issue, a standard pathway data exchange format is needed. While the Resource Description Framework (RDF) is an important first step towards the unification of XML formats in describing metadata (ontologies), it is not expressive enough to support formal knowledge representation [50]. To address this problem, more sophisticated XML-based ontological languages such as the Web Ontology Language (OWL) have been developed. An OWL-based pathway exchange standard, called BioPAX, has been released to the research community [51].
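The growth of the pair-wise approach can be made concrete with a small count. The sketch assumes that each translator is one-directional and that the exchange format is separate from the formats being connected; both function names are hypothetical.

```python
def pairwise_translators(n_formats: int) -> int:
    """Directed converters needed so every format maps to every other one."""
    return n_formats * (n_formats - 1)


def hub_translators(n_formats: int) -> int:
    """Converters needed when each format maps only to and from one exchange format."""
    return 2 * n_formats


formats = ["KGML", "SBML", "CellML", "PSI MI", "BIND XML", "Genome Object Net XML"]
for n in range(2, len(formats) + 1):
    print(n, pairwise_translators(n), hub_translators(n))
```

For the six formats named above this gives 30 pair-wise translators against 12 for a single exchange format. Under these assumptions the exchange format only pays off beyond three formats, and the gap widens with each format added.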





The notations used in our algorithm are presented next.

**4.1. Notations**

$$S = \{s_1, s_2, s_3, \dots, s_n\} \text{ is the set of species.} \tag{1}$$

$$P_{ij} = \{p_{i1}, p_{i2}, \dots, p_{ip}\} \text{ is a set of pathways within } s_i. \tag{2}$$

Consider a tuple

$$\bigl(S_i, (P_{ij}, (D_{ijk}))\bigr) \tag{3}$$

where $D_{ijk} = \{d_{ij1}, d_{ij2}, d_{ij3}, \dots, d_{ijk}\}$ is a set of $k$ data sources for

$$(S_i, P_{ij}) \tag{4}$$

	- $s_1 = \{(s_1, p_{1j}, (D_{1jk}))\} = \{(s_1, p_{1j}, d_{1j1}), (s_1, p_{1j}, d_{1j2}), \dots, (s_1, p_{1j}, d_{1jk})\}$ for $k$ databases.

For example, $s_1$: *E. coli*; $p_{1j}$: TCA cycle; $d_{1j1}$ = BioCyc, $d_{1j2}$ = KEGG.

Then the tuple $(v_{111n}, e_{111m})$ gives the *(node, edge)* in BioCyc for the TCA cycle in *E. coli*, and the tuple $(v_{112p}, e_{112p})$ gives the *(node, edge)* in KEGG for the TCA cycle in *E. coli*.

	- $s_2 = \{(s_2, p_{2j}, (D_{2jk}))\} = \{(s_2, p_{2j}, d_{2j1}), (s_2, p_{2j}, d_{2j2}), \dots, (s_2, p_{2j}, d_{2jr})\}$ for $r$ databases.

For example, $s_2$: *Arabidopsis*; $p_{2j}$: TCA cycle; $d_{2j1}$ = BioCyc, $d_{2j2}$ = AraCyc.

Then the tuple $(v_{221p}, e_{221p})$ gives the *(node, edge)* in AraCyc for the TCA cycle in *Arabidopsis*, and the tuple $(v_{222p}, e_{222p})$ gives the *(node, edge)* in KEGG for the TCA cycle in *Arabidopsis*.

Each pathway $p_{ij}$ for a data source $d_{ijk}$ is given by a graph $G(V_{ijk}, E_{ijk})$, where

$$P_{ijk} = G(V_{ijk}, E_{ijk}) \text{ represents pathway } j \text{ from the } k\text{th data source for species } i, \tag{5}$$

$$V_{ijk} = \{v_{ijk1}, v_{ijk2}, \dots, v_{ijkn}\} \text{ is the set of nodes in } d_{ijk}, \tag{6}$$

$$E_{ijk} = \{e_{ijk1}, e_{ijk2}, \dots, e_{ijkm}\} \text{ is the set of edges in } d_{ijk}. \tag{7}$$

	- $\mathrm{SynList}\{\text{pathway name}\} = \mathrm{SynList}\{P_{ij}\}$

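As a rough illustration of how this notation might map onto a data structure, the sketch below keys each pathway graph by a (species, pathway, data source) triple, mirroring $(s_i, p_{ij}, d_{ijk})$. The class name, the `Store` alias, the helper function, and in particular the node and edge contents are assumptions made for the example; they are placeholders, not actual BioCyc or KEGG content.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class PathwayGraph:
    """P_ijk = G(V_ijk, E_ijk): one pathway as recorded by one data source."""
    nodes: Set[str] = field(default_factory=set)              # V_ijk
    edges: Set[Tuple[str, str]] = field(default_factory=set)  # E_ijk


# Key (s_i, p_ij, d_ijk) -> P_ijk
Store = Dict[Tuple[str, str, str], PathwayGraph]


def data_sources(store: Store, species: str, pathway: str) -> Set[str]:
    """D_ijk: the data sources that describe `pathway` for `species`."""
    return {d for (s, p, d) in store if s == species and p == pathway}


# Placeholder nodes and edges, not actual database content.
store: Store = {
    ("E. coli", "TCA cycle", "BioCyc"): PathwayGraph(
        nodes={"citrate", "isocitrate"},
        edges={("citrate", "isocitrate")},
    ),
    ("E. coli", "TCA cycle", "KEGG"): PathwayGraph(
        nodes={"citrate", "cis-aconitate", "isocitrate"},
        edges={("citrate", "cis-aconitate"), ("cis-aconitate", "isocitrate")},
    ),
}

print(sorted(data_sources(store, "E. coli", "TCA cycle")))  # prints ['BioCyc', 'KEGG']
```

Keeping one graph per triple, rather than merging on load, preserves what each source originally stated, which is needed when the graphs are later compared node by node and edge by edge.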
