**2. Data management and integration**


Some conventional bioinformatics approaches identify hypothetical interactions between proteins based on their three-dimensional structures or by applying text mining techniques. Emerging protein chip technologies are expected to permit the large-scale measurement of protein expression levels. Corresponding structural data are stored in data sources such as the Protein Data Bank and represent invaluable sources for understanding protein structures, functions and interactions. Successful use of high-throughput protein interaction determination techniques, such as yeast two-hybrid, affinity purification followed by mass spectrometry, and phage display, has shifted the research focus from a single gene or protein to more coherent network perspectives. Large-scale protein-protein interaction data and their complexes are currently available for a number of organisms and are stored in several interaction data sources such as BIND [6], DIP [20], IntAct [21], GRID [22] and MINT [23], all of which are equipped with basic bioinformatics tools for protein network analysis and visualization. INCLUSive is a web portal and service registry for microarray and regulatory sequence analysis [24]. This provides a comprehensive index for all data integration research projects.

The integration and management of heterogeneous sequence data from public sequence data sources is widely used to manage diverse information and to support prediction. It is important for biologists to investigate these heterogeneous sources, connect to the public biological data sources, and retrieve sequences that are similar to the sequences they have; the results of their retrieval are used in homology research, functional analysis, and prediction. However, few software packages are available in most biological laboratories to deal with sequence data, which are stored in file formats. File formats are another important issue for biological pathway data sources. XML and a variety of XML-based formats, including SBML (Systems Biology Markup Language), KGML (KEGG) and BSML (Bioinformatic Sequence Markup Language), are used for representing complex and hierarchical biological data. Each flat file from a public biological database has a different format. Recent tools that convert formats among standards are implemented in Java or as Perl modules. The constraints associated with biological pathway formats are the following:

- Conversion among different formats needs different parsers to extract the fields of interest to the user.
- Formats can be modified at any time.
- Understanding the range of a field and its values is difficult, and the data types of the same field can differ from one format to another.

From the discussion above, one of the major challenges of modern bioinformatics research is therefore to store, process, and integrate biological data in order to understand the inner workings of the cell, which are defined by complex interaction networks. Additionally, integration mechanisms may not record important details, such as copies of the input files and the time of integration, along with the integrated output file.
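The first format constraint above (one parser per format, each extracting only the fields the user cares about) can be illustrated with a small parser registry. This is only a sketch under stated assumptions: the `PathwayRecord` type, the XML layout, and the `ENTRY`/`NAME` flat-file layout are hypothetical and are not the schema of any particular pathway database.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PathwayRecord:
    """Common target record; the choice of fields is an illustrative assumption."""
    pathway_id: str
    name: str


def parse_xml(text: str) -> List[PathwayRecord]:
    # Hypothetical layout:
    # <pathways><pathway id="..."><name>...</name></pathway></pathways>
    root = ET.fromstring(text)
    return [
        PathwayRecord(node.get("id", ""), node.findtext("name", default=""))
        for node in root.iter("pathway")
    ]


def parse_flat(text: str) -> List[PathwayRecord]:
    # Hypothetical flat-file layout: "KEY value" lines, records closed by "//".
    records: List[PathwayRecord] = []
    fields: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "//":
            if fields:
                records.append(
                    PathwayRecord(fields.get("ENTRY", ""), fields.get("NAME", ""))
                )
            fields = {}
        elif line:
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
    if fields:  # tolerate a missing final "//" terminator
        records.append(PathwayRecord(fields.get("ENTRY", ""), fields.get("NAME", "")))
    return records


# One parser per format; supporting a new format means registering a new parser.
PARSERS: Dict[str, Callable[[str], List[PathwayRecord]]] = {
    "xml": parse_xml,
    "flat": parse_flat,
}


def extract(text: str, fmt: str) -> List[PathwayRecord]:
    """Dispatch to the parser for `fmt` and return records in the common form."""
    if fmt not in PARSERS:
        raise ValueError(f"no parser registered for format {fmt!r}")
    return PARSERS[fmt](text)


XML_SAMPLE = '<pathways><pathway id="P1"><name>Glycolysis</name></pathway></pathways>'
FLAT_SAMPLE = "ENTRY P1\nNAME Glycolysis\n//\n"
```

Calling `extract(XML_SAMPLE, "xml")` and `extract(FLAT_SAMPLE, "flat")` yields the same common record for the same pathway. Because each parser is isolated, a change in one source format only requires updating that source's parser.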

In this chapter, issues related to biological pathway data integration systems are discussed, and a user-friendly algorithm for integrating biological pathway data across data sources is presented, with metabolic pathways as the case study. This data integration (BPDI) algorithm integrates pathway information across data sources and also extracts the [...]

The Pathway Resource List contains over 150 biological pathway databases and is growing [26]. Usually, the first step for the user is to identify a subset of these data sources for integration. To consolidate all the knowledge for a particular organism, the pathways from each database need to be extracted and transformed into a standard data representation before integration. The representation of pathway data in each data source poses another challenge, as each pathway modality has its own specific representation issues, which must be understood before attempting integration across modalities. Examples of such modalities are metabolic pathways, signal transduction pathways, protein-protein interactions, and gene regulation.
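Once pathways from each database are in a standard representation, the consolidation step can be sketched as a merge keyed on a shared pathway identifier. The code below is a hedged illustration only and is not the BPDI algorithm of this chapter: the `Pathway` type, its fields, and the assumption that identifiers have already been harmonized across sources are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class Pathway:
    """Hypothetical standard representation of one pathway."""
    pathway_id: str                    # assumed already harmonized across sources
    name: str
    reactions: Set[str] = field(default_factory=set)
    sources: Set[str] = field(default_factory=set)


def integrate(records: Iterable[Pathway]) -> Dict[str, Pathway]:
    """Merge records that share a pathway_id.

    Reactions and source names are unioned; the name of the first record
    seen for an identifier is kept.
    """
    merged: Dict[str, Pathway] = {}
    for rec in records:
        target = merged.get(rec.pathway_id)
        if target is None:
            # Copy the sets so the caller's records are never mutated.
            merged[rec.pathway_id] = Pathway(
                rec.pathway_id, rec.name, set(rec.reactions), set(rec.sources)
            )
        else:
            target.reactions |= rec.reactions
            target.sources |= rec.sources
    return merged


# Placeholder records standing in for two data sources.
from_source_a = Pathway("glycolysis", "Glycolysis", {"R1", "R2"}, {"source_a"})
from_source_b = Pathway("glycolysis", "Glycolysis pathway", {"R2", "R3"}, {"source_b"})
integrated = integrate([from_source_a, from_source_b])
```

Keeping the set of contributing sources on each merged record is a deliberate choice here, so that the origin of every integrated pathway remains traceable.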

Commonly employed styles of data integration may be implemented in different contexts and under different requirements in order to reuse data across applications for research collaboration. Some data integration and management efforts are presented in [27-32]. Several major approaches have been proposed for data integration, which can be roughly classified into five groups [33-34]: data warehousing, federated databasing, service-oriented integration, semantic integration and wiki-based integration. Across all of these groups, an increasingly important component of data integration is the community effort to develop a variety of biomedical ontologies that deal more specifically with the technicality and globality of the descriptors and identifiers of information that has to be shared and integrated across resources. These approaches are discussed below.

#### **Data Warehousing**

The data warehouse approach offers a "one-stop shop" solution to ease access to and management of a large variety of biological data from different data sources. The user does not need to access many web sites for multiple data sources. Despite its advantages, the data warehouse approach has a major problem: it requires continuous and often human-guided updates to keep the data consistent with the evolving data sources, resulting in high maintenance costs. Many biological data sources change their data structures roughly twice a year.
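A minimal sketch of the warehouse idea, using Python's built-in `sqlite3` module as a stand-in for the central store. The two source layouts, the `pathway` table, and the helper names are assumptions made for illustration. The point is that data are translated once at load time into a single local schema and then queried in one place.

```python
import sqlite3
from typing import List, Tuple

# Two hypothetical sources with different native layouts.
SOURCE_A = [{"id": "P1", "title": "Glycolysis"}]   # list of dictionaries
SOURCE_B = [("P2", "TCA cycle")]                   # list of tuples


def build_warehouse(conn: sqlite3.Connection) -> None:
    """Translate every source into one shared schema and load it locally."""
    conn.execute(
        "CREATE TABLE pathway ("
        "pathway_id TEXT, name TEXT, source TEXT, "
        "PRIMARY KEY (pathway_id, source))"
    )
    rows = [(r["id"], r["title"], "source_a") for r in SOURCE_A]
    rows += [(pid, name, "source_b") for pid, name in SOURCE_B]
    conn.executemany("INSERT INTO pathway VALUES (?, ?, ?)", rows)
    conn.commit()


def find_by_name(conn: sqlite3.Connection, fragment: str) -> List[Tuple[str, str, str]]:
    """One-stop query: every source is searched through the same table."""
    cur = conn.execute(
        "SELECT pathway_id, name, source FROM pathway "
        "WHERE name LIKE ? ORDER BY pathway_id",
        (f"%{fragment}%",),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
build_warehouse(conn)
hits = find_by_name(conn, "cycle")
```

The maintenance cost described above shows up in `build_warehouse`: whenever a source changes its layout, the translation code must be revised and the warehouse reloaded.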

#### **Data integration with Federated Approach**

Unlike data warehousing (with its focus on data translation), federated databasing focuses on query translation. The federated database fetches the data from the disparate data sources and then displays the fetched data to its user base. Queries in federated databases are executed within the remote data sources, and the results displayed in federated databases are extracted remotely from those data sources. Due to this capability, federated databasing has two major advantages.
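Query translation can be sketched as a mediator that rewrites a query expressed in a global vocabulary into each source's local field names, runs it at the source, and maps the results back. Everything named below (`RemoteSource`, `federated_query`, the field maps, the sample rows) is a hypothetical stand-in; real sources would be reached over a network rather than held in memory.

```python
from typing import Dict, List


class RemoteSource:
    """In-memory stand-in for a remote data source with its own field names."""

    def __init__(self, name: str, rows: List[Dict[str, str]], field_map: Dict[str, str]):
        self.name = name
        self._rows = rows
        self.field_map = field_map  # global field name -> local field name

    def execute(self, local_query: Dict[str, str]) -> List[Dict[str, str]]:
        # The query runs "at the source", using the source's own vocabulary.
        return [
            row for row in self._rows
            if all(row.get(key) == value for key, value in local_query.items())
        ]


def federated_query(sources: List[RemoteSource], global_query: Dict[str, str]) -> List[Dict[str, str]]:
    """Translate the query per source, execute it remotely, and merge the results."""
    results: List[Dict[str, str]] = []
    for src in sources:
        local_query = {src.field_map[k]: v for k, v in global_query.items()}
        to_global = {local: glob for glob, local in src.field_map.items()}
        for row in src.execute(local_query):
            translated = {to_global[k]: v for k, v in row.items() if k in to_global}
            translated["source"] = src.name
            results.append(translated)
    return results


source_a = RemoteSource(
    "source_a",
    [{"pw_name": "Glycolysis", "org": "E. coli"},
     {"pw_name": "Apoptosis", "org": "H. sapiens"}],
    {"name": "pw_name", "organism": "org"},
)
source_b = RemoteSource(
    "source_b",
    [{"title": "TCA cycle", "species": "E. coli"}],
    {"name": "title", "organism": "species"},
)
answers = federated_query([source_a, source_b], {"organism": "E. coli"})
```

No data are copied into a central store in this sketch; only the query and its results cross the boundary between the mediator and each source.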


Table 1 below shows various data integration efforts and projects for biological pathways.

| Biochemical pathways | Description |
| --- | --- |
| KEGG | Kyoto Encyclopaedia of Genes and Genomes |
| BRITE | Biomolecular Relations in Information Transmission and Expression |
| Biochemical Pathways | Enzyme database and link to biochemical pathway map |
| THCME Medical Biochemistry | Description of several metabolic and biochemical pathways |
| EcoCyc/MetaCyc | Encyclopaedia of E. coli genes and metabolism; Metabolic encyclopedia |
| EMP | Metabolic pathways |
| Interactive Fly | Biochemical pathways in Drosophila |
| Metabolic Pathway | Metabolic pathways of biochemistry |
| Malaria parasite | Malaria Parasite metabolic pathways |
| PathDB | Metabolic pathway information |
| aMAZE | Protein function and biochemical pathways project at EBI |
| UM-BBD | Microbial biocatalytic reactions and biodegradation pathways, primarily for xenobiotic chemical compounds |
| WIT | Function assignments to genes and the development of metabolic models |
| Molecular interaction | Kohn molecular interaction maps |
| BBID | Database of images of biological pathways, macromolecular structures, gene families, and cellular relationships |
| BIND | The biomolecular interaction network database |
| **Signaling pathways** | |
| Apoptosis | Pathways of apoptosis at KEGG |
| BioCarta | Several signalling pathways |

**2.1. Survey of Pathway Databases and Integration Efforts** 


#### **Service-Oriented Approach**

A decentralized approach is also being developed, in which individual data sources agree to open their data via Web Services (WS). The service-oriented approach enables data integration from multiple heterogeneous data sources through computer interoperability: computers communicate with one another via Web APIs and retrieve up-to-date data from diverse data sources. Heterogeneous data integration requires that many data sources become service providers by opening their data via WS and by standardizing data identities and nomenclature to ease data exchange and analysis.
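The role of standardized identifiers in service-oriented integration can be sketched as follows. This is an assumption-laden illustration: `InMemoryService` merely imitates a web service by returning JSON text, and the synonym table, the identifier `PW:0001`, and the payloads are invented placeholders rather than entries from any real registry.

```python
import json
from typing import Dict, List, Optional

# Hypothetical synonym table mapping free-text names to one canonical identifier.
SYNONYMS: Dict[str, str] = {
    "glycolysis": "PW:0001",
    "embden-meyerhof pathway": "PW:0001",
}


def canonical_id(name: str) -> str:
    """Normalize a pathway name to the identifier shared by all providers."""
    return SYNONYMS[name.strip().lower()]


class InMemoryService:
    """Stand-in for a remote provider; replies with JSON text as a Web API would."""

    def __init__(self, provider: str, payloads: Dict[str, dict]):
        self.provider = provider
        self._payloads = payloads

    def get_pathway(self, pathway_id: str) -> str:
        data: Optional[dict] = self._payloads.get(pathway_id)
        return json.dumps({"provider": self.provider, "id": pathway_id, "data": data})


def collect(services: List[InMemoryService], name: str) -> Dict[str, dict]:
    """Ask every provider for the same canonical identifier and gather the answers."""
    pathway_id = canonical_id(name)
    replies = [json.loads(service.get_pathway(pathway_id)) for service in services]
    return {r["provider"]: r["data"] for r in replies if r["data"] is not None}


services = [
    InMemoryService("provider_a", {"PW:0001": {"reactions": ["R1", "R2"]}}),
    InMemoryService("provider_b", {"PW:0001": {"genes": ["G1"]}}),
    InMemoryService("provider_c", {}),
]
combined = collect(services, "Embden-Meyerhof pathway")
```

Because every provider is queried with the same canonical identifier, the client needs no per-provider name matching, which is the benefit of standardizing data identities and nomenclature.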

#### **Semantic Web**

Most web pages in biological data sources are designed for human reading. RDF provides a standard format for data interchange and describes data as simple statements, each of which is a triple consisting of a subject, a predicate, and an object. Any two statements can be linked by an identical subject or object. OWL builds on RDF and Uniform Resource Identifiers (URIs) and describes data structure and meaning based on ontologies, which enables automated reasoning and inference by computers. The application of semantic Web technologies is a significant advancement for bioinformatics, enabling automated data processing and reasoning. Semantic integration uses ontologies for data description and thus represents ontology-based integration. The review in [27] covers the current development of semantic network technologies and their applications to the integration of genomic and proteomic data; it elaborates on applying a semantic network approach to modeling complex cell signaling pathways and simulating the cause-effect of molecular interactions in human macrophages. The work in [31] illustrates its approach by comparing the federated, warehousing and semantic web approaches using multiple sources.
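The triple model and the linking of statements through a shared subject or object can be sketched with plain Python tuples. This is not an RDF library: the `ex:` prefix stands for a hypothetical namespace, and the helper names `match` and `follow` are assumptions introduced for illustration.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical statements; "ex:" abbreviates an assumed example namespace.
TRIPLES: List[Triple] = [
    ("ex:hexokinase", "ex:catalyzes", "ex:reaction1"),
    ("ex:reaction1", "ex:partOf", "ex:glycolysis"),
    ("ex:glycolysis", "rdf:type", "ex:MetabolicPathway"),
]


def match(triples: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return the triples agreeing with every position that is not None."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def follow(triples: List[Triple], start: str, predicates: List[str]) -> Set[str]:
    """Walk a chain of predicates.

    Two statements are linked when the object of one is the subject of the next.
    """
    frontier: Set[str] = {start}
    for predicate in predicates:
        frontier = {
            obj
            for node in frontier
            for (_, _, obj) in match(triples, s=node, p=predicate)
        }
    return frontier


pathways_of_enzyme = follow(TRIPLES, "ex:hexokinase", ["ex:catalyzes", "ex:partOf"])
```

Here the statement about the enzyme and the statement about the reaction are joined through the shared node `ex:reaction1`, which is the kind of link that lets separately published statements be combined automatically.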

#### **Wiki-based Integration**

A weakness common to all the above approaches is that user participation in the process is inadequate. With the increasing volume of biological data, data integration will inevitably require the participation of a large number of users. A successful example that harnesses collective intelligence for data aggregation and knowledge collection is Wikipedia, an online encyclopedia that allows any user to create and edit content. It is infeasible to integrate such large amounts of data into a single point (such as a data warehouse). Data sources are developed for different purposes and fulfill different functions. Therefore, it is promising to establish an efficient way to exchange data among these distributed and heterogeneous data sources. However, a dozen data sources are designed merely for data storage, not for data exchange.
