**1.1. Overview**

There has been a dramatic increase in the number of large scale comprehensive biological databases that provide useful resources to the community like; Biochemical Pathways (KEGG, AraCyc, and MapMan), Protein Interactions (biomolecular interaction network database), or systems like; Dragon Plant Biology Explorer and Pathway Miner for integrating associations in metabolic networks and ontologies [3-8]. Other databases such as Regulon DB, PlantCARE, PLACE, EDP:Eurokaryotic promoter database, Transcription Regulatory Regions Database, Athamap, and TRANSFAC store information related to transcriptional regulation[9-15].

The aim of molecular biology is to understand the regulation of protein synthesis and its reactions to external and internal signals. All the cells in an organism carry the same genomic data, yet their protein makeup can be drastically different; both temporally and spatially, due to regulation. Protein synthesis is regulated by many mechanisms at its different stages. These include mechanisms for controlling transcription initiation, RNA splicing, mRNA transport, translation initiation, post-translational modifications, and degradation of mRNA/protein. One of the main junctions at which regulation occurs is mRNA transcription. A major role in this machinery is played by proteins themselves that bind to regulatory regions along the DNA, greatly affecting the transcription of the genes they regulate [16]. Friedman introduces a new approach for analyzing gene expression patterns that uncovers properties of the transcriptional program by examining statistical properties of dependence and conditional independence in the data.

4 Bioinformatics

[2].

mean.

**1.1. Overview** 

laboratories. These intellectual gaps can be bridged by adopting new technology, mergers,

For the open source biological databases, it is common for the biologists and researchers to refer to many databases in order to pursue inference or analysis; though it is one of the most challenging tasks. Biological pathway data integration is aimed to work with repositories of data from a variety of sources. As such, two or more databases may not provide identical information for a given pathway, but integrating these two databases may yield a richer resource for analysis. Additionally, the conditions under which data is collected, either by experimentation or by collecting evidence of the published material, in either case the supporting references play a crucial role and is of interest to the biologists in making the analysis more meaningful. At present there are over 200 biological pathway databases. However, very few of them are independently created. Some of these databases may be derived from different data sources. Unfortunately, the documentation often does not reveal details of the data collection, sources, and dates. Further, the research groups involved in analysis of the data usually selectively use data from a single data source. For example, for yeast studies, the Saccharomyces Genome Database (SGD) is the reference for most analyses

In case of biological pathway data, rapid accumulation of genomic and proteomic data have

The lack of communication between different bioinformatics data resources; whether

Biological data are hierarchical and highly related yet are conventionally stored

Additionally, they are governed more by how data is obtained rather than by what they

Most commercially available bioinformatics systems perform functional analysis using a single data source; an approach that emphasizes pathway mapping and relationship inference based on the data acquired from multiple data sources. Each pathway modality in the data has its own specific representation issues which must be understood before

There has been a dramatic increase in the number of large scale comprehensive biological databases that provide useful resources to the community like; Biochemical Pathways (KEGG, AraCyc, and MapMan), Protein Interactions (biomolecular interaction network database), or systems like; Dragon Plant Biology Explorer and Pathway Miner for integrating associations in metabolic networks and ontologies [3-8]. Other databases such as Regulon DB, PlantCARE, PLACE, EDP:Eurokaryotic promoter database, Transcription Regulatory Regions Database, Athamap, and TRANSFAC store information related to

acquisitions, and geographic coordination of collaborating groups [1].

made two major bioinformatics problems apparent.

attempting to integrate across modalities.

transcriptional regulation[9-15].

they are databases or individual analysis programs.

separately in individual database and in different formats.

For protein interactions, it is intended to connect related proteins and link biological functions in the context of larger cellular processes [17]. The content of these data sources typically complements the experimentally determined protein interactions with the ones that are predicted from gene proximity, fusion, co-expressed data, as well as those determined by using phylogenetic profiling. Each pathway modality in the data has its own specific representation issues which must be understood before integration across modalities is attempted. At present, the bioinformatics database owner only develops private system to provide user with data query and analysis services; such as NCBI develops Entrez database query system which is used on GenBank. European Molecular Biology Laboratory (EMBL) develops Sequence Retrieval Systems. The EMBL Nucleotide Sequence Database maintained at the European Bioinformatics Institute (EBI), incorporates, organizes, and distributes nucleotide sequences from public sources [18]. The database is a part of an international collaboration with DDBJ (Japan) and GenBank (USA). Data are exchanged between the collaborating databases on a daily basis to achieve optimal synchrony. The key point is how to share the heterogeneous databases and make a common query platform for users [19].

Friedman [16] describes early microarray experiments that examined few samples and mainly focused on differential display across tissues or conditions of interest. Such experiments collect enormous amounts of data, which clearly reflects many aspects of the underlying biological processes. An important challenge is to develop methodologies that are both statistically sound and computationally tractable for analyzing such data sets and inferring biological interactions from them. Most of the analysis tools currently used are based on clustering algorithms. The clustering algorithms attempt to locate groups of genes that have similar expression patterns over a set of experiments. Such analysis has proven to be useful in discovering genes that are co-regulated and/or have similar function. A more ambitious goal for analysis is to reveal the structure of the transcriptional regulation process. This is clearly a hard problem. Not only the current data is extremely noisy, but, mRNA expression data alone only gives a partial picture that does not reflect key events such as; translation and protein (in) activation. Finally, the amount of samples, even in the largest experiments in the foreseeable future, does not provide enough information to construct a fully detailed model with high statistical significance.

Some conventional bioinformatics approaches identify hypothetical interactions between proteins based on their three dimensional structures or by applying text mining techniques. Emerging protein chip technologies are expected to permit the large scale measurement of protein expression levels. Corresponding structural data are stored in data source such as protein data bank and represent invaluable sources of understanding of protein structures, functions and interactions. Successful use of high throughput protein interaction determination techniques such as yeast two hybrids, affinity purification followed by mass spectrometry and phage display has shifted research focus from a single gene/protein to more coherent network perspectives. Large scale protein-protein interaction data and their complexes are currently available for a number of organisms and data are stored in several interaction data sources such as BIND [6], DIP [20], IntAct [21], GRID [22] and MINT [23] that is all equipped with basic bioinformatics tools for protein network analysis and visualization. INCLUSive is a web portal and service registry for microarray and regulatory sequence analysis [24]. This provides a comprehensive index for all data integration research projects.

Hierarchical Biological Pathway Data Integration and Mining 7

abstract information embedded within them are addressed. Today, a bioinformatics information system typically deals with large data sets reaching a total volume of about one

It infers relationships based on the stored data and subsequently predicts missing

The Pathway Resource List contains over 150 biological pathway databases and is growing [26]. Usually, first step for the user is to identify a subset of these data sources for integration. To consolidate all the knowledge for a particular organism, extract the pathways from each database need to be extracted and transformed into a standard data representation before integration. Representation of the pathway data in each data source poses another challenge as each pathway modality has its own specific representation issues which must be understood before attempting integration across modalities. For example, metabolic pathways, signal

Commonly employed styles of data integration may be implemented in different contexts and under requirements, in order to reuse the data across applications for research collaboration. Some of the data integration and management efforts are presented in [27-32]. Several major approaches have been proposed for data integration, which can be roughly classified into five groups [33-34] namely; data warehousing, federated databasing, serviceoriented integration, semantic integration and wiki-based integration. Across all of these groups, to a significant extent, an increasingly important component of data integration is the community effort in developing a variety of biomedical ontologies to deal in a more specific manner with the technicality and globality of descriptors and identifiers of information that has to be shared and integrated across various resources. Variety of

The data warehouse approach offers a "one-stop shop" solution to ease access and management of a large variety of biological data from different data sources. The user does not need to access many web sites for multiple data sources. Despite its advantages, the data warehouse approach has a major problem; it requires continuous and often human-guided updates to keep the data comprehensive of the evolution of data sources, resulting in high costs for maintenance. Many biological data sources change their data structures roughly

Unlike data warehousing (with its focus on data translation), federated databasing focuses on query translation. The federated database fetches the data from the disparate data

User can select the data sources and assign confidence to each selected data source

attribute values and incoming information based on multidimensional data. Data marts (extension of data warehouse) support different query requests.

terabyte [25]. Such a system serves many purposes;

**2. Data management and integration** 

approaches for data integration is discussed below.

**Data integration with Federated Approach** 

**Data Warehousing** 

twice a year.

It organizes existing data to facilitate complex queries

transduction pathways, protein-protein interaction, gene regulation etc.

The integration and management technique of heterogeneous sequence data from public sequence data source is widely used to manage diverse information and prediction. It is important for the biologists to investigate these heterogeneous sources and connect the public biological data source and retrieve sequences which are similar to sequences they have, and the results of their retrieval are used in homology research, functional analysis, and predication. However, there are few software packages available to deal with the sequence data in most biological laboratories and they are stored in file formats. File formats is another important issue for biological pathway data sources. XMl, SBML (systems biology markup language), KBML (KEGG), BSML (Bioinformatic Sequence Markup Language) based on XML, and a variety of versions of XML are used for representing the complex and hierarchical biological data. Each flat file from public biological database has different format. Recent tools which convert formats among standards are implemented in JAVA or Perl module. The constraints associated with biological pathway formats are the following;


From the discussions above, one of the major challenges of the modern bioinformatics research is therefore to store, process, and integrate biological data to understand the inner working of the cell defined by complex interaction networks. Additionally, the integration mechanisms may not register the important details like, copies of inputs files and time of integration along with the integrated output file.

In this chapter, issues related to biological pathway data integration system are discussed and a user friendly data integration algorithm across data sources for biological pathway, particularly, metabolic pathway as a case is presented. i.e. the data integration (BPDI) algorithm that integrates pathway information across data sources and also extracts the abstract information embedded within them are addressed. Today, a bioinformatics information system typically deals with large data sets reaching a total volume of about one terabyte [25]. Such a system serves many purposes;

