**4. State of the art of approaches to integration**

A data integration system addresses the problems caused by the growing number of public data sources by offering a unified view of them. Such a system acts as the interface between the user and the data sources and simplifies querying: a single request queries all the sources covered by the system. The user does not need to know where the data are located or how they are structured.

#### **4.1. Current integration approaches**

There are two major approaches to information integration: (1) the data warehouse (DW), or materialized, approach and (2) the virtual (mediator-based) approach. In the DW approach, a huge amount of historical data is stored in the DW. In the virtual approach, on the other hand, the data is not materialized but is manipulated globally through views. Each of these approaches is suited to certain kinds of applications.

#### *4.1.1. Data warehouse*

A DW is a powerful tool for decision support and querying because it explicitly stores information from heterogeneous sources locally. However, some external data, such as competitors' new product announcements and currency exchange rates, may be needed to support accurate business decisions. Such data should not be neglected, in order to avoid incomplete, inexact, or sometimes wrong results. Warehousing huge and frequently changing information is a big challenge for the following reasons.

First, the data in the DW is loaded in snapshots, and the DW is a huge information repository. Second, because the data sources change frequently, maintenance becomes complicated and costly.

Here are two examples of using data warehouses:

- Genomics Unified Schema, GUS [4], is a system for creating a data warehouse focused on molecular biology;
- Gene Expression Data Warehouse, GEDAW [5], is a warehouse dedicated to the analysis of the transcriptome of human liver.

**Figure 1.** Simple schematic for a data warehouse

#### *4.1.2. The virtual approach (mediator based)*

154 Lipoproteins – Role in Health and Diseases

In this approach, the actual data resides in the sources, and queries against the integrated 'virtual' view are decomposed into sub-queries and posed to the sources. This approach is preferred over the materialized (DW) approach when the information sources change very often. On the other hand, the DW approach may be preferable when quick query answers are required and the information sources rarely change.
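
The trade-off between the two approaches can be illustrated with a minimal Python sketch. Everything in it is hypothetical (the `Source`, `Warehouse` and `VirtualView` classes, the source names, and the toy records); it only assumes that a warehouse copies data at load time while a virtual view forwards each query to the live sources.

```python
import copy


class Source:
    """A toy autonomous data source whose content may change at any time."""

    def __init__(self, name, records):
        self.name = name
        self.records = records  # list of dicts

    def query(self, predicate):
        return [r for r in self.records if predicate(r)]


class Warehouse:
    """Materialized approach: copy the sources locally at load time."""

    def __init__(self, sources):
        self.sources = sources
        self.snapshot = []

    def load(self):
        # Later source updates stay invisible until the next load.
        self.snapshot = [copy.deepcopy(r) for s in self.sources for r in s.records]

    def query(self, predicate):
        # Answered locally, without contacting any source.
        return [r for r in self.snapshot if predicate(r)]


class VirtualView:
    """Virtual approach: store nothing, send a sub-query to every source."""

    def __init__(self, sources):
        self.sources = sources

    def query(self, predicate):
        results = []
        for s in self.sources:
            results.extend(s.query(predicate))
        return results


def is_human(record):
    return record["organism"] == "human"


genes = Source("genes_db", [{"id": "G1", "organism": "human"}])
proteins = Source("proteins_db", [{"id": "P1", "organism": "human"}])

dw = Warehouse([genes, proteins])
dw.load()
view = VirtualView([genes, proteins])

# One source changes after the warehouse was loaded.
genes.records.append({"id": "G2", "organism": "human"})

stale = dw.query(is_human)    # served from the snapshot
fresh = view.query(is_human)  # served from the live sources
print(len(stale), len(fresh))  # prints "2 3"
```

The warehouse answers without touching the sources but misses the new record until it is reloaded, whereas the virtual view sees the change at the cost of contacting every source for each query.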

The most important step in the construction of a mediator is the creation of the global schema. The mapping consists of the relations between the global schema and the local sources. How this mapping is specified determines how difficult query reformulation is and how easily sources can be added to or removed from the system. Two methods are commonly used to determine the global schema: Global As View (GAV), in which the global schema is defined as a set of views over the local sources, and Local As View (LAV), in which each local source is described as a view over the global schema.


**Figure 2.** GAV vs. LAV

**Figure 3.** Architecture of a mediator

In fact, these two approaches are not opposed but complementary, depending on the problem to be solved. To integrate a few sources, most of which are stable, it is better to use the GAV method. In contrast, for large-scale integration the LAV method is preferable, because a change at a local source has little or no impact on the global schema.
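
The difference between the two methods can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a real query rewriter: the global relation `Protein(id, name, organism)`, the three toy sources and their field names are all assumptions made for the example. Under GAV the global relation is written once as a view over all sources; under LAV each source carries its own description in terms of the global schema.

```python
# Toy local sources with different local schemas (all values are illustrative).
source_a = [{"acc": "P1", "name": "ApoA1", "species": "human"}]
source_b = [{"pid": "M7", "label": "ApoE"}]  # by assumption, mouse proteins only


def a_to_global(row):
    return {"id": row["acc"], "name": row["name"], "organism": row["species"]}


def b_to_global(row):
    return {"id": row["pid"], "name": row["label"], "organism": "mouse"}


# GAV: the global relation Protein(id, name, organism) is defined once,
# as a view over every known source.
def gav_protein():
    return [a_to_global(r) for r in source_a] + [b_to_global(r) for r in source_b]


def gav_query(organism):
    # Reformulation is plain view unfolding: expand the view, then filter.
    return [r for r in gav_protein() if r["organism"] == organism]


# LAV: each source is described on its own, as a view over the global schema.
# "covers" records the restriction the source satisfies (None = no restriction).
lav_descriptions = [
    {"rows": source_a, "to_global": a_to_global, "covers": None},
    {"rows": source_b, "to_global": b_to_global, "covers": "mouse"},
]


def lav_query(organism):
    answers = []
    for desc in lav_descriptions:
        # Simplified rewriting: skip sources whose description cannot contribute.
        if desc["covers"] not in (None, organism):
            continue
        rows = map(desc["to_global"], desc["rows"])
        answers += [g for g in rows if g["organism"] == organism]
    return answers


# Adding a source under LAV only appends a description; under GAV the
# definition of gav_protein() itself would have to be edited.
source_c = [{"key": "R3", "title": "ApoB"}]  # by assumption, rat proteins only
lav_descriptions.append({
    "rows": source_c,
    "to_global": lambda r: {"id": r["key"], "name": r["title"], "organism": "rat"},
    "covers": "rat",
})

print(gav_query("human") == lav_query("human"))  # prints "True"
print(lav_query("rat"))
```

The sketch shows why GAV makes reformulation easy (the view is simply expanded) while LAV makes the set of sources easy to extend; real LAV systems need a much more elaborate rewriting step than the filter used here.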

Two examples of mediator-based integration systems:

- *Tambis (Transparent Access To Multiple Bioinformatics Information Sources) [6]* is an integration system coupled to an ontology that allows for better interoperability between sources;
- *K2/BioKleisli [7]* is a system based on CPL (Collection Programming Language), a high-level query language for querying multiple sources.

#### *4.1.3. The multi agents approach*

This approach was used in the GID-IGC (*Integrated Genomic Database - Genome Information System*) project. The proposed architecture uses a network of agents that communicate with each other via CORBA and KQML. Each agent has a specific function, such as the *EIA* (External Interface Agent), which manages the user interface, or the *SCA* (Dial Selector Agent), which decomposes the global query into sub-queries for the local data sources. This approach is very modular and easily extensible.

#### *4.1.4. Navigating between sources*

This approach is based on what users usually do when searching for information on the web: they move from page to page by clicking links. In practice, queries written for this type of tool are converted into path expressions. The data banks are then integrated on the basis of their cross-references. These path expressions can answer the user's query with different levels of satisfaction.

A reference is a link between two data sources (Figure 4), a bridge between pieces of information relating to the same object or the same concept. It can be expressed through an identifier from an external source or a URL (Uniform Resource Locator). If the link can be followed in both directions, it is called a cross-reference.
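
As a rough illustration of navigation over references, the sources can be viewed as nodes of a directed graph and a path expression as a sequence of sources linked by references. The Python sketch below is only an assumption about how such paths might be enumerated: the source names, the links, the `max_hops` limit, and the idea of listing shorter paths first are all hypothetical.

```python
from collections import deque

# Directed references between sources (hypothetical names and links).
links = {
    "genes": ["proteins", "literature"],
    "proteins": ["genes", "structures"],
    "structures": ["literature"],
    "literature": [],
}


def is_cross_reference(a, b):
    """A link that can be followed in both directions is a cross-reference."""
    return b in links.get(a, []) and a in links.get(b, [])


def find_paths(start, goal, max_hops=3):
    """Enumerate cycle-free paths from start to goal, shortest first."""
    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        last = path[-1]
        if last == goal:
            paths.append(path)
            continue
        if len(path) - 1 >= max_hops:
            continue  # do not extend paths beyond the hop limit
        for nxt in links.get(last, []):
            if nxt not in path:  # never revisit a source on the same path
                queue.append(path + [nxt])
    return paths


for path in find_paths("genes", "literature"):
    print(" -> ".join(path))
```

Each returned path is one candidate way of answering a query that starts in one source and ends in another; a real system would also have to follow the individual identifiers or URLs along each path.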

**Figure 4.** Navigating between sources

#### **4.2. Adopted approach**

In this work, we are interested in mediation systems. Such systems offer a uniform and centralized view of distributed data. This view may also present the data in a more abstract, condensed, and qualitative form that is therefore more meaningful to the user. Mediation systems are also very useful in the presence of heterogeneous data, because they give the user the impression of working with a homogeneous system.

In this architecture, each component provides a set of features which, taken together, help to satisfy the user's request.

The mediator is a software module that directly receives the user's request. It has to locate the information needed to answer the query, resolve schematic and semantic conflicts, query the different sources, and integrate the partial results into a consistent and coherent response. This is the most complex component, but only one instance of it is necessary (unlike adapters, of which there are several). It provides access to multiple data sources as if they were a single one and offers this access through multiple languages and ontologies.
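
The mediator's responsibilities described above (locating sources, resolving schema conflicts, querying, and integrating partial results) can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the `Adapter` and `Mediator` classes, the field mappings, the toy rows, and the rule that the first source to supply a field wins are all hypothetical.

```python
class Adapter:
    """Wraps one source and translates its local field names to global ones."""

    def __init__(self, name, rows, field_map):
        self.name = name
        self.rows = rows
        self.field_map = field_map  # global field name -> local field name

    def supports(self, fields):
        return any(f in self.field_map for f in fields)

    def fetch(self, key_value):
        local_key = self.field_map["id"]
        return [
            {g: row[l] for g, l in self.field_map.items()}
            for row in self.rows
            if row[local_key] == key_value
        ]


class Mediator:
    """Single entry point that hides the individual sources from the user."""

    def __init__(self, adapters):
        self.adapters = adapters

    def answer(self, key_value, wanted_fields):
        # 1. Locate the sources able to contribute at least one wanted field.
        relevant = [a for a in self.adapters if a.supports(wanted_fields)]
        # 2. Send a sub-query to each relevant source through its adapter.
        partials = [row for a in relevant for row in a.fetch(key_value)]
        if not partials:
            return None
        # 3. Integrate the partial results (assumption: first value wins).
        merged = {"id": key_value}
        for row in partials:
            for f in wanted_fields:
                if f in row and f not in merged:
                    merged[f] = row[f]
        return merged


mediator = Mediator([
    Adapter("names_db", [{"acc": "P1", "protein_name": "ApoA1"}],
            {"id": "acc", "name": "protein_name"}),
    Adapter("functions_db", [{"uid": "P1", "role": "transport"}],
            {"id": "uid", "function": "role"}),
    Adapter("structures_db", [{"code": "P1", "fold": "helical"}],
            {"id": "code", "structure": "fold"}),
])

print(mediator.answer("P1", ["name", "function"]))
```

Only one `Mediator` instance is needed, while one `Adapter` exists per source; adding a source means adding an adapter with its own field mapping.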

**Figure 5.** Adopted approach

This is a crucial component that allows a local system to distribute its information to a community of users.

	- Location: reference, communication protocol, access technique (JDBC, ODBC API), medium (DBMS, web pages)
	- Type of data managed: structured (relational, object), semi-structured (XML, OEM), unstructured (image, multimedia)
	- Query capabilities: SQL, OQL, search
	- Result formats: XML, HTML, relations, text
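
The four characteristics listed above could be recorded in a small source catalog that the mediator consults before sending a query. The Python sketch below is one possible encoding under assumptions: the `SourceDescription` class, its field names, the two catalog entries and their addresses are hypothetical examples, not part of the described system.

```python
from dataclasses import dataclass
from enum import Enum


class DataKind(Enum):
    STRUCTURED = "structured"            # relational, object
    SEMI_STRUCTURED = "semi-structured"  # XML, OEM
    UNSTRUCTURED = "unstructured"        # image, multimedia


@dataclass(frozen=True)
class SourceDescription:
    name: str
    # Location
    reference: str   # address of the source (hypothetical values below)
    protocol: str    # communication protocol
    access: str      # access technique, e.g. JDBC or ODBC
    medium: str      # storage medium, e.g. DBMS or web pages
    # Type of data managed
    data_kind: DataKind
    # Query capabilities and result formats
    query_capabilities: tuple
    result_formats: tuple

    def can_answer(self, language: str) -> bool:
        return language in self.query_capabilities


catalog = [
    SourceDescription(
        name="genes_db",
        reference="jdbc:postgresql://localhost/genes",
        protocol="TCP/IP",
        access="JDBC",
        medium="DBMS",
        data_kind=DataKind.STRUCTURED,
        query_capabilities=("SQL",),
        result_formats=("relations",),
    ),
    SourceDescription(
        name="annotations_site",
        reference="http://example.org/annotations",
        protocol="HTTP",
        access="web form",
        medium="web pages",
        data_kind=DataKind.SEMI_STRUCTURED,
        query_capabilities=("search",),
        result_formats=("HTML", "XML"),
    ),
]


def sources_for(language):
    """Names of the catalog entries that accept queries in the given language."""
    return [s.name for s in catalog if s.can_answer(language)]


print(sources_for("SQL"))
```

Keeping these descriptions as data rather than code means that registering a new source amounts to adding one catalog entry.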
