### **5. Studied scenario**

Having explored the different integration systems that address the problem of heterogeneous biological data sources, we now describe the scenario we have chosen to work on.

#### **5.1. Biologist's need**

The objective of our work is to develop an integration system for biological data with an application to familial hypercholesterolemia. Such a system should facilitate access to multiple data sources available on the Web in a transparent and uniform way, by giving biologists a single virtual source that summarizes the sites of interest to the application.

To satisfy this need, we studied the biologists' current way of working, focusing first on the tools they use, the data sources they consult, and the functional specifications of those sources.

Tools: the software tools they mainly use.

158 Lipoproteins – Role in Health and Diseases

**Figure 5.** Adopted approach

As summarized in Figure 5, the mediation architecture comprises three kinds of components:

Mediator: the most complex component, but only one instance of it is necessary (unlike the multiple adapters). It receives queries, locates the relevant sources and returns the response. It provides access to multiple data sources as if they were a single one, and offers this consultation through multiple languages and ontologies. It is a crucial component that allows a local system to distribute its information to a community of users.

Adapter (wrapper): allows the presentation of data in the mediator's syntactic format; it is thus an interface for querying a data source in a standardized (pivot) language.

Data source: represents the sources and banks of biological information. A data source can be described by its support (DBMS, web pages), its format, which may be structured (relational, OEM) or unstructured (image, multimedia), and its community of users.


Data sources: the focus was mainly on the sources described in Section 5.3, in particular PubMed and PDB.


#### **5.2. Adopted scenario**

The adopted scenario consists of building a "homemade" local database that includes the unorganized data already available in the biologist's laboratory, in our particular case data related to LDL receptor mutations. This is combined with the searches we perform on the Web through the mediation system:

**Figure 6.** The adopted scenario

The integration platform is the mediation system. It will query data sources related to cardiovascular diseases, especially familial hypercholesterolemia, namely PDB and PubMed. The results of the query are processed by the CHARMM tool before being presented to the user [12, 13].

#### **5.3. Selected data sources**

**PubMed [8]**: is the leading bibliographic search engine for all fields of biology and especially medicine. It was developed by the National Center for Biotechnology Information (NCBI) and is hosted by the National Library of Medicine of the U.S. National Institutes of Health. PubMed is a free search engine giving access to the MEDLINE bibliographic database, which gathers citations and abstracts of biomedical research.

In April 2007, the MEDLINE database contained more than 15 million citations dating back to 1950, published in 5,000 distinct biomedical journals (journals in biology and medicine). It is the reference database for the biomedical sciences. As with other indexes, a PubMed citation does not itself include the article's content. In addition to MEDLINE, PubMed also provides access to:

- OldMedline, for articles before 1966;
- citations of all articles, even "irrelevant" ones (that is to say, covering topics such as plate tectonics or astrophysics) from certain MEDLINE journals, primarily major general-science or biochemistry journals (such as Science and Nature, for example);
- citations listed before their indexing in MEDLINE with MeSH, or while in transit, or with "off topic" status;
- older citations selected for the MEDLINE journal from which they arise (when they are supplied electronically by the publisher);
- articles submitted to PubMed Central for free full text.
Most citations include a link to the full article when it is available (e.g. in PubMed Central). PubMed is a search engine that allows users to search the MEDLINE database; this information is also available from private providers such as Ovid and SilverPlatter, among others. PubMed has been free since the mid-1990s. For optimal use of PubMed, it is necessary to understand its core, MEDLINE, and especially the MeSH vocabulary used for indexing articles in MEDLINE.

PubMed also holds information about the journals themselves, which can be searched by title, subject, short title, NLM ID, ISO abbreviation, and print or electronic ISSN (International Standard Serial Number). The journals database covers all journals entered in the base.

The major interest of these bibliographic databases is that:

- they help to establish bibliographies (lists of relevant articles) on a subject or an author;
- they are used to find references to documents and to select, print or export them to other software; they may also offer to order documents or provide access to the full text;
- they make it possible to identify recent publications in scientific journals;
- they are portals for accessing full-text documents available on the Internet.
**PDB (Protein Data Bank)**: the protein databank of the Research Collaboratory for Structural Bioinformatics, more commonly known as the Protein Data Bank or PDB, is a worldwide collection of data on the three-dimensional (3D) structures of biological macromolecules: essentially proteins and nucleic acids.

Founded in 1971 by Brookhaven National Laboratory, the Protein Data Bank was transferred in 1998 to the Research Collaboratory for Structural Bioinformatics (RCSB), a consortium of Rutgers University, the University of Wisconsin at Madison, the National Institute of Standards and Technology (NIST) and the San Diego Supercomputer Center. The PDB originally contained (in 1971) 7 structures. The number of deposited structures has grown steadily since the 1980s: crystallographic techniques improved, structures determined by NMR were added, and the scientific community changed its view on data sharing.

On 28 April 2008 the PDB contained 50,480 structures. The data are in the original pdb format and, in recent years, also in the mmCIF format, developed specifically for structural data from the PDB. Between 2,000 and 3,000 structures are added each year. The bank contains a file for each molecular model. These files describe the exact location of each atom of the studied macromolecule, that is to say the Cartesian coordinates of the atom in a three-dimensional frame.

Each model is referenced in the bank by a unique four-character identifier; the first character is always numeric and the next three are alphanumeric. This identifier is called the **"pdb code"**.
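As a concrete illustration of this naming convention, the small helper below (our own sketch, not part of any PDB tooling) checks the four-character pattern, using the code 1cbn mentioned later in this section:

```java
// Minimal check of the PDB identifier convention described above:
// four characters, the first numeric, the remaining three alphanumeric.
// The class and method names are ours, introduced only for illustration.
public class PdbCode {
    public static boolean isValid(String code) {
        // Exactly one digit followed by three letters or digits.
        return code != null && code.matches("[0-9][A-Za-z0-9]{3}");
    }
}
```

For example, `PdbCode.isValid("1cbn")` returns true, while a code starting with a letter is rejected.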

Several formats exist for PDB files:

**The PDB format**: this is the original format. Its guide has been revised several times; the current version, 2.2 [14], has existed since 1996. The pdb format was originally dictated by the width of the punched cards then used by computers; consequently, each line contains exactly 80 characters.

A pdb file is a text file in which each column has a meaning: every parameter has a fixed position. Thus the first 6 columns, that is to say the first 6 characters of a given line, determine the record type. Examples of record types include "TITLE" (the title of the macromolecule of interest), "KEYWDS" (the keywords of the entry), "EXPDTA" (information on the experimental method used), "SEQRES" (the sequence of the protein under study), and "ATOM" or "HETATM" (fields containing all information related to a particular atom).

Limitations of the pdb format: the 80-column layout is relatively restrictive. The maximum number of atoms in a pdb file is 99,999, since only 5 columns are allocated to atom numbers. Similarly, the number of residues per chain is at most 9,999: only 4 columns are allowed for this figure. The number of chains is limited to 62: a single column is available, and the possible values are one of the 26 letters of the alphabet in upper or lower case, or one of the digits 0 through 9. When the format was defined these limitations did not seem restrictive, but they have been reached several times during the deposition of extremely large structures such as viruses, ribosomes and multienzyme complexes.
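The fixed-column convention can be made concrete with a short sketch. The class and method names are ours; the column ranges are those of the classic pdb ATOM record (atom name in columns 13-16, Cartesian coordinates in columns 31-54), and the 5-column serial field is precisely what caps atom numbers at 99,999:

```java
// A sketch (not official PDB software) of reading one fixed-column
// "ATOM" record, illustrating the 80-column layout described above.
public class PdbAtomLine {
    public static double[] coords(String line) {
        // Coordinates occupy columns 31-38, 39-46 and 47-54 (1-based),
        // i.e. 0-based substring indices 30-38, 38-46 and 46-54.
        return new double[] {
            Double.parseDouble(line.substring(30, 38).trim()),
            Double.parseDouble(line.substring(38, 46).trim()),
            Double.parseDouble(line.substring(46, 54).trim())
        };
    }
    public static String atomName(String line) {
        return line.substring(12, 16).trim();  // columns 13-16
    }
}
```

Applied to the line `ATOM      1  N   MET A   1      38.066  10.621   8.740`, the sketch returns the atom name "N" and the coordinate triple (38.066, 10.621, 8.740).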

**mmCIF format**: the growing interest in databases and electronic publication in the late 1980s created the need for a more structured, standardized, extensible and high-quality representation of PDB data. In 1990, the International Union of Crystallography (IUCr) extended to macromolecules the data representation used to describe crystal structures of molecules of low molecular weight. This representation is called CIF, for Crystallographic Information File. The mmCIF dictionary (macromolecular Crystallographic Information File), published in 1996, was then developed.

In the mmCIF format, each field of each section of a pdb file is represented by the description of a characteristic of an object, which includes both the name of the characteristic (e.g. \_struct.entry\_id) and the content of the description (e.g. the pdb code 1cbn); this is what we may call a "name-value" pair. It is easy to convert an mmCIF file to pdb format without loss of information, since all the information can be analyzed directly. It is not possible, however, to completely automate the conversion of a pdb file to mmCIF format, since many mmCIF descriptors are either absent from the pdb file or buried in its "REMARK" fields, which cannot always be parsed. The content of the "REMARK" fields is indeed spread over different mmCIF dictionary entries, in order to preserve the completeness of the information contained, for example, in the Materials and Methods section (crystal characteristics, refinement method, ...) or in the description of the biologically active molecule and of other molecules (substrate, inhibitor, ...).

The mmCIF dictionary contains over 1,700 entries, which are of course not all used in a single PDB file. All field names are preceded by the underscore character (\_) to differentiate them from the values themselves. Each name corresponds to an mmCIF dictionary entry in which the characteristics of the object are exactly defined.
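A minimal sketch of this name-value principle follows, assuming plain single-line entries. Real mmCIF files also contain loop\_ constructs and multi-line values, which this toy parser deliberately ignores; the class name is ours:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration of the mmCIF "name-value" principle: each line pairs
// a field name starting with '_' with its value. Not a full mmCIF parser.
public class MmCifSketch {
    public static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            line = line.trim();
            if (line.startsWith("_")) {
                // Split the field name from the value at the first whitespace run.
                String[] parts = line.split("\\s+", 2);
                if (parts.length == 2) fields.put(parts[0], parts[1].trim());
            }
        }
        return fields;
    }
}
```

For instance, parsing the line `_struct.entry_id   1cbn` yields the pair (\_struct.entry\_id, 1cbn) described in the text.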

**PDBML format**: this format is the adaptation of PDB data to the XML format; it contains the entries described in the "PDB Exchange Dictionary". This dictionary holds the same entries as the mmCIF dictionary, so as to take into account all data managed and distributed by the PDB. This format can store much more information about the models than the pdb format.

**Data retrieval**: the files describing molecular models can be downloaded from the PDB website and visualized with various software packages such as Rasmol [15], Jmol [16], Chime [17] or a VRML browser extension (plugin) [18]. The PDB website also offers teaching resources, material on structural genomics and other useful software.

#### **5.4. The global schema**

By studying and exploring the previous sources and by combining data from genome sources, we identified all the data that define the dictionary related to familial hypercholesterolemia (Table 3). From this data dictionary and from business rules (as defined and established by experts in the field of biology), we extracted the major biological entities useful for our study. These entities are not independent: they form a semantic graph whose arcs reflect the relationships between them.


**Table 3.** Data Dictionaries


From the global schema, it is possible to formulate our query and submit it to the mediator in SQL for processing. For example, the query that gives the protein associated with the mutated gene and the publications on familial hypercholesterolemia is expressed in SQL as follows:

```
Select nom_proteine, journal_biblio, auteur_biblio, date_biblio, langue_biblio
From mRNA a, gene g, bibliography b
Where a.nom_gene = g.nom_gene
And g.nom_gene in (select nom_gene from recepteurs r, mutation m
                   Where r.nom_recepteur = m.nom_recepteur)
```
For its execution, the query is first submitted to the mediator, which is responsible for locating the sources and querying them through the associated wrapper or adapter. It should be noted that the only access point for interrogating our sources is a web form which, once processed through a wrapper, gives us the local sources described below.

**Figure 7.** Global schema. The entities and attributes it defines are:

- Gênes: nom\_gene <pk>, longeur\_gene, type\_gene, sequence\_gene
- ARNm: nom\_ARNm <pk>, nom\_proteine <fk1>, nom\_gene <fk2>, longeur\_ARNm, type\_ARNm, sequence\_ARNm
- Protéines: nom\_proteine <pk>, longeur\_proteine, type\_proteine, sequence\_proteine
- StructureProtéines: titre\_structure <pk>, resume\_structure, nom\_molecule, nom\_auteur, date\_depot, date\_release, derniere\_release, experimental\_methode, molecule\_chaine\_type, classification, compound, resolution
- Récepteurs: nom\_recepteur <pk>, nom\_gene <fk>, type\_recepteur
- Mutations: nom\_mutation <pk>, code\_maladie <fk1>, nom\_recepteur <fk2>, classe\_mutation
- Hypercholestérolémie: code\_maladie <pk>
- Bibliographie: code\_biblio <pk>, auteur\_biblio, date\_biblio, volume\_biblio, langue\_biblio, contribution\_biblio, journal\_biblio, revue\_scient\_biblio, livre\_biblio, cd\_proceeding\_biblio
- dispose (association between Protéines and StructureProtéines): nom\_proteine <pk,fk1>, titre\_structure <pk,fk2>
- décrits-dans (association between Hypercholestérolémie and Bibliographie): code\_maladie <pk,fk1>, code\_biblio <pk,fk2>

All attributes are of type varchar. The relationships are carried by the foreign keys FK\_ARNM\_TRANSCRIT\_GENES, FK\_ARNM\_TRADUIT\_PROTEINE, FK\_RECEPTEU\_EST\_GENES, FK\_MUTATION\_CONCERNE\_RECEPTEU, FK\_MUTATION\_CAUSER\_HYPERCHO, FK\_DISPOSE\_DISPOSE\_PROTEINE, FK\_DISPOSE\_DISPOSE2\_STRUCTUR, FK\_DECRITS\_\_DECRITS\_D\_HYPERCHO and FK\_DECRITS\_\_DECRITS\_D\_BIBLIOGR.

#### **5.5. Analysis of the query**

In the global query, six attributes are involved: nom\_proteine, nom\_gene, journal\_biblio, auteur\_biblio, date\_biblio and langue\_biblio, shown in the following table along with their sources, PubMed (S1) and PDB (S2).


**Table 4.** Identification of sources

From these sources, we can extract a local schema, for example:

S1\_L (journal\_biblio, auteur\_biblio, date\_biblio, langue\_biblio, nom\_gene),

S2\_L (nom\_proteine, nom\_gene)

From a programming point of view, S1\_L and S2\_L represent wrapper sources.
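As a hypothetical sketch (the class and method names are ours, not the chapter's actual code), the mediator's recombination step can be seen as a join of the two wrapper relations S1\_L and S2\_L on their shared attribute nom\_gene:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative join of the two local schemas: S1_L rows (from the PubMed
// wrapper) and S2_L rows (from the PDB wrapper) are merged whenever they
// agree on the shared attribute nom_gene.
public class MediatorJoin {
    // Each row is represented as a Map from attribute name to value.
    public static List<Map<String, String>> join(List<Map<String, String>> s1,
                                                 List<Map<String, String>> s2) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> r1 : s1)
            for (Map<String, String> r2 : s2)
                if (r1.get("nom_gene").equals(r2.get("nom_gene"))) {
                    Map<String, String> merged = new HashMap<>(r1);
                    merged.putAll(r2);  // adds nom_proteine from S2_L
                    result.add(merged);
                }
        return result;
    }
}
```

A nested-loop join is enough here because the wrapper tables are small; a production mediator would of course push selections to the sources first.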

The next section describes the realization, in which we develop the wrappers, submit queries to the mediator and combine the final result to be presented to the user.

### **6. Realization**

We define four steps in the realization:

1. development of the wrappers;
2. definition of the global schema;
3. matching between the local tables and the global schema;
4. data analysis.
#### **6.1. Step 1: Development of wrappers**

A wrapper is a program that envelops the execution of another program so as to give it a more suitable environment. The mediator queries the various databases via wrappers that extract information from the websites of interest. A wrapper must be created for each specific database.

The sources that we identified earlier are integrated through wrappers. There are different types of wrappers depending on the type of pages they process: the pages can be either text files or XML (Extensible Markup Language) files. It is necessary to know the structure of these files and where the information is located (after a given tag, for example). Developing the wrappers is therefore tied to the functional specifications of the sources presented earlier.

Wrappers therefore allow the extracted data to be represented in tables. Indeed, we declare the objects and their attributes for each site based on the data it provides. From all this information, local (relational) schemas are established for each of these databases.

Various programs were written in Java. Even though a wrapper was created for each database, they all share the same main structure. To fill in the fields of the tables, the wrapper accesses the Web page to integrate and looks for keywords behind which lies the value to extract. Wrappers are of two types depending on the format of the sources: text wrappers or XML wrappers (Figure 8).

**Figure 8.** Presentation of a wrapper
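The keyword-based extraction just described can be sketched as follows. The class name and the sample page are ours, and we assume, for illustration, that the value to extract runs from the keyword to the end of the line:

```java
// Simplified illustration of the text-wrapper principle: scan a fetched
// page for a known keyword and return the value that follows it on the
// same line. Real wrappers would fetch the page over HTTP first.
public class TextWrapper {
    public static String extractAfter(String page, String keyword) {
        int i = page.indexOf(keyword);
        if (i < 0) return null;                  // keyword absent
        int start = i + keyword.length();
        int end = page.indexOf('\n', start);     // value runs to end of line
        if (end < 0) end = page.length();
        return page.substring(start, end).trim();
    }
}
```

One such extraction rule is written per field of the local table, which is why every wrapper shares the same main structure while differing only in its keywords.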

Finally, a program that generates and initializes all the wrappers (giving them their starting values) is created to coordinate everything. This program, also written in Java, integrates all the wrappers and their relationships. We thus obtain a set of local (provisional) tables filled by the wrappers.

#### **6.2. Step 2: Definition of the "global schema"**

It is therefore necessary to build the global schema, which will be the only interface for the user. Indeed, the user does not need to know how the data are integrated. The global schema is a set of relational tables defined from the local tables. This schema was introduced in the previous section.

#### **6.3. Step 3: Matches between local tables and global schema**


Now that the various local tables have been filled by the wrappers and the global schema has been established, we must define the correspondence rules between them in order to populate the global schema with the information extracted into the local schemas. The difficulty is that several sources may correspond to one business table (we must then define join conditions on these tables) or, conversely, one source may feed several business tables. For this, we use the Medience Server tool.

Medience Server (Figure 9) is a complete environment that handles all matching problems (different formats, different representations of business information, dispersal of the information described in a single business table). It is a "virtual database": it does not store information but analyses the user's needs. This tool serves as the mediator, that is to say the unique interface for the user: it integrates the databases, presents the data, and also offers the possibility of viewing only the tables of interest to the user. The use of this tool goes in three steps:


**Figure 9.** Architecture of Medience


**Figure 10.** An example of the Medience interface [19]

#### **6.4. Step 4: Data analysis**

Through a platform like Medience, it is therefore possible to integrate data sources (databases, Excel files and text files) and view the results in tabular form. We can now proceed to the analysis of the results. For this, the demand in terms of mining must be defined: how can we use the data provided? Medience offers the possibility of querying the tables of the global schema in SQL. It also offers the ability to define views on these tables and keep only a particularly interesting part. Our tool can thus answer a question such as: what is the protein associated with the mutated gene responsible for familial hypercholesterolemia, and what are the related publications?
