150 Lipoproteins – Role in Health and Diseases

This chapter is structured as follows. First, we present the background of the project. Second, we focus on the problem of heterogeneity of data sources and on the biological characteristics of these sources. Third, we present the state of the art of data integration, the problems and constraints of this integration, and the various existing approaches to solve this problem. Fourth, we describe the studied scenario, its realization, and perspectives.

**2. General context**

Since the completion of human genome sequencing in April 2003, an enormous amount of genomic and proteomic data has accumulated on the web. These data are often syntactically and semantically heterogeneous and difficult to exploit.

Information about genes provides access to their corresponding proteins. In addition, all diseases are associated with alterations in the structure or function of such proteins. A good knowledge of protein structure provides insight into their function.

Bioinformatics has become an important tool for exploring genomic data, relying heavily on computer systems. It provides methods and software for storing and processing biological data. In practice, it involves acquiring and organizing data, developing software for the analysis, comparison, and modeling of these data, and analyzing the results produced by bioinformatics software to infer new biological knowledge, in collaboration with biologists. This work helps biologists search among heterogeneous and distributed data in public and/or private data sources on the web. In particular, it helps them analyze proteins by building a platform for integrating biological data. This will provide a tracking system to target specific proteins involved in a disease known as familial hypercholesterolemia, and thus to better understand the biological activity of these macromolecules.

Familial hypercholesterolemia results from mutations in the LDLR gene, which provides instructions for making a protein called the low-density lipoprotein receptor. It is a disorder of high LDL ("bad") cholesterol that is passed down through families, which means it is inherited. The disease is caused by a genetic mutation affecting certain lipoproteins. These lipoproteins (called LDL) carry about two-thirds of the cholesterol circulating in the blood; they deliver cholesterol to tissues through a recognition system between apolipoprotein B and a receptor, the LDL receptor (a lock-and-key system), which allows LDL and its cholesterol content to enter cells. When the LDL receptor (LDL-R) is defective (owing to a mutation), LDL accumulates in the blood and in artery walls, causing familial hypercholesterolemia (HF). Knowledge of these different mutations can therefore greatly facilitate molecular screening for the disease and, in turn, the choice of a proper treatment. However, to answer a query such as "What are the mutations that cause familial hypercholesterolemia (HF)?", the biologist has to carry out a tedious search in disparate and heterogeneous databases, which requires a considerable investment of time.

**3. Biological data sources**

The number of data sources and tools available to biologists on the web has grown dramatically in recent years. This large volume of data, together with the heterogeneity of the information, has produced a wide variety of access interfaces and a profound heterogeneity among sources.
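To make the problem of varied access interfaces concrete, here is a minimal Python sketch of the question raised earlier ("What are the mutations that cause familial hypercholesterolemia?") asked of two sources that expose different interfaces. Everything in it is an assumption made for illustration: the two sources, their function names, their field layouts, and the placeholder records are hypothetical and do not describe any real database or the platform built in this chapter.

```python
from dataclasses import dataclass


# Hypothetical source A: queried by gene symbol, returns dictionaries.
def query_source_a(gene_symbol: str) -> list[dict]:
    data = {"LDLR": [{"variant_id": "placeholder-1", "gene": "LDLR"}]}
    return data.get(gene_symbol, [])


# Hypothetical source B: queried by disease name, returns positional tuples.
def query_source_b(disease: str) -> list[tuple]:
    rows = [("placeholder-2", "LDLR", "familial hypercholesterolemia")]
    return [row for row in rows if row[2] == disease]


@dataclass(frozen=True)
class Mutation:
    """Common representation shared by all wrapped sources."""
    identifier: str
    gene: str
    source: str


def find_mutations(gene: str, disease: str) -> list[Mutation]:
    """Ask every source in its own way, then map the answers to one schema."""
    results = [
        Mutation(rec["variant_id"], rec["gene"], "A")
        for rec in query_source_a(gene)
    ]
    results += [
        Mutation(row[0], row[1], "B")
        for row in query_source_b(disease)
    ]
    return results


mutations = find_mutations("LDLR", "familial hypercholesterolemia")
for mutation in mutations:
    print(mutation)
```

The point of the sketch is only that each source needs its own wrapper before the answers can be compared; a real integration platform would also have to reconcile identifiers and vocabularies.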

#### **3.1. Genomic databases**

There are two types of databanks: those that correspond to a heterogeneous set of data, the so-called "databases", and those, more homogeneous, that are established around a specific theme.

To avoid any semantic confusion, we distinguish between general [2] and specialized [3] databases.

Many specific databases were created in laboratories to meet requirements related to the activity of a group or to bibliographic compilations. Some of these databases have been developed continuously; others were not updated and disappeared, since they served only a specific need. Still others are unknown or poorly known and await further investigation.

These specialized databases vary considerably in size from one to another. In most cases, they are used in combination with general databases such as Swiss-Prot, GenBank, DDBJ (DNA Data Bank of Japan), and EMBL (European Molecular Biology Laboratory), which are consulted very often. It is important to know that the banks consulted are not necessarily the same from one field of activity or area of genomics research to another. The genomic libraries contain various information that may include:


Table 1 and Table 2 give examples of genomic databases and of protein databases, respectively.

| Designation | Location | Roles | Comments | Web sites and references |
| --- | --- | --- | --- | --- |
| EMBL | European Bioinformatics Institute (EBI), Europe | Nucleic bases | More than 1 million records (January 1998) for more than 15,500 species. The predominant species: Homo sapiens, Caenorhabditis elegans, Saccharomyces cerevisiae ... Information search tools: SRS (Sequence Retrieval System) and, via a web interface at EBI, the BLAST and FASTA software. | Accessible via the web site: http://www.ebi.ac.uk/ebi\_docs/embl\_db/ebi/topembl.html |
| SGD (Saccharomyces Genome Database) | | Specific genomic resources | Online resources on the molecular biology and genetics of S. cerevisiae. Numerous online search help functions. | Accessible via the web site: http://genome-www.stanford.edu/Saccharomyces/. Works of Cherry et al., 1998. |

**Table 1.** Genomic databases

| Designation | Location | Roles | Comments | Web sites and references |
| --- | --- | --- | --- | --- |
| PIR (Protein Information Resource) | National Biomedical Research Foundation | Primary database. Collects sequences to detect evolutionary relationships between proteins. | The current structure includes 4 compartments: PIR1, PIR2, PIR3 and PIR4. | Accessible via the web site: http://nbrfa.georgetown.edu/pir/ |
| NRDB (Non-redundant Database) | National Center for Biotechnology Information, USA | Composite database | Composed of GenPept (derived from GenBank), PDB sequences, SWISS-PROT, SPupdate, PIR, and GenPeptupdate (update of GenPept). NRDB is the default database of BLAST and the NCBI service. | Accessible via the web site: http://www.ncbi.nlm.nih.gov/Web/NRDB/ |

**Table 2.** Protein databases

#### **3.2. Characteristics of biological data sources**

The diversity of distributed information sources and their heterogeneity are among the main problems that web users face. This heterogeneity may stem from the size or structure of the sources (structured sources such as relational databases, partially structured sources such as XML documents, or unstructured sources such as texts), from the access and query modes, or from semantic differences between concept maps and between implicit or explicit underlying ontologies.
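The three structural kinds of sources mentioned above can be illustrated with a short Python sketch that reads the same fact (the designation and full name of the PIR database) from a relational table, an XML document, and a sentence of free text, and maps each to one common record. The table layout, the XML element names, the sample sentence, and the function names are all assumptions invented for this illustration; only standard-library modules are used.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET


def from_relational(conn: sqlite3.Connection, designation: str) -> dict:
    """Structured source: read one row from a relational table."""
    row = conn.execute(
        "SELECT designation, full_name FROM source WHERE designation = ?",
        (designation,),
    ).fetchone()
    return {"designation": row[0], "name": row[1]}


def from_xml(document: str) -> dict:
    """Partially structured source: read an attribute and a child element."""
    node = ET.fromstring(document)
    return {"designation": node.get("id"), "name": node.findtext("title")}


def from_text(sentence: str) -> dict:
    """Unstructured source: recover 'Full Name (ABBREVIATION)' by pattern."""
    match = re.search(r"([A-Z][\w ]+?) \(([A-Z]+)\)", sentence)
    return {"designation": match.group(2), "name": match.group(1)}


# Hypothetical sample data, one per kind of source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (designation TEXT, full_name TEXT)")
conn.execute("INSERT INTO source VALUES ('PIR', 'Protein Information Resource')")

xml_doc = '<database id="PIR"><title>Protein Information Resource</title></database>'
text = "sequences are collected by the Protein Information Resource (PIR)"

records = [from_relational(conn, "PIR"), from_xml(xml_doc), from_text(text)]
for record in records:
    print(record)
```

Each source needs a different extraction technique (a query, a tree traversal, a text pattern) even though the resulting record is identical, and the text pattern is by far the most fragile of the three.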

Biological sources are highly heterogeneous at several levels:

