**Meet the editor**

Dr. Mahmood A. Mahdavi is a biotechnologist with a strong background in computational biology and bioinformatics. He is currently affiliated with the Department of Chemical Engineering at Ferdowsi University of Mashhad, Iran. His current research includes protein-protein interaction prediction, gene expression analysis and protein sequence analysis. His research lab also works on innovative biological energy sources based on renewable resources. His bioinformatics background provides theoretical support for his wet-lab work and lets his students experience experimental and theoretical molecular biology at the same time.

### Contents

**Preface**

**Part 1 Bioinformatics in Biology**

- Chapter 1 **Concepts, Historical Milestones and the Central Place of Bioinformatics in Modern Biology: A European Perspective**
  T.K. Attwood, A. Gisel, N-E. Eriksson and E. Bongcam-Rudloff

**Part 2 Data Integration**

- Chapter 2 **Data Integration in Bioinformatics: Current Efforts and Challenges**
  Zhang Zhang, Vladimir B. Bajic, Jun Yu, Kei-Hoi Cheung and Jeffrey P. Townsend
- Chapter 3 **Semantic Data Integration on Biomedical Data Using Semantic Web Technologies**
  Roland Kienast and Christian Baumgartner

**Part 3 Data Mining and Applications**

- Chapter 4 **Vector Space Information Retrieval Techniques for Bioinformatics Data Mining**
  Eric Sakk and Iyanuoluwa E. Odebode
- Chapter 5 **Massively Parallelized DNA Motif Search on FPGA**
  Yasmeen Farouk, Tarek ElDeeb and Hossam Faheem
- Chapter 6 **A Pattern Search Method for Discovering Conserved Motifs in Bioactive Peptide Families**
  Feng Liu, Liliane Schoofs, Geert Baggerman, Geert Wets and Marleen Lindemans
- Chapter 7 **Database Mining: Defining the Pathogenesis of Inflammatory and Immunological Diseases**
  Fan Yang, Irene Hwa Yang, Hong Wang and Xiao-Feng Yang
- Chapter 8 **Data Mining Pubmed Identifies Core Signalings and miRNA Regulatory Module in Glioma**
  Chunsheng Kang, Junxia Zhang, Yingyi Wang, Ning Liu, Jilong Liu, Huazong Zeng, Tao Jiang, Yongping You and Peiyu Pu

**Part 4 Sequence Analysis and Evolution**

- Chapter 9 **Significance Score of Motifs in Biological Sequences**
  Grégory Nuel
- Chapter 10 **A Systematic and Thorough Search for Domains of the Scavenger Receptor Cysteine-Rich Group-B Family in the Human Genome**
  Alexandre M. Carmo and Vattipally B. Sreenu
- Chapter 11 **Assessing Multiple Sequence Alignments Using Visual Tools**
  Catherine L. Anderson, Cory L. Strope and Etsuko N. Moriyama
- Chapter 12 **Optimal Sequence Alignment and Its Relationship with Phylogeny**
  Atoosa Ghahremani and Mahmood A. Mahdavi
- Chapter 13 **Predicting Virus Evolution**
  Tom Burr

**Part 5 Protein Structure Analysis**

- Chapter 14 **A Bioinformatical Approach to Study the Endosomal Sorting Complex Required for Transport (ESCRT) Machinery in Protozoan Parasites: The** *Entamoeba histolytica* **Case**
  Israel López-Reyes, Cecilia Bañuelos, Abigail Betanzos and Esther Orozco
- Chapter 15 **Structural Bioinformatics Analysis of Acid Alpha-Glucosidase Mutants with Pharmacological Chaperones**
  Sheau Ling Ho
- Chapter 16 **Bioinformatics Domain Structure Prediction and Homology Modeling of Human Ryanodine Receptor 2**
  V. Bauerová-Hlinková, J. Bauer, E. Hostinová, J. Gašperík, K. Beck, Ľ. Borko, A. Faltínová, A. Zahradníková and J. Ševčík
- Chapter 17 **Identifying Enzyme Knockout Strategies on Multiple Enzyme Associations**
  Bin Song, I. Esra Büyüktahtakın, Nirmalya Bandyopadhyay, Sanjay Ranka and Tamer Kahveci

**Part 6 Genome Analysis**

- Chapter 18 **Using Bacterial Artificial Chromosomes to Refine Genome Assemblies and to Build Virtual Genomes**
  Abhirami Ratnakumar, Wesley Barris, Sean McWilliam and Brian P. Dalrymple
- Chapter 19 **Basidiomycetes Telomeres – A Bioinformatics Approach**
  Lucía Ramírez, Gúmer Pérez, Raúl Castanera, Francisco Santoyo and Antonio G. Pisabarro
- Chapter 20 **SNPpattern: A Genetic Tool to Derive Haplotype Blocks and Measure Genomic Diversity in Populations Using SNP Genotypes**
  Stephen J. Goodswen and Haja N. Kadarmideen
- Chapter 21 **Algorithms for CpG Islands Search: New Advantages and Old Problems**
  Yulia A. Medvedeva
- Chapter 22 **Translational Oncogenomics and Human Cancer Interactomics: Advanced Techniques and Complex System Dynamic Approaches**
  I. C. Baianu

**Part 7 Transcriptional Analysis**

- Chapter 23 *In-silico* **Approaches for RNAi Post-Transcriptional Gene Regulation: Optimizing siRNA Design and Selection**
  Mahmoud ElHefnawi and Mohamed Mysara
- Chapter 24 **MicroRNA Targeting in Heart: A Theoretical Analysis**
  Zhiguo Wang
- Chapter 25 **Genome-Wide Identification of Estrogen Receptor Alpha Regulated miRNAs Using Transcription Factor Binding Data**
  Jianzhen Xu, Xi Zhou and Chi-Wai Wong

**Part 8 Gene Expression and Systems Biology**

- Chapter 26 **Quantification of Gene Expression Based on Microarray Experiment**
  Samane F. Farsani and Mahmood A. Mahdavi
- Chapter 27 **On-Chip Living-Cell Microarrays for Network Biology**
  Ronnie Willaert and Hichem Sahli

**Part 9 Next Generation Sequencing**

**Part 10 Drug Design**

### Preface

Bioinformatics is a growing multidisciplinary field of science comprising biology, computer science, and mathematics. It is the theoretical and computational arm of modern biology. In other words, bioinformatics is a tool in the hands of biologists for analyzing the huge amounts of biological data available in mainstream public databases. Bioinformatics has now gained a variety of applications in agriculture, medicine, engineering, and natural science. This book discusses a small portion of these applications along with basic concepts and fundamental techniques in bioinformatics.

The first section reviews the history of bioinformatics and the pace of its development in modern biology, specifically in Europe. Sections 2 and 3 focus on the fundamental principles of data integration and data mining as basic skills in bioinformatics. Data integration is now perceived as a requirement in biology as the volume of biological data continues to grow. Section 2 provides an overview of the integration of biomedical data using semantic web technologies, together with current efforts and challenges. Data mining is another basic tool, used to search databases for conserved regions, motifs, and regulatory modules involved in a variety of diseases. Section 3 discusses these applications and basic approaches in data mining such as vector space information retrieval. Section 4 concentrates on another aspect of bioinformatics, sequence analysis. Sequences are analyzed to find the distribution of motifs and to search for domains. The basic tool for this analysis is sequence alignment, which is discussed in detail in this section. Section 5 contains chapters on the analysis of specific protein structures such as the endosomal sorting complex, chaperones, and human receptors. These structures are involved in different metabolic activities within the cell.

Section 6 covers chapters that discuss the role of bioinformatics in genomic studies. It describes applications of computational techniques in the analysis of genomes, such as SNP patterns, CpG islands, and virtual genomes. Section 7 focuses on the regulatory machinery and the role of microRNAs in this system. MicroRNAs have recently been found to be important in regulatory networks, and the chapters in this section discuss some applications. Gene expression and system-level understanding of the expression process is one of the most interesting topics in bioinformatics. Section 8 presents fundamental principles for identifying differentially expressed genes from microarray data. The chapters in this section are suitable for those who seek basic information on gene expression and the integration of this information into biological systems. Section 9 contains more advanced topics in bioinformatics, including next generation sequencing. In this section the authors discuss more recent advances and technologies used in deep sequencing. The last section describes one of the growing practical applications of bioinformatics, namely drug design. The ultimate goal of all theoretical analysis of biological data ought to be a product that improves human lives. This section discusses one of thousands of efforts to design a new drug for cancer treatment by means of bioinformatics.

Therefore, this book targets two types of readers: those who are new to bioinformatics and are interested in basic methods and fundamental principles and those who seek new approaches in bioinformatics. Both parties will benefit from studying this book.

In closing, I wish to express my sincere gratitude to all contributing authors, to the publishing process manager, Petra Nenadic, and to the publishing staff.

> **Mahmood A. Mahdavi**
> Ferdowsi University of Mashhad (FUM), Mashhad, Iran

**Part 1** 

**Bioinformatics in Biology** 


### **Concepts, Historical Milestones and the Central Place of Bioinformatics in Modern Biology: A European Perspective**

T.K. Attwood¹, A. Gisel², N-E. Eriksson³ and E. Bongcam-Rudloff⁴

*¹Faculty of Life Sciences & School of Computer Science, University of Manchester, UK*
*²Institute for Biomedical Technologies, CNR, Italy*
*³Uppsala Biomedical Centre (BMC), University of Uppsala, Sweden*
*⁴Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Sweden*

#### **1. Introduction**

The origins of bioinformatics, both as a term and as a discipline, are difficult to pinpoint. The expression was used as early as 1977 by Dutch theoretical biologist Paulien Hogeweg when she described her main field of research as bioinformatics, and established a bioinformatics group at the University of Utrecht (Hogeweg, 1978; Hogeweg & Hesper, 1978). Nevertheless, the term had little traction in the community for at least another decade. In Europe, the turning point seems to have been *circa* 1990, with the planning of the "*Bioinformatics in the 90s*" conference, which was held in Maastricht in 1991. At this time, the National Center for Biotechnology Information (NCBI) had been newly established in the United States of America (USA) (Benson *et al*., 1990). Despite this, there was still a sense that the nation lacked a "*long-term biology 'informatics' strategy*", particularly regarding postdoctoral interdisciplinary training in computer science and molecular biology (Smith, 1990). Interestingly, Smith spoke here of 'biology informatics', not bioinformatics; and the NCBI was a 'center for biotechnology information', not a bioinformatics centre.

The discipline itself ultimately grew organically from the needs of researchers to access and analyse (primarily biomedical) data, which appeared to be accumulating at alarming rates simultaneously in different parts of the world. The rapid collection of data was a direct consequence of a series of enormous technological leaps that yielded what was considered, at the time, unprecedented quantities of biological *sequence* information. Hot on the heels of these developments was the concomitant wide-scale blossoming of algorithms and computational resources necessary to analyse, manipulate and store these growing quantities of data. Together, these advances gave birth to the field we now refer to as bioinformatics.

When we look back, it's clear that certain concepts and historical milestones were crucial to the evolution of this new field. Those we think most important, and consequently remember, depend largely on the perspective from which we view the emerging bioinformatics landscape. This chapter takes a largely European standpoint, while recognising that the development of bioinformatics in Europe was intimately coupled with parallel advances elsewhere in the world, and especially in the USA. The history is intricate. Here, we endeavour to recount the story as it unfolded along a number of tightly interwoven paths, including the rise and spread of some of the technological developments that spawned the data deluge and facilitated its world-wide propagation; of some of the databases that developed in order to store the rapidly accumulating data; and of some of the organisations and infrastructural initiatives that emerged to try to put some of those pivotal databases on a more solid financial footing.

#### **2. The seeds of bioinformatics**

It is hard to pinpoint where and when the seeds of bioinformatics were originally sown. Does the story start with Franklin and Gosling's foundational work towards the elucidation of the structure of DNA (Franklin & Gosling, 1953a, b, c), or with the opportunistic interpretation of their data by Watson and Crick (Watson & Crick, 1953)? Do we fast-forward to the ground-breaking work of Kendrew *et al*. (1958) and of Muirhead & Perutz (1963) in determining the first three-dimensional (3D) structures of proteins? Or do we step back, and focus on the painstaking work of Sanger, who, in 1955, determined the amino acid sequence of the first peptide hormone? Or again, do we jump ahead to the progenitors of the first databases of macromolecular structures and sequences in the mid-1960s and early '70s? This era clearly heralded some of the most significant advances in molecular biology, as witnessed by a string of Nobel Prizes at the time: *e.g.*, Sanger's Prize in Chemistry in 1958; Watson, Crick and Wilkins' shared Prize in Physiology or Medicine in 1962, following Franklin's death; and Perutz and Kendrew's Prize in Chemistry, also in 1962. Clearly, in its own way, each of these advances played an important part in the emergence of the vibrant new field that we recognise today as 'bioinformatics'.

As a humbling reference point, we have chosen to begin our story in the mid-1940s, with Fred Sanger's pioneering work on insulin. Sanger used a range of chemical and enzymatic techniques to elucidate, for the first time, the order of amino acids in the primary structure of a protein. Back then, this was a tremendously complex puzzle to tackle, and its completion required the successful resolution of many different challenges over several years. That this was a difficult incremental process is illustrated by the fact that, between 1945 and 1955, each step was published in a separate, stand-alone article. All in all, something like 10 papers detail the series of experiments that led to the eventual determination of the sequences of bovine insulin (*e.g.*, Sanger, 1945; Sanger & Tuppy, 1951a, b; Sanger & Thompson, 1953a, b; Sanger *et al*., 1955; Ryle *et al*., 1955) and of ovine and porcine insulins (Brown *et al*., 1955). This was ground-breaking work, and had taken 10 years to complete. Incredibly, the 3D structure would not be known for another 14 years (Adams *et al*., 1969). The primary and tertiary structures of this historical protein are illustrated in Figure 1.

Fig. 1. Illustration of a) the primary structure of bovine insulin, showing intra- and interchain disulphide bonds connecting the a and b chains; and b) its zinc-coordinated tertiary structure (2INS), revealing two molecules in the asymmetric unit, and a hexameric biological assembly.

Such was the enormity of manual sequencing projects that it was many years before the sequence of the first enzyme (ribonuclease) was determined. Work on this protein began in 1955. After preliminary studies in 1957 and 1958, the first full 'draft sequence' was published in 1960 (Hirs *et al*., 1960). During the months that followed, the draft was meticulously refined, and a final version was published 3 years later (Smyth *et al*., 1963). Crucially, this 8-year project paved the way for the elucidation of the protein's 3D structure – indeed, without the sequence information, the electron density maps could not have been meaningfully interpreted (Wyckoff *et al*., 1967). Knowledge of the primary structure of this small protein thus provided a vital piece of a 3D jigsaw puzzle that was to take a further 4 years to solve. Viewed in the light of the high-throughput sequence and structure determinations of today, these prolonged time-scales now seem almost inconceivable. Notwithstanding the challenges, however, the potential of peptide sequencing technology to aid our understanding of the biochemical functions and evolutionary histories of particular proteins, and to facilitate their structural analysis, was compelling. Consequently, the sequences of many other proteins were soon deduced. In the early '60s, amongst the first to appreciate the value of biological sequences, and particularly the ability to deduce evolutionary relationships from them, was Margaret Dayhoff. To facilitate her research and the work of others in the field, she began to collect all protein sequences then available, ultimately publishing them in book form – this was the first *Atlas of Protein Sequence and Structure* (Dayhoff *et al*., 1965), often simply referred to as the *Atlas*. It may seem amusing to us now, but in a letter she wrote in 1967, she observed, "*There is a tremendous amount of information regarding the evolutionary history and biochemical function implicit in each sequence and the number of known sequences is growing explosively* [our emphasis]. *We feel it is important to collect this significant information, correlate it into a unified* whole *and interpret it*" (Dayhoff, 1967; Strasser, 2008). With the publication of the first *Atlas*, that 'explosive growth' amounted to 65 sequences!

Concepts, Historical Milestones and

was not yet listed amongst its holdings.

Table 1. PDB holdings, August 1973.

2008; Hobohm *et al*., 1992).

became established as a major EBI resource.

**Protein structures** 

4 Subtilisin BPN (Novo) 5 Tosyl α-chymotrypsin 6 Bovine carboxypeptidase Aα 7 L-Lactate dehydrogenase

3 Basic pancreatic trypsin inhibitor

2 Cytochrome b5

8 Myoglobin 9 Rubredoxin

the Central Place of Bioinformatics in Modern Biology: A European Perspective 7

By 1973, the PDB was fully operational (Protein Data Bank, 1973). In August that year, the body of data it had been established to store amounted to 9 structures (see Table 1). Kennard and co-workers knew that the success of the resource was ultimately dependent on the support of the crystallography community in providing their data; but gaining sufficient community momentum to back the initiative was clearly a long, drawn-out process: note, for example, that the structure of ribonuclease, which had been determined 6 years earlier,

1 Cyanide methaemoglobin V from sea lamprey

Over the next 4 years, the number of structures acquired by the PDB grew slowly. By 1977, the archive also included the structure of a transfer RNA (tRNA), and hence the name *Protein Data Bank* was thought something of a misnomer (Bernstein *et al*., 1977). Nevertheless, despite this reservation, the name stuck, and the resource (which today includes more than 5,000 nucleic acid and protein-nucleic acid complexes) is still referred to as the PDB. Interestingly, at that time, the database contained 77 sets of atomic coordinates relating to 47 macromolecules, highlighting a significant level of redundancy. Coupled with their ongoing concerns about the pace of growth of the archive, perhaps this explains why the Berstein *et al.* paper was published verbatim in May and November of 1977, and again in January 1978, in three different journals (Bernstein *et al*., 1977a, b; 1978)? Whatever the real reasons, growth of the PDB compared to the CSD (~6,000 vs. ~150,000 structures in 1996) was slow (Kennard, 1997), and the number of unique structures remained relatively small – by 1992, the level of redundancy in the resource had been calculated to be ~7-fold (Berman,

In 1996, shortly after the establishment of the European Bioinformatics Institute (EBI) near Cambridge, UK, a new database of macromolecular structures was created – this was the E-MSD (Boutselakis *et al*., 2003). Building directly on PDB data, E-MSD was originally conceived as a pilot study to explore the feasibility of exploiting relational database technologies to manage structural data more effectively. In the end, the pilot project led to the creation of a database that was successful in its own right, and the E-MSD thereby

During this period, a concerted effort was made to hasten the pace of knowledge acquisition from structural studies. Part of the motivation was to build on the still-limited number of structures available in the PDB, and partly also to address its growing level of redundancy. The idea was to establish a program of high-throughput X-ray crystallography – the socalled Structural Genomics Initiative (SGI) (Burley *et al*., 1999). Several feasibility studies had

In the decade that followed, time-consuming manual processes were gradually superseded with the advent of automated peptide sequencers, which increased the rate of sequence determination considerably. Meanwhile, another revolution was taking place, heralded by the elucidation of the 3D structures of the first proteins, those of myoglobin and haemoglobin, respectively (Kendrew *et al*., 1958; Muirhead and Perutz, 1963). Building on the ongoing sequencing work, this advance set the scene for an exciting new era in which structure determination took centre stage in our quest to understand the biophysical mechanisms that underpin biochemical and evolutionary processes. In fact, so seductive was this approach that many more structural studies were initiated, and the numbers of deduced protein structures grew accordingly.

#### **3. The development and spread of databases, organisations and infrastructures**

Key to handling this burgeoning information was the recruitment of computers to help systematically analyse and store the accumulating sequence and structure data. At this time, the idea that molecular information could be collected within, and distributed from, electronic repositories was not only very new but also posed significant challenges. Just consider, for a moment, that concepts we take for granted today (email, the Internet, the World Wide Web) had not yet emerged; there was therefore no easy way to distribute data from a central database, other than by posting computer tapes and disks to individual users, at their request. This model of data distribution was clearly rather cumbersome and slow; it was also relatively costly, and led some of the first database pioneers to adopt pricing and/or data-sharing policies that threatened to drive away many of their potential users.

#### **3.1 The Protein Data Bank (PDB)**

One of the earliest, and hence now oldest, of scientific databases was established in 1965 at the Cambridge Crystallographic Data Centre (CCDC), under the direction of Olga Kennard (Kennard *et al*., 1972; Allen *et al*., 1991) – this was a repository of small-molecule crystal structures termed the Cambridge Structural Database, or CSD. The CSD, which originated as a traditional printed dissemination, ultimately assumed an electronic form so that Kennard could fulfill a dream, which she shared with J.D.Bernal, to be able to use data collections to discover new knowledge, above and beyond the results yielded by individual experiments (Kennard, 1997).

In 1971, a few years after the creation of the CSD, at a Cold Spring Harbor Symposium on the "*Structure and Function of Proteins at the Three Dimensional Level*", Walter Hamilton and colleagues discussed the possibility of creating a similar kind of 'bank' for protein coordinate data. Key to their proposal was that this archive should be mirrored at sites in the UK and the USA (Berman, 2008). Consequently, Hamilton volunteered to set up the 'master copy' of the American bank at the Brookhaven National Laboratory (BNL), while Kennard subsequently agreed to host the European copy and to extend the CCDC small molecule format to accommodate protein structural data (Kennard *et al.*, 1972; Meyer, 1997). Thus was born the Protein Data Bank (PDB); this was to be operated jointly by the CCDC and BNL, and where possible, distributed on magnetic tape in machine-readable form. News of its establishment was announced in a short bulletin in October that year (Protein Data Bank, 1971); its first release held 7 structures (Berman *et al*., 2000). Interestingly, Kennard viewed the PDB as a prototype for the EMBL data library, which was to materialise a decade later (Smith, 1990).

By 1973, the PDB was fully operational (Protein Data Bank, 1973). In August that year, the body of data it had been established to store amounted to 9 structures (see Table 1). Kennard and co-workers knew that the success of the resource was ultimately dependent on the support of the crystallography community in providing their data; but gaining sufficient community momentum to back the initiative was clearly a long, drawn-out process: note, for example, that the structure of ribonuclease, which had been determined 6 years earlier, was not yet listed amongst its holdings.


Table 1. PDB holdings, August 1973.

Over the next 4 years, the number of structures acquired by the PDB grew slowly. By 1977, the archive also included the structure of a transfer RNA (tRNA), and hence the name *Protein Data Bank* was thought something of a misnomer (Bernstein *et al*., 1977). Nevertheless, despite this reservation, the name stuck, and the resource (which today includes more than 5,000 nucleic acid and protein-nucleic acid complexes) is still referred to as the PDB. Interestingly, at that time, the database contained 77 sets of atomic coordinates relating to 47 macromolecules, highlighting a significant level of redundancy. Coupled with their ongoing concerns about the pace of growth of the archive, perhaps this explains why the Bernstein *et al.* paper was published verbatim in May and November of 1977, and again in January 1978, in three different journals (Bernstein *et al*., 1977a, b; 1978)? Whatever the real reasons, growth of the PDB compared to the CSD (~6,000 vs. ~150,000 structures in 1996) was slow (Kennard, 1997), and the number of unique structures remained relatively small – by 1992, the level of redundancy in the resource had been calculated to be ~7-fold (Berman, 2008; Hobohm *et al*., 1992).
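The 1977 redundancy figure quoted above can be checked with simple arithmetic. The following is a minimal sketch in Python; the function and variable names are illustrative and not taken from any source. It treats redundancy as the average number of coordinate sets per distinct macromolecule, which is an assumption made here for illustration and may differ from the method behind the ~7-fold figure reported for 1992.

```python
# Figures quoted in the text for the PDB in 1977.
coordinate_sets_1977 = 77
macromolecules_1977 = 47


def redundancy(entries: int, unique: int) -> float:
    """Average number of entries per unique item (1.0 means no redundancy)."""
    if unique <= 0:
        raise ValueError("unique must be positive")
    return entries / unique


ratio_1977 = redundancy(coordinate_sets_1977, macromolecules_1977)
print(f"1977: about {ratio_1977:.2f} coordinate sets per macromolecule")
```

On this simple measure, the 1977 archive held fewer than two coordinate sets per macromolecule, well below the ~7-fold redundancy calculated for 1992.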

In 1996, shortly after the establishment of the European Bioinformatics Institute (EBI) near Cambridge, UK, a new database of macromolecular structures was created – this was the E-MSD (Boutselakis *et al*., 2003). Building directly on PDB data, E-MSD was originally conceived as a pilot study to explore the feasibility of exploiting relational database technologies to manage structural data more effectively. In the end, the pilot project led to the creation of a database that was successful in its own right, and the E-MSD thereby became established as a major EBI resource.

During this period, a concerted effort was made to hasten the pace of knowledge acquisition from structural studies. Part of the motivation was to build on the still-limited number of structures available in the PDB, and partly also to address its growing level of redundancy. The idea was to establish a program of high-throughput X-ray crystallography – the so-called Structural Genomics Initiative (SGI) (Burley *et al*., 1999). Several feasibility studies had already been launched and, in light of the broad-sweeping vision of the SGI, it had become clear that coping with high-throughput structure-determination pipelines would require new ways of gathering, storing, distributing and 'serving' the data to end users. One of the PDB's responses to this, and to the many challenges that lay ahead, was the formation of a new management structure. This was to be embodied in a 3-membered Research Collaboratory for Structural Bioinformatics (RCSB): the consortium included Rutgers, The State University of New Jersey; the San Diego Supercomputer Center at the University of California; and the Center for Advanced Research in Biotechnology of the National Institute of Standards and Technology (Berman *et al*., 2000; Berman *et al.*, 2003). Once the consortium was established, the BNL PDB ceased operations and the RCSB formally took the helm on 1 July, 1999.

With the RCSB PDB in the USA, the E-MSD established in Europe, and a sister resource (PDBj) subsequently announced in Japan (Nakamura *et al*., 2002), structure collection efforts had clearly taken on an international dimension. In consequence, in 2003, the 3 repositories were brought together beneath an umbrella organisation known as the worldwide Protein Data Bank (wwPDB), to streamline their activities and maintain a single, global, publicly available archive of macromolecular structural data (Berman *et al*., 2003). By 2009, perhaps to align its nomenclature in a more obvious way with its consortium partners, E-MSD was renamed PDBe (Velankar *et al*., 2009). Today, the RCSB remains the 'archive keeper', with sole write-access to the PDB, controlling its contents, and distributing new PDB identifiers to all deposition sites. In February 2011, the archive housed 71,415 structures.

#### **3.2 The EMBL nucleotide sequence data library**

Despite the advances in protein sequence- and structure-determination technologies between the mid-1940s and -'70s, sequencing nucleic acids had remained problematic. The key issues related to size and ease of molecular purification. It had proved possible to sequence tRNAs, largely because they're short (typically less than 100 nucleotides long) and individual molecules could, with some effort, be purified; but chromosomal DNA molecules are in a different league, containing many millions of nucleotides. Even if such molecules could be broken down into smaller chunks, purification was a major challenge. The longest fragment that could then be sequenced in a single experiment was ~500bp; and yields of potentially around half a million fragments per chromosome were simply beyond the technology of the day to handle.

During the mid '70s, however, Sanger had developed a technology (to become known as the 'Sanger method') that made it possible to work with much longer nucleotide fragments: this allowed completion of the sequencing of the 5,386 bases of the single-stranded bacteriophage φX174 (Sanger *et al*., 1978), subsequently permitting rapid and accurate sequencing of even longer sequences – an achievement of sufficient magnitude to earn him his second Nobel Prize in Chemistry, in 1980. With this technique, he went on to sequence human mitochondrial DNA (Anderson *et al*., 1981) and bacteriophage λ (Sanger *et al*., 1982). These were landmark achievements (see Table 2), providing the first direct evidence of the phenomenon of overlapping gene sequences and of the non-universality of the genetic code (Sanger, 1988; Dodson, 2005). But it was automation of these techniques from the mid-'80s that significantly increased productivity, and began to make the human genome a realistic target.

Together, these advances prepared the way for a new revolution, one that would rock the foundations of molecular biology and make the gathered fruits of all sequencing efforts before it appear utterly inconsequential. Here, then, was a dramatic turning point: for the first time, it dawned on scientists that the new sequencing machines were shunting the bottlenecks away from data production *per se* and onto the requirements of data management: "*the rate limiting step in the process of nucleic acid sequencing is now shifting from data acquisition towards the organization and analysis of that data*" (Gingeras & Roberts, 1980). This realisation had profound consequences in both Europe and the USA, as a centralised data bank now seemed inescapable as a tool for managing nucleic acid sequence information efficiently.

| Year | Protein | RNA | DNA | No. of residues |
|------|---------|-----|-----|-----------------|
| 1935 | Insulin | | | 1 |
| 1945 | Insulin | | | 2 |
| 1947 | Gramicidin S | | | 5 |
| 1949 | Insulin | | | 9 |
| 1955 | Insulin | | | 51 |
| 1960 | Ribonuclease | | | 120 |
| 1965 | | tRNA<sup>Ala</sup> | | 75 |
| 1967 | | 5S RNA | | 120 |
| 1968 | | | Bacteriophage λ | 12 |
| 1977 | | | Bacteriophage φX174 | 5,375 |
| 1978 | | | Bacteriophage φX174 | 5,386 |
| 1981 | | | Mitochondria | 16,569 |
| 1982 | | | Bacteriophage λ | 48,502 |
| 1984 | | | Epstein-Barr virus | 172,282 |
| 2004 | | | *Homo sapiens* | 2.85 billion |

Table 2. Sequencing landmarks.

So, the race was on to establish the first nucleotide sequence database. First past the post, in 1980, was the European Molecular Biology Laboratory (EMBL) in Heidelberg, who set up the EMBL data library. After an initial pilot period, the first release of 568 sequences was made in June 1982. The aim of this new resource was not only to make nucleic acid sequence data publicly available and encourage standardisation and free exchange of data, but also to provide a European focus for computational and biological data services (Hamm & Cameron, 1986).

From the outset, it was recognised that maintenance of such a centralised repository, and of its attendant services, would require international collaboration. In the UK, a copy of the EMBL library was being maintained at Cambridge University, together with its manual, indices and associated sequence analysis, and search and retrieval software. This integrated system also provided access to the library of sequences then being developed at Los Alamos, GenBank (Kanehisa *et al.*, 1984). It makes fascinating reading to learn that, "*this system is presently being used by over 30 researchers in eight departments in the University and in local research institutes. These users can keep in touch with each other via the MAIL command*"! With the support of the Medical Research Council (MRC), the Cambridge services were extended to the wider UK community on the Joint Academic network (JANET) (Kneale & Kennard, 1984). As with the PDB before it, it was important not only to push the data out to researchers, but also to pull their data in. Hence, a further planned development was to centralise collection of nucleic acid data from UK research groups, and to periodically transfer the information to the EMBL library. It was hoped that this would minimise both data-entry errors and the workload of EMBL staff at a time when the number of sequence determinations was predicted to "*increase greatly*" (Kneale & Kennard, 1984). Of course, the size of this 'great increase' could hardly have been predicted; in December 2010, the database contained 199,720,869 entries.

#### **3.3 GenBank**

The birth of GenBank, in December 1982, brought 606 sequences into the public domain. A consensus had emerged on the necessity of creating an international nucleic acid sequence repository at a scientific meeting at Rockefeller University in New York, in March 1979. At that time, several groups had expressed a desire to be a part of this endeavour, including those led by Dayhoff at the National Biomedical Research Foundation (NBRF); Walter Goad at Los Alamos National Laboratories; Doug Brutlag at Stanford; Olga Kennard and Fred Sanger at the MRC Laboratory in Cambridge; and Ken Murray and Hans Lehrach at the EMBL (Smith, 1990), all of whom had begun to create their own nucleotide sequence collections. However, it took the best part of 3 years for an appropriate funding model to emerge from the US National Institutes of Health (NIH), by which time the EMBL data library had already been publicly available for 6 months under the direction of Greg Hamm. By then, 3 proposals remained on the table for NIH support: 2 of these were from Los Alamos (one with Bolt, Beranek and Newman (BBN), the other with IntelliGenetics), and the third from NBRF. To the surprise of many, the decision was made in June 1982 to establish the new GenBank resource at Los Alamos (in collaboration with BBN, Inc.) rather than at the NBRF (Smith, 1990; Strasser, 2008).

Although there was a general sense of relief that a decision had finally been made, some members of the community (and doubtless Dayhoff herself) felt that the NBRF would have been a more appropriate home for GenBank, particularly given Dayhoff's successful track record as a curator of protein sequence data (Smith, 1990). Los Alamos, by contrast, although undoubtedly offering excellent computer facilities, was probably best known for its role in the creation of atomic weapons – this was not an obvious environment in which to establish the nation's first public nucleotide sequence database. The crux of the matter seemed to rest with the different philosophical approaches embodied in the NBRF and Los Alamos proposals, particularly as they related to scientific priority, data sharing/privacy and intellectual property policies. Dayhoff had intended to continue gathering sequences directly from literature sources and from bench scientists, and wasn't interested in matters of history or priority (Eck & Dayhoff, 1966); the Los Alamos team, on the other hand, advocated the collaboration of journal editors in making the publication of articles contingent on authors yielding their sequence data to the database. This latter approach was particularly compelling, as it would allow scientists to assert priority, and to keep their research results private until formally published and their provenance established; perhaps more importantly, it was unencumbered by proprietary interest in the data. Unfortunately, the fact that Dayhoff had prevented redistribution of NBRF's protein sequence library and sought revenues from its sales (albeit only to cover costs) worked against her – allowing the data to become the private hunting grounds of any one group of researchers was considered antithetical to the spirit of open access (Strasser, 2008). 
That the data and associated software tools should be free and open was thus paramount; it is perhaps ironic, then, that the site chosen for the database was within the secured area of what many in the community may have darkly perceived as 'The Atomic City' (en.wikipedia.org/wiki/The\_Atomic\_City).

As an aside, it's interesting that the vision of free data and programs was advocated so strongly at this time, not least because there was no funding model to support it! And precisely the same arguments are still being vehemently propounded today with regard to free databases, free software and free literature (*e.g*., Lathrop *et al*., 2011). But even now, database funding remains an unsolved and controversial issue: as Olga Kennard put it almost 15 years ago, "*Free access to validated and enhanced data worldwide is a beautiful dream. The reality, however, is more complex*" (Kennard, 1997).

Returning to our theme, perhaps the final nail in the coffin of Dayhoff's proposal was that the NBRF had only limited means of data distribution (via modems), whereas the Los Alamos outfit had the enormous benefit of being able to distribute their data via ARPANET, the computer network of the US Department of Defense. Together, these advantages were sufficient to swing the pendulum in favour of the Los Alamos team.

But the new GenBank did not, indeed could not, function in isolation. From its inception, it evolved in close collaboration with the EMBL data library and, from 1986 onwards, also with the DNA Data Bank of Japan. Although the databases were not identical (each with its own format, naming convention, and so on), the teams adopted common data-entry standards and data-exchange protocols in order to improve data quality and to manage both the growth of the resource and the annotation of its entries more effectively. Of this collaborative process, Temple Smith commented in 1990, "*By working out a division of labor with the EMBL and newer Japanese database efforts, and by involving the authors and journal editors, GenBank and the EMBL databases are currently keeping pace with the literature*." Today, the boot seems to be very much on the other foot, as the literature can no longer keep up with the data: by February 2011, GenBank contained 132,015,054 entries, presenting insurmountable annotation hurdles! (Note that this appears smaller than the size of the EMBL data library because GenBank doesn't report sequences from Whole Genome Shotgun projects in its total). Perhaps not surprisingly, the initial funding for GenBank was insufficient to adequately maintain this growing mass of data; hence, responsibility for its maintenance, with increased funding under a new contract, passed to IntelliGenetics in 1987; then, in 1992, it became the responsibility of the NCBI, where it remains today (Benson *et al*., 1993; Smith, 1990).
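To put the growth figures quoted in Sections 3.2 and 3.3 into perspective, the following Python sketch estimates the doubling time they imply. It rests on several simplifying assumptions that are not claims of the source: that growth was steadily exponential between the two dates, that the 'sequences' of the first releases and the 'entries' of the later counts are comparable units, and that each quoted month can be treated as a single point in time. The function name is illustrative.

```python
import math


def doubling_time_years(start_count: int, end_count: int, years: float) -> float:
    """Doubling time implied by steady exponential growth between two counts."""
    doublings = math.log2(end_count / start_count)
    return years / doublings


# Elapsed time between the dates quoted in the text, in years.
embl_years = (2010 + 11 / 12) - (1982 + 5 / 12)      # June 1982 -> December 2010
genbank_years = (2011 + 1 / 12) - (1982 + 11 / 12)   # December 1982 -> February 2011

# Counts as quoted in the text (the GenBank total excludes WGS sequences).
embl = doubling_time_years(568, 199_720_869, embl_years)
genbank = doubling_time_years(606, 132_015_054, genbank_years)

print(f"EMBL data library: doubling roughly every {embl * 12:.0f} months")
print(f"GenBank: doubling roughly every {genbank * 12:.0f} months")
```

Under these assumptions, both resources doubled in size on the order of every year and a half over almost three decades, which helps to explain why annotation could no longer keep pace with the data.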

#### **3.4 The PIR-PSD**


database contained 199,720,869 entries.

NBRF (Smith, 1990; Strasser, 2008).

**3.3 GenBank** 

To some extent, the gathering momentum of nucleic acid sequence-collection efforts had begun to overshadow the steady progress being made in the world of protein sequences, most notably with the *Atlas*. By October 1981, this had run into its fifth volume, a large book with three supplements, listing more than 1,660 proteins. This information, as with all data collections, required constant updating and revision in the light both of new knowledge and of new data appearing in the literature. Moreover, as the community had become increasingly keen to harness the efficiency gains of central data repositories, and more databases were appearing on the horizon, making and maintaining cross-references to database entries, of necessity, had to become part of data-annotation and update processes if scientists were to be able to exploit new and existing sequence data fully. Under the circumstances, continued publication of the *Atlas* in paper form simply became untenable: the time was ripe to exploit the advances in computer technology that had given rise to the CSD, the PDB, the EMBL data library and GenBank. In 1984, the *Atlas* was consequently made available on computer tape as the Protein Sequence Database (PSD).


Later, in 1986, in order to facilitate protein sequence analysis more broadly, the NBRF established the Protein Identification Resource (PIR) (George *et al*., 1986). This new online system included the PSD, several bespoke query and analysis tools (*e.g.*, the Protein Sequence Query (PSQ), SEARCH and ALIGN programs), and a new, efficient search program, FASTP. The latter was a modification of an earlier algorithm for searching protein and nucleic acid sequences (Wilbur & Lipman, 1983). Interestingly, given that the number of deduced sequences had, by that time, grown into the thousands, the great advantage of Wilbur and Lipman's method was considered to be its speed. Indeed, their paper reported a "*substantial reduction in the time required to search a data bank*". Improving on this even further, the new FASTP algorithm was able to compare a 200-amino-acid sequence to the 2,677 sequences of the PSD in "*less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC)*" (Lipman & Pearson, 1985). Looking back, such search times on such small numbers of sequences seem incredibly slow; at the time (when a contemporary algorithm required 8 hours for the same search), they were revolutionary.
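The speed of this family of methods is generally attributed to looking up short words (k-tuples) shared by the query and each database sequence, rather than comparing every residue with every other. The sketch below illustrates only that general idea under simplifying assumptions; it is not the FASTP program, and the function names, the choice of k = 2 and the toy sequences are hypothetical.

```python
from collections import defaultdict


def best_diagonal_score(query: str, subject: str, k: int = 2) -> int:
    """Return the largest number of shared k-tuples found on any one diagonal."""
    # Lookup table: each k-tuple in the query -> the positions where it starts.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)

    # Single pass over the subject: every word hit votes for the
    # diagonal (j - i) on which the two sequences would be aligned.
    diagonal_hits = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], ()):
            diagonal_hits[j - i] += 1

    return max(diagonal_hits.values(), default=0)


def rank_database(query, database, k=2):
    """Sort (name, sequence) pairs by best diagonal score, highest first."""
    scored = [(best_diagonal_score(query, seq, k), name) for name, seq in database]
    return sorted(scored, reverse=True)


# Toy data, invented for illustration only.
toy_database = [("related", "XXACDEFGYY"), ("unrelated", "GFEDCA")]
print(rank_database("ACDEFG", toy_database))
```

Because the lookup table is built once per query and each database sequence is scanned only once, the cost per sequence grows roughly with its length, which is the kind of saving that made searches of a few thousand sequences practical on the hardware of the day.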

As the PIR was built on NBRF's existing resources, it also made available its DNA databank (Dayhoff *et al*., 1981a) and associated software tools, together with copies of GenBank and the EMBL data library; it also retained the NBRF's cost-recovery model, levying a charge for copies of its databases on magnetic tape and an annual subscription fee for use of its online services – in 1988, these amounted to \$200 per tape release and \$350 per annum respectively (Dayhoff *et al.*, 1981b; Sidman *et al*., 1988). By 1992, the PSD had shown steady growth, with increasing contributions from European and Asian protein sequence centres – most notably, from MIPS (Martinsried, Germany) and from JIPID (Tokyo, Japan). Accordingly, a tripartite collaboration was established, termed PIR-International, to formalise these relationships and establish and disseminate a comprehensive set of protein sequences (Barker *et al.*, 1992). By this time, charging for access to the resource was no longer mentioned, possibly both as a consequence of this more formal distribution arrangement and the advent of browsers like Mosaic, which had suddenly and dramatically changed the way that information could be broadcast and received over the World Wide Web (or, simply, the Web). In 1997 PIR changed its name to the Protein Information Resource (George *et al*., 1997) and, by 2003, with 283,000 sequences (Wu *et al*., 2003), the PSD was the most comprehensive protein sequence database in the world.

#### **3.5 Swiss-Prot**

While these events were taking place, a newly qualified Swiss student (who, as a teenager, had been interested in space exploration and the search for extraterrestrial life) attempted to embark on a Masters project involving both 'wet' and 'dry' work – this was Amos Bairoch. The experimental side of his project immediately hit problems when it was discovered that the new mass spectrometer he was to have used didn't work properly. He therefore set to work instead developing protein sequence analysis programs on the computer system running the spectrometer. These were the first steps towards creating the software system that was later to be known as PC/Gene, and was to become the most widely used PC-based sequence analysis package of its day (Bairoch, 2000).

Part of what made this software suite unique was its focus on proteins at a time when the analysis of nucleotide sequences was very much in vogue. In creating these tools, Bairoch entered >1,000 protein sequences into his computer by hand: some of these he gleaned from the literature; most were taken from the *Atlas*, which had not yet been released in electronic form. Of course, this was an immensely tedious process, and was also highly error-prone. Realising this, and anxious to avoid such problems for others in future, he wrote a letter to the *Biochemical Journal* recommending that researchers publishing protein and peptide sequences should compute checksums to "*facilitate the detection of typographical and keyboard errors*" (Bairoch, 1982). As part of the letter, he illustrated the computation of such a 'checking number' for an imaginary peptide, as shown in Figure 2. Although this recommendation was never widely adopted in publishing circles, Bairoch was at least able to ensure that it was implemented in his own database.


Fig. 2. Computation of a 'checking number' (CN) for an imaginary peptide, as published in a letter to the *Biochemical Journal* in 1982. The journal editors either didn't notice, or chose to ignore, the hidden message in the peptide. Reproduced with permission, from Bairoch, A. (1982), *Biochemical Journal*, 203, 527-528. © the Biochemical Society
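The value of publishing a 'checking number' alongside a sequence can be illustrated with a short sketch. The scheme below is a generic position-weighted checksum chosen purely for illustration; it is not the computation shown in Bairoch's 1982 letter, and the function name, the modulus and the peptide strings are all hypothetical.

```python
def checking_number(sequence: str, modulus: int = 10007) -> int:
    """Position-weighted checksum over a one-letter amino acid sequence.

    Illustrative only: an assumed scheme, not the one published in 1982.
    """
    total = 0
    for position, residue in enumerate(sequence.upper(), start=1):
        # Weighting each residue code by its position means that swapped
        # residues, as well as substituted ones, change the result.
        total += position * (ord(residue) - ord("A") + 1)
    return total % modulus


published = "MKTAYIAKQR"  # hypothetical peptide as printed in a paper
retyped = "MKTAYIAQKR"    # the same peptide with two residues transposed

# A reader who retypes the sequence recomputes the number and compares it
# with the published one; a mismatch signals a keyboard error.
print(checking_number(published), checking_number(retyped))
```

A simple sum of residue codes would miss the transposition in this example, which is why the sketch weights each residue by its position.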

Several other important developments were to emerge from the work of this enthusiastic and industrious student. For the analysis software he was developing, he needed to distribute both a nucleotide and a protein sequence database. In 1983, he acquired a computer tape containing 811 sequences in version 2 of the EMBL data library; for his protein sequence database, he initially used the sequences he'd typed in for his Masters project. However, the following year, he received the first electronic copy of the *Atlas*. He was quick to appreciate the advantages and disadvantages of the PIR and EMBL formats, recognising that converting the manually annotated data of the former into something like the semi-structured format of the latter could produce a resource with the strengths of both – he called this PIR+ and released it side-by-side with his software package, PC/Gene, which by that time he'd commercialised through IntelliGenetics (Bairoch, 2000).

Use of the publicly available PIR data-set in this way was not without its problems. Amongst other, deeper, issues were the difficulty of parsing PIR files to extract specific information (*e.g.*, relating to post-translational modifications (PTMs), *etc*.); the lack of functional annotations for some of the newer entries; the lack of cross-referencing to the parent DNA of a given protein sequence; and so on. Somewhat ironically, given what he went on to achieve, Bairoch has written of this period, "*As I was not interested in building up databases I kept sending letters to PIR to ask them to remedy this situation*". But his pleas met with little success. In the summer of 1986, in the face of increasing demand for unencumbered access to his database, he decided to release PIR+ independently of PC/Gene, to make it freely available to the entire research community. The new, public version of the database was released on 21 July 1986 and contained ~3,900 sequences (the exact number is unknown as the original floppy disks have been lost!). This new resource was called Swiss-Prot (Bairoch & Boeckmann, 1991), and was to become the foremost manually annotated database of protein sequences in the world.

#### **3.6 The European Molecular Biology Network (EMBnet)**

It is interesting that, during this era, the distribution of databases like the EMBL data library, PIR, Swiss-Prot and so on, was still largely effected by the exchange of computer tapes and disks. By this time, a variety of computer networks had begun to evolve: the first such network, ARPANET (which began life with 4 nodes in late 1969), was the progenitor of the Internet, and was superseded by it in 1983 – recall, it was partly owing to the existence of ARPANET that GenBank was established at Los Alamos. Other networks that offered gateways into the Internet later merged with it, including Usenet and BITNET; commercial and educational networks, such as Telenet (or Sprintnet), Tymnet, Compuserve and JANET, were interconnected with it in the 1980s.

In 1988, Chris Sander at the EMBL helped to establish a new network, EMBnet, to disseminate data, knowledge and services, to support and advance molecular biology and biotechnology research across Europe. A major driver for creating EMBnet was the need for local access to databases such as the EMBL data library from centralised sources. Essentially, this is because scientists were now demanding to use client workstations with Graphical User Interfaces (GUIs) that provided real-time interaction with their back-end data/analysis servers. At the time, high-speed data communication across Europe was in its infancy, and access to remote computers using ordinary command-line oriented terminals was too slow. It was clear that communication delays could be eliminated if servers held copies of data locally; the sheer amount of compute resources needed for European research in this field also pointed to a distributed solution (note that computer cluster technology only gained widespread acceptance much later). Thus, an organised way of distributing data and resources from the EMBL to its member states had to be established.

The concept of a network of national 'nodes', each serving its country with up-to-date biological databases and also providing compute resources for data analysis, was formulated. It was given the name the European Molecular Biology network, EMBnet. The first practical steps were taken in the spring of 1988 to solicit feedback from scientists around Europe; and in July 1988, the first EMBnet Workshop was organised at EMBL, with participants from EMBL, Daresbury (UK), CITI2 (France), the CAOS/CAMM Centre (the Netherlands) and Hoffmann-La Roche. In November of that year, the EMBL Director General corresponded with EMBL Council members, encouraging them to stimulate local processes to identify regional EMBnet nodes. As more countries joined the network (France, Sweden, the UK, the Netherlands, Spain, Israel, Norway, Italy and Denmark, with Switzerland, West Germany, Austria, Greece and Finland waiting in the wings), EMBnet received its first European grant under the BRIDGE framework, in 1991.

The principal project objective was to promote EMBnet as a computer network for European bioinformatics. Service provision and knowledge sharing were to be orchestrated primarily by 'National Nodes', with government mandates to support their local communities, especially by providing access to bioinformatics data synchronised with the EMBL, GenBank and DDBJ central data repositories – in time, the network also attracted a number of 'Specialist' and 'Industrial' Nodes, whose resources and know-how were seen to complement those of its National Nodes (this arrangement of cooperating Nodes is illustrated in Figure 3).

Most EMBnet Nodes had VAX computers, and the original intention was to use DECNET as the underlying transport protocol. However, after a short, but expensive, period of using X25/DataPak, this was replaced by a TCP/IP-package called MultiNet, which was licensed for all EMBnet Nodes from SRI (Stanford Research Institute). FTP-transmissions of database updates were often interrupted by network problems, and, to overcome the need for frequent re-transmissions, the NDT (Network Data Transfer, later xNDT for extended NDT) protocol was developed at the Swedish EMBnet Node at Uppsala Biomedical Centre, by Peter Gad. It was given a so-called 'systems well-known port' (embl-ndt, 394/udp,# EMBL Nucleic Data Transfer) by the Internet authorities, and is thus in good company with, for example, Telnet (port 23) and FTP (ports 20, 21). For a few years, (x)NDT, and its accompanying suite of client-server programs, was the method par preference, used at almost all EMBnet Nodes to keep their local databases updated. NDT took care of the transmission (database) entry by entry and didn't have to re-start following network interruptions. The Greek node, situated in Crete, only had a modem connection to the mainland, and benefited hugely from using the xNDT-suite. Indeed, at the time the European Bioinformatics Institute was established (when the EMBL Data Library moved from Heidelberg to Cambridge), most of the nucleotide sequence database update traffic in Europe was routed via the Swedish node using xNDT.
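The benefit of transferring a database entry by entry, rather than as one monolithic file, can be sketched as follows. This is a minimal illustration of checkpointed, resumable transfer in general; it does not reproduce the NDT protocol or its message format, and the class, the function names and the simulated failure pattern are all hypothetical.

```python
class FlakyLink:
    """Simulated unreliable connection that drops every third send."""

    def __init__(self, fail_every=3):
        self.calls = 0
        self.fail_every = fail_every
        self.received = []

    def send(self, entry):
        self.calls += 1
        if self.calls % self.fail_every == 0:
            raise ConnectionError("link dropped")
        self.received.append(entry)


def transfer_entries(entries, send, start_at=0):
    """Send entries one at a time; return the index of the first unsent entry."""
    for index in range(start_at, len(entries)):
        try:
            send(entries[index])
        except ConnectionError:
            return index  # checkpoint: the next attempt resumes here
    return len(entries)


def update_local_copy(entries, send, max_attempts=10):
    """Retry from the last checkpoint until every entry has been delivered."""
    checkpoint = 0
    for _ in range(max_attempts):
        checkpoint = transfer_entries(entries, send, start_at=checkpoint)
        if checkpoint == len(entries):
            return True
    return False


link = FlakyLink()
update = ["ENTRY1", "ENTRY2", "ENTRY3", "ENTRY4", "ENTRY5"]
print(update_local_copy(update, link.send), link.calls)
```

Only the entry that was in flight when the link dropped is sent again, so a slow or unreliable connection loses at most one entry's worth of work per interruption.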

Fig. 3. Illustration of the relationship between the different Nodes of the early EMBnet: some National Nodes had either Specialist or Industrial Nodes affiliated with them; some had both; some had neither. Today, 31 National and Specialist Nodes contribute to the Network.

developed a program to scan his growing collection of sequence 'patterns'. This part-program, part-database chimera he named PROSITE (Bairoch, 1991). In March 1988, as part of PC/Gene, the first release of this new resource contained 58 entries.

As with Swiss-Prot before it, PROSITE swiftly gained popularity. Its growing band of users began not only to suggest additional patterns that could be included in the database, but also to pressure Bairoch into giving PROSITE an independent life of its own, outside PC/Gene. Consequently, the availability of a new public version was announced in October 1989, and formally released the following month with 202 entries (version 4.0). Diagnostically, it was clear that sequence patterns had certain limitations. In particular, matching a pattern is a binary 'match/no-match' event: even the most trivial difference (a single amino acid) results in a mis-match. As Swiss-Prot expanded and accommodated more and more divergent members of its various superfamilies, the more evident this particular weakness became. One solution to this problem emerged in the form of position-specific weight matrices, or profiles. Built from comprehensive sequence alignments, profiles are tolerant both of amino acid substitutions and of insertions/deletions; they therefore allow the relationships between families of sequences to be modelled more 'realistically'. Accordingly, with the help of Philipp Bucher, Bairoch began to augment PROSITE with sequence profiles – the first release to include them came with version 12.0, in June 1994 (Bairoch & Bucher, 1994).

Another solution, which arose (at least methodologically) independently from PROSITE, was the development of protein family 'fingerprints'. Fingerprints are groups of conserved motifs, evident in multiple sequence alignments, whose unique inter-relationships provide distinctive signatures for particular protein families and structural/functional domains. They are diagnostically more powerful and flexible than patterns, because they can tolerate mis-matches at the level both of individual motifs and of the fingerprint as a whole. Fingerprints formed the basis of a database that began life as the Features Database, part of the SERPENT information storage and analysis resource for protein sequences established at the University of Leeds (Akrigg *et al*., 1992). Its first release, in October 1991, contained 29 entries: two thirds of these were linked to equivalent entries in PROSITE, which by then held 441 family descriptions.

Although disparate in size, the Features and PROSITE databases had various aspects in common; most notable amongst these was the principle of added-value through hand-crafted annotation of their diagnostic signatures. In March 1991, Bairoch met Terri Attwood for the first time at the British Crystallographic Association spring meeting in Sheffield. Faced with the same, relentlessly time-consuming, manual-annotation burdens, they shared their woes and discussed the wisdom of unifying the PROSITE and Features databases. Motivated by common ideals, they later formalised their ideas in the guise of their first European grant proposal to merge their databases into an integrated protein family annotation resource. This was 1992; they were not successful.

In the meantime, inspired by PROSITE, a range of other signature databases began to emerge. One of the earliest of these was Blocks, first described by Steve and Jorja Henikoff in December 1991 (Henikoff & Henikoff, 1991). Later came ProDom (Sonnhammer & Kahn, 1994), and later still Pfam (Sonnhammer *et al*., 1997). Initially linked closely to the annotation of predicted proteins from genomic sequencing of *Caenorhabditis elegans*, Pfam was to become one of the most widely used protein family databases across Europe and the USA.

By the early '90s, biomolecular databases could be accessed across the Internet by means of the WAIS and Gopher network retrieval systems; and, under the auspices of EMBnet, Reinhard Doelz had developed a new network access protocol, HASSLE, the Hierarchical Access System for Sequence Libraries in Europe (Doelz, 1994). But it was the advent of graphical Web browsers (first, Mosaic from the National Center for Supercomputing Applications in 1993, and then Netscape Navigator in 1994) that revolutionised the processes of database dissemination and information consumption – literally, at the click of a mouse button.

Of course, browsers allowed data and documents of all kinds to be instantly shared, and individuals and organisations across the globe were quick to establish their own unique 'Web presence'. EMBnet was no exception, and embraced the Web as a means of communicating more effectively with its widening community, in particular by publishing a regular newsletter, *EMBnet.news.* The newsletter was designed to provide reports and updates on its internal and international activities and achievements, together with technical and scientific papers on new developments in bioinformatics, computational biology and biocomputing. In 2000, the organisation provided an educational grant to help support the creation of the peer-reviewed journal *Briefings in Bioinformatics* (BiB) and, as a mark of its own success, *EMBnet.news* is also now in the process of transitioning to a peer-reviewed journal.

From the outset, EMBnet has promoted the development of distributed computing services to share workload among international servers; it has contributed to the development and maintenance of advanced database systems; it has been an advocate of the deployment of Grid technologies for the life sciences through its contributions to major European Grid projects; it developed, and continues to promote the use of, an e-learning system both to support distance learning in bioinformatics and to complement face-to-face bioinformatics teaching and training; and it is committed to bringing the latest software and algorithms to users, free of charge.

The combined expertise of its Nodes has allowed EMBnet to provide services to its local European life science communities with far greater effect than could be achieved by any of its individual Nodes in isolation. Following this success, a variety of Nodes world-wide have joined EMBnet such that, today, the network is global, with many countries from Asia, Africa and America joining in recent years (including Sri Lanka, Pakistan, Kenya and Costa Rica). Currently, the network connects 31 member Nodes extending over 27 countries; together, the Nodes continue to work to disseminate data, to share compute resources and to provide training support, reaching out to many thousands of users.

#### **3.7 PROSITE**

While EMBnet was being conceived, before the Internet had truly taken off, and while bioinformatics was still in the throes of being born, the computer-savvy molecular biologists of the day were still busily swapping biomolecular databases on magnetic tapes and computer disks. Perhaps an inevitable consequence of the systematic collection of protein and nucleotide sequences in this way was the need to organise and classify these molecular entities in meaningful ways. The first endeavour to categorise protein sequences into evolutionarily related families, and to provide the diagnostic means to detect potential new family members, arose once again as a derivative of the PC/Gene suite. Inspired by the sequence analysis primer, *Of URFs and ORFs* (Doolittle, 1986), Bairoch began to amass examples of short sequences, characteristic of particular binding and active sites, and developed a program to scan his growing collection of sequence 'patterns'. This part-program, part-database chimera he named PROSITE (Bairoch, 1991). In March 1988, as part of PC/Gene, the first release of this new resource contained 58 entries.

As with Swiss-Prot before it, PROSITE swiftly gained popularity. Its growing band of users began not only to suggest additional patterns that could be included in the database, but also to pressure Bairoch into giving PROSITE an independent life of its own, outside PC/Gene. Consequently, the availability of a new public version was announced in October 1989, and formally released the following month with 202 entries (version 4.0). Diagnostically, it was clear that sequence patterns had certain limitations. In particular, matching a pattern is a binary 'match/no-match' event: even the most trivial difference (a single amino acid) results in a mis-match. As Swiss-Prot expanded and accommodated more and more divergent members of its various superfamilies, this particular weakness became ever more evident. One solution to this problem emerged in the form of position-specific weight matrices, or profiles. Built from comprehensive sequence alignments, profiles are tolerant both of amino acid substitutions and of insertions/deletions; they therefore allow the relationships between families of sequences to be modelled more 'realistically'. Accordingly, with the help of Philipp Bucher, Bairoch began to augment PROSITE with sequence profiles – the first release to include them came with version 12.0, in June 1994 (Bairoch & Bucher, 1994).
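The contrast between binary pattern matching and graded profile scoring can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the pattern `C-x(2)-[ST]-{P}-H` and the four aligned motif instances are invented toy data, not PROSITE entries; the function names are hypothetical; and the translator handles only a simplified subset of the PROSITE pattern syntax (`x`, `[..]`, `{..}` and repeat counts). The weight matrix is a plain ungapped log-odds matrix with a uniform background, so, unlike true profiles, it does not model insertions or deletions.

```python
import math
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# One pattern element: x, a single residue, [allowed] or {forbidden},
# optionally followed by a repeat count such as (2) or (2,4).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a regular expression."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                      # x means any residue
            core = "."
        elif core.startswith("{"):           # {P} means any residue except P
            core = "[^" + core[1:-1] + "]"
        if low:                              # (n) or (n,m) repeat counts
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def build_pssm(aligned_motifs, pseudocount=0.5):
    """Build a log-odds weight matrix from equal-length, ungapped motif instances."""
    background = 1 / len(AMINO_ACIDS)        # assumed uniform background
    denominator = len(aligned_motifs) + pseudocount * len(AMINO_ACIDS)
    return [
        {aa: math.log2(((column.count(aa) + pseudocount) / denominator) / background)
         for aa in AMINO_ACIDS}
        for column in zip(*aligned_motifs)
    ]


def pssm_score(pssm, segment):
    """Sum the position-specific log-odds scores for one sequence segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Toy data (hypothetical): a pattern and the aligned instances it was drawn from.
regex = re.compile(prosite_to_regex("C-x(2)-[ST]-{P}-H"))
pssm = build_pssm(["CAASGH", "CKLTAH", "CRDSGH", "CAETGH"])

for segment in ("CAASGH", "CAAAGH", "WWWWWW"):
    hit = regex.search(segment) is not None
    print(segment, "pattern match:", hit, "profile score:", round(pssm_score(pssm, segment), 2))
```

The segment `CAAAGH` differs from a true family member by a single residue, so the pattern rejects it outright, whereas the weight matrix merely lowers its score: it still ranks well above an unrelated segment. That graded behaviour is what makes profiles better suited to divergent superfamilies.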

Another solution, which arose (at least methodologically) independently from PROSITE, was the development of protein family 'fingerprints'. Fingerprints are groups of conserved motifs, evident in multiple sequence alignments, whose unique inter-relationships provide distinctive signatures for particular protein families and structural/functional domains. They are diagnostically more powerful and flexible than patterns, because they can tolerate mis-matches at the level both of individual motifs and of the fingerprint as a whole. Fingerprints formed the basis of a database that began life as the Features Database, part of the SERPENT information storage and analysis resource for protein sequences established at the University of Leeds (Akrigg *et al*., 1992). Its first release, in October 1991, contained 29 entries: two thirds of these were linked to equivalent entries in PROSITE, which by then held 441 family descriptions.
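The tolerance of fingerprints to mis-matches, both within individual motifs and across the fingerprint as a whole, can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the method used by the Features Database: motifs are represented here as plain strings compared by counting mismatches, the scan is a greedy left-to-right search, and the motifs, sequences and function names are all invented for the example.

```python
def find_motif(sequence, motif, start, max_mismatches):
    """Return the first position at or after `start` where the motif fits
    with at most `max_mismatches` substitutions, or None if there is none."""
    for i in range(start, len(sequence) - len(motif) + 1):
        window = sequence[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            return i
    return None


def scan_fingerprint(sequence, motifs, max_mismatches=1):
    """Look for each motif in order; a missing motif is recorded as None
    instead of aborting the scan, so partial fingerprints are still reported."""
    hits = []
    cursor = 0
    for motif in motifs:
        position = find_motif(sequence, motif, cursor, max_mismatches)
        hits.append(position)
        if position is not None:
            cursor = position + len(motif)   # later motifs must lie downstream
    return hits


# Toy fingerprint of three motifs (hypothetical data).
FINGERPRINT = ["GDSGG", "CHAAL", "WTRDE"]

full_member = "MKGDSGGPLLACHAALTTTWTRDEK"   # all three motifs intact
divergent = "MKGDAGGPLLAQQQQQTTTWTRDEK"     # one substitution, one motif lost

print(scan_fingerprint(full_member, FINGERPRINT))
print(scan_fingerprint(divergent, FINGERPRINT))
```

In this sketch the divergent sequence still matches two of the three motifs, one of them imperfectly, so it can be reported as a partial match to the family rather than being discarded, which a single exact pattern could not do.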

Although disparate in size, the Features and PROSITE databases had various aspects in common; most notable amongst these was the principle of added-value through handcrafted annotation of their diagnostic signatures. In March 1991, Bairoch met Terri Attwood for the first time at the British Crystallographic Association spring meeting in Sheffield. Faced with the same, relentlessly time-consuming, manual-annotation burdens, they shared their woes and discussed the wisdom of unifying the PROSITE and Features databases. Motivated by common ideals, they later formalised their ideas in the guise of their first European grant proposal to merge their databases into an integrated protein family annotation resource. This was 1992; they were not successful.

In the meantime, inspired by PROSITE, a range of other signature databases began to emerge. One of the earliest of these was Blocks, first described by Steve and Jorja Henikoff in December 1991 (Henikoff & Henikoff, 1991). Later came ProDom (Sonnhammer & Kahn, 1994), and later still Pfam (Sonnhammer *et al*., 1997). Initially linked closely to the annotation of predicted proteins from genomic sequencing of *Caenorhabditis elegans*, Pfam was to become one of the most widely used protein family databases across Europe and the USA.


#### **3.8 The European Bioinformatics Institute (EBI)**

Notwithstanding the proliferation of databases in the '80s, funding for their maintenance was becoming a significant problem. By the early '90s, supporting the EMBL data library was becoming increasingly difficult, and there was growing awareness that a more efficient European bioinformatics infrastructure would be needed to sustain it in future. In 1992, the EMBL concluded that the most robust solution would be to establish a new outstation, devoted to bioinformatics. The vision of creating a European Bioinformatics Institute (EBI) quickly took hold and, in December that year, the EMBL Governing Council published a call for proposals to host the new facility. The deadline was extremely short (February 1993); despite the interest of many countries, therefore, few were able to submit bids in time.

In a study by PA Consulting Group, commissioned by the EC's DGXII, a plan had been developed for a European Nucleotide Sequence Centre (ENSC). The EMBL Council decided to "*negotiate with the EC for the inclusion of the ENSC within the EBI*"; the EBI "*would provide bioinformatics services for European scientists, be a home for the Data Library, and include expansions in research and development necessary for long-term viability and strengthening of neglected areas such as user support*" (Philipson, 1992).

In EMBL's proposal for an EBI from October 1992, worries were expressed that Europe was lagging behind the USA: "*Over the last decade increments in US support for such resources have far outstripped those in Europe*," and the EBI was conceived "*to ensure that European research needs are satisfied in a way which is appropriate to this global competitive context*" (EMBL, 1992). The need for supportive relations between EBI and the European scientific community was emphasised, as "*It would be impossible and undesirable for the EBI to be the sole bioinformatics resource in Europe*". It was noted that support should be given to "*major European interest groups such as software developers, database hosts and other bioinformatics institutes*"; more specifically, "*In recognition of the need for strong national bioinformatics activities, the EBI will give technical and organisational support to the EMBnet Nodes, as is currently done by the EMBL Data Library*" (EMBL, 1992).

Among the bidders for the EBI were Germany, Sweden and the UK. Very favourable conditions were offered by all three. The Swedish bid for an EBI close to Uppsala Biomedical Centre included, for example, sufficient office space, free of rent, and high-speed network connections. But Michael Ashburner led a more compelling UK bid. The proposal was to host the EBI on a park, newly purchased by the Wellcome Trust, at Hinxton, on the outskirts of Cambridge. The Trust and MRC had agreed each to fund half of the initial capital costs of creating a complete genomics infrastructure on this site, which would also include the newly established Sanger Centre (which, by then, had become embroiled in the HGP) and the Human Genome Mapping Project Resource Centre (Dickson & Abbott, 1993). With its "*clear commitment from all levels of the UK scientific community and Government*", the UK bid won over both Uppsala and the alternative location in Heidelberg, directly adjacent to the EMBL; it was accepted by Council in March 1993. Paolo Zanella (who had directed the CERN Data Handling Division) was subsequently appointed as EBI's first director (Bairoch, 2000).

The EBI became fully operational after completion of the new building in September 1995 – this will no doubt have come as a great relief to the EMBL data library group, who had been accommodated in portable cabins on the Hinxton site since the end of 1994! The new facility had 3 broad divisions: research, industry and services, the latter being mostly devoted to provision and maintenance of the EMBL data library and Swiss-Prot (Bairoch, 2000). The EBI's mission was to ensure that the growing corpus of data from molecular biology and genome research was placed in the public domain and was freely accessible to the entire scientific community in order to promote scientific progress. Today, with its original 3-fold structure still largely in place, the Institute builds, maintains and disseminates databases and information services relevant to molecular biology, genetics, medicine and agriculture, and undertakes leading-edge research in bioinformatics and computational biology.

Despite its pivotal role as Europe's main bio-database provider, four years later, the EBI was in financial trouble. While the Wellcome Trust and MRC had financed the initial capital costs, the Institute relied on the EU for almost half its budget. In March 1999, however, the member states had advised the Commission that core funding and operational costs for infrastructure should not qualify for funding; the EBI's application for Framework funds was consequently rejected for being out of scope. Graham Cameron, by then joint Head of the Institute with Michael Ashburner, was quick to point out that without an immediate solution, "*we will have to abandon major projects like the DNA database, the draft human genome, the macromolecular structure database and the microarray expression database*" (Butler, 1999). The EBI was in a tricky situation, and Britain had shot itself in the foot: it could hardly contest the Commission's ruling against supporting the EBI because, a Commission official pointed out, "*it was among the countries most against funding infrastructure directly*" (Butler, 1999). The situation was neatly summed up in an editorial *Nature* ran at the time, "*If this Kafkaesque affair has any merit, it is that it has exposed the absence of a clear mechanism for the planning and support of research infrastructure at the European level*" (Nature Editorial, 1999). The cries for new mechanisms for infrastructural support, with stable partners, stable financing and long-term political commitment, doubtless helped to sow the seeds that in 2008 grew into the preparatory phase of ELIXIR, the European Life Science Infrastructure for Biological Information project.

#### **3.9 Global data overload**


The late '80s and early '90s were fertile years, giving rise to a flourishing number of new molecular structures and sequences, to new breeds of protein family signatures, and to new databases in which to store them. Looking back at this period of fervent activity, it's incredible to reflect that two major developments had yet to take place: together, these would not only seed an overwhelming explosion of biological data but would also spur their global dissemination – they were the advent of the Web and the arrival of high-throughput DNA sequencing. The latter made whole-genome sequencing practically feasible for the first time. An unprecedented burst of sequencing activity followed, yielding, in quick succession, for example, the genomes of *Haemophilus influenzae* and *Mycoplasma genitalium* in 1995 (Fleischmann *et al*., 1995; Fraser *et al*., 1995), of *Methanococcus jannaschii* and *Saccharomyces cerevisiae* in 1996 (Bult *et al.*, 1996; Goffeau *et al*., 1996), of *Caenorhabditis elegans* in 1998 (*C. elegans* sequencing consortium, 1998), of *Drosophila melanogaster* in 2000 (Adams *et al*., 2000) and, the ultimate prize, of *Homo sapiens* in 2001 (Lander *et al.*, 2001; Venter *et al*., 2001; IHGSC, 2004). Hundreds of genomes have been sequenced since this fruitful dawn.

Hand-in-hand with these activities came the development of numerous organism-specific databases to store the emerging genomic data: for example, FlyBase (Ashburner & Drysdale, 1994), ACeDB (Eeckman & Durbin, 1995), SGD (Cherry *et al*., 1998), TAIR (Huala *et al.*, 2001), Ensembl (Hubbard *et al*., 2002), DictyBase (Kreppel *et al*., 2004) and, of course, many more. For some, the value of this genomic 'gold rush' was not entirely clear: with much of the amassed data seemingly impossible to characterise, and vast amounts of it non-coding, the hoped-for

Concepts, Historical Milestones and

to do so, new processes had to be put in place.

**3.10 TrEMBL** 

simply became untenable.

assisted annotation strategies.

**3.11 InterPro** 

the Central Place of Bioinformatics in Modern Biology: A European Perspective 21

passing year from the mid '90s, there was a widening gulf both between the volume of accumulating uncharacterised genomic sequence data and the fraction of this that it was possible to annotate, and between the quantities of deposited biomolecular sequence and structure data. Against this backdrop, Bairoch announced the development of a separate, automatically generated counterpart to augment Swiss-Prot, to help disseminate the fruits of the increasingly abundant genome projects more efficiently, without compromising the quality of Swiss-Prot by including within it substantial quantities of uncharacterised data.

By 1996, the first shock-waves from the impact of whole genome sequencing were beginning to be felt. The aftermath was greatest for databases whose maintenance involved significant amounts of manual annotation. Some did not recover. Swiss-Prot did survive the quake, but

treasure troves were beginning to look about as inspiring as large-scale collections of butterflies (Strasser, 2008), and perhaps suggested that molecular biology had entered a somewhat vacuous era of "*high-tech stamp collecting*" (Hunter, 2006). Arguments like this characterised some of the early opposition to the establishment of GenBank, and the substantial resistance to the Human Genome Project (HGP) a few years later (Strasser, 2008).

Perhaps inevitably, then, the HGP was an extraordinarily high-profile affair. This was partly for the reasons outlined above, coupled with its considerable price-tag (estimated at \$3 billion from 1990-2003), but in part also because of the public-private race between Francis Collins (who was directing the NIH National Human Genome Research Institute contributions to the HGP) and Craig Venter (then Head of Celera Genomics) to obtain the first rough draft of man's genetic blueprint. This intensely political 'drama' had been preceded by a similar struggle to be the first to sequence *Drosophila*, which served as a kind of 'warm up' battle for the human genome (Ashburner, 2006); it also had an intriguing parallel in the competition between two public-private corporations to sequence the genome of the commercially valuable *Agrobacterium tumefaciens* (Goodner *et al*., 2001; Wood *et al*., 2001; Harvey & McMeekin, 2004). The principal tension between these public and private, and public-private hybrid, enterprises arose not just from the race to be first to complete the sequencing: the struggle was as much about making the results public, on the one hand, and obtaining the property rights (for commercial exploitation, including gene patenting), on the other. Like the concerns in the early '80s surrounding NBRF's proprietary interest in protein sequences culled from the public domain, such conflicts raised serious questions about the duty of public science to ensure that genome sequences were made available for the public good; moreover, they challenged such wasteful competition, resulting in the acquisition of duplicate data-sets and, usually, back-to-back publications in high-profile journals (Harvey & McMeekin, 2004).

Another, more tangible, consequence of this intense orgy of genomic sequencing was the generation of more data than could realistically be managed and annotated by hand – and this was just the tip of an enormous future iceberg. As illustrated in Figure 4, with each passing year from the mid '90s, there was a widening gulf both between the volume of accumulating uncharacterised genomic sequence data and the fraction of this that it was possible to annotate, and between the quantities of deposited biomolecular sequence and structure data. Against this backdrop, Bairoch announced the development of a separate, automatically generated counterpart to augment Swiss-Prot, to help disseminate the fruits of the increasingly abundant genome projects more efficiently, without compromising the quality of Swiss-Prot by including within it substantial quantities of uncharacterised data.

Fig. 4. Growth of the EMBL data library (millions of entries) since its inception (red curve). Also shown is the corresponding growth of the manually-annotated Swiss-Prot (green line), and of structures deposited in the PDB (this line is too small to be visible!).

#### **3.10 TrEMBL**

By 1996, the first shock-waves from the impact of whole genome sequencing were beginning to be felt. The aftermath was greatest for databases whose maintenance involved significant amounts of manual annotation. Some did not recover. Swiss-Prot did survive the quake, but to do so, new processes had to be put in place.

At the time, Swiss-Prot had the highest standard of annotation of any publicly available protein sequence database: from the outset, one of its leading goals was to provide critical analyses for all of its constituent sequences. To this end, each entry was accompanied by a significant amount of annotation, derived primarily from original publications and review articles by an expanding group of curators, with occasional input from an international panel of experts. This high degree of meticulous manual annotation had always been the rate-limiting step for each release of the resource; however, faced with the increased data flow from the growing number of genome projects, this hugely labour-intensive process simply became untenable.

To keep up, it was clear that a new approach was needed. The products of genomic sequences had to be made available more swiftly; but how could this be achieved without compromising the high quality of the existing Swiss-Prot data, or eroding the editorial standards of the database in future? The answer was to prepare a computer-generated supplement, with entries in a Swiss-Prot-like format, derived by translation of coding sequences in the EMBL library – this was TrEMBL, first released in October 1996 (Bairoch & Apweiler, 1996). TrEMBL 1.0 contained almost 105,000 entries, not far off twice the size of Swiss-Prot 34.0 (59,000 entries), with which it was released in parallel.
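The core step described above, translating a coding sequence into a protein sequence, can be illustrated with a minimal Python sketch. It is not the TrEMBL pipeline: the function name `translate_cds` is hypothetical, the standard genetic code is assumed throughout, and the handling of feature tables, alternative genetic codes and partial sequences that a production system would need is omitted.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence (hypothetical helper, standard code only).

    Translation stops at the first stop codon; codons containing
    ambiguous bases are reported as 'X'.
    """
    seq = cds.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    protein = []
    for i in range(0, len(seq), 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


example_protein = translate_cds("ATGGCCAAGTAA")  # Met-Ala-Lys, then stop
```

Entries produced in this way carry a sequence but no curated annotation, which is why such a supplement could be generated quickly without touching the manually curated Swiss-Prot entries.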

Initially, TrEMBL was an unannotated supplement to Swiss-Prot. Over the years, however, to accelerate the process of upgrading TrEMBL entries to the Swiss-Prot standard, automatic protocols have been established to annotate sequences with information about their potential functions, metabolic pathways, active sites, cofactors, binding sites, domains, subcellular location, and so on. Such information was derived from similarity and motif searches, initially using patterns, profiles, fingerprints and so on from databases like PROSITE, PRINTS and Pfam, and later using the amalgamated protein family resource, InterPro. By February 2011, with many millions of entries, TrEMBL was almost 26 times larger than Swiss-Prot, illustrating the vast disparity between manual and computer-assisted annotation strategies.

#### **3.11 InterPro**

Rolf Apweiler was to spearhead the development of TrEMBL at the EBI in collaboration with Bairoch at the Swiss Institute of Bioinformatics (SIB). In 1997, Michael Ashburner (then Director of the EBI) awarded Attwood an EBI Visiting Fellowship. This entailed weekly visits from London, and led to frequent discussions between Apweiler, Attwood and Bairoch about sequence annotation. The feasibility of uniting PROSITE and PRINTS again reared its head, but this time primarily as an instrument to help analyse and functionally annotate the growing numbers of uncharacterised genomic sequences. Compared to the original proposal in 1992, the case was much stronger, especially as there were now other related databases to bring into the picture: Daniel Kahn had released ProDom in 1994, and Richard Durbin had just announced Pfam. A new proposal was therefore submitted to the European Commission, and the vision of an integrated protein family database was finally funded.

In October 1999, a beta release of the unified resource was made with 2,423 entries (representing 615 domains, 1,776 families, 27 repeats and 8 sites of PTM), based on Swiss-Prot 38.0 and TrEMBL 11.0 – this was InterPro (Apweiler *et al*., 2001). By that time, PROSITE and the Features Database had both undergone significant changes: PROSITE had seen 3-fold growth to 1,370 entries (release 16.0); meanwhile, the Features Database had grown 40-fold to 1,157 entries (release 23.1) and had been renamed 'PRINTS' (Attwood *et al*., 1994). The first release of InterPro therefore combined the contents of PROSITE 16.0 and PRINTS 23.1; it also incorporated descriptors from 241 profiles, together with 1,465 hidden Markov models from Pfam 4.0.

Fig. 5. Stylised illustration of the relationship between the InterPro integrating hub, its founding databases and its later additional partners, all of which contribute diagnostic signatures and, in some cases, protein family and domain annotation. The arrows indicate that information is shared both between satellite databases and between satellites and the central hub. See Table 3 for further details.

ProDom, although part of the original consortium (see Figure 5), was not included in the first release, initially because there was no obvious way of doing so. ProDom is built from automatically generated sequence clusters: it isn't a true signature database, in the sense that it doesn't exploit diagnostic discriminators; moreover, its sequence clusters need not have precise biological correlations, so can change between database releases. Assigning stable accession numbers to its entries was therefore impossible; this issue had to be addressed before it could be meaningfully included in InterPro. Other factors rendered a step-wise approach to the development of InterPro desirable. The scale of amalgamating just PROSITE, PRINTS and Pfam was immense. Trying to sensibly merge apparently equivalent database entries that, in fact, defined specific families, domains within those families, or even repeats within those domains, presented enormous challenges. In the beginning, InterPro therefore focused on amalgamating databases that offered some level of annotation, to facilitate the integration process.

Over the years, further partners joined the InterPro consortium, as illustrated in Figure 5. Today, with 12 primary sources, the integration challenges are legion (some of the complexity can be understood from the list of partners, and the numbers of their signatures that InterPro has incorporated, shown in Table 3)! With 21,185 entries in February 2011 (release 31.0), it is the most comprehensive integrated protein family database in the world (Hunter *et al*., 2009).

| Signature Database | Version | Signatures | Integrated Signatures |
|---|---|---|---|
| GENE3D | 3.3.0 | 2,386 | 1,377 |
| HAMAP | 021210 | 1,675 | 1,429 |
| PANTHER | 7.0 | 80,933 | 1,777 |
| PIRSF | 2.74 | 3,248 | 2,791 |
| PRINTS | 41.1 | 2,050 | 2,009 |
| PROSITE patterns | 20.66 | 1,308 | 1,292 |
| PROSITE profiles | 20.66 | 901 | 877 |
| Pfam | 24.0 | 11,912 | 11,465 |
| ProDom | 2006.1 | 1,894 | 1,008 |
| SMART | 6.1 | 895 | 882 |
| SUPERFAMILY | 1.73 | 1,774 | 1,154 |
| TIGRFAMs | 9.0 | 3,808 | 3,796 |

Table 3. InterPro release 31.0, February 2011.
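To make the uneven scale of the integration task in Table 3 more concrete, the following short Python sketch computes, for each member database, the fraction of its signatures that had been integrated into InterPro 31.0. The figures are copied from Table 3; the dictionary and function names are illustrative only and do not correspond to any InterPro software.

```python
# (total signatures, integrated signatures) per member database,
# copied from Table 3 (InterPro release 31.0, February 2011).
TABLE_3 = {
    "GENE3D": (2386, 1377),
    "HAMAP": (1675, 1429),
    "PANTHER": (80933, 1777),
    "PIRSF": (3248, 2791),
    "PRINTS": (2050, 2009),
    "PROSITE patterns": (1308, 1292),
    "PROSITE profiles": (901, 877),
    "Pfam": (11912, 11465),
    "ProDom": (1894, 1008),
    "SMART": (895, 882),
    "SUPERFAMILY": (1774, 1154),
    "TIGRFAMs": (3808, 3796),
}


def integrated_fraction(total: int, integrated: int) -> float:
    """Fraction of a database's signatures integrated into InterPro."""
    return integrated / total


coverage = {
    name: integrated_fraction(total, integrated)
    for name, (total, integrated) in TABLE_3.items()
}

if __name__ == "__main__":
    # List databases from least to most completely integrated.
    for name, frac in sorted(coverage.items(), key=lambda item: item[1]):
        print(f"{name:18s} {frac:6.1%}")
```

On these figures, the curated signature databases are almost fully integrated (PRINTS at 98%, TIGRFAMs at about 99.7%), whereas only about 2% of the PANTHER signatures had been incorporated.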

#### **3.12 UniProt**

The year 2004 marked a turning point for the way in which protein sequence data were to be collected and disseminated globally. The PIR-PSD, which had evolved from Dayhoff's *Atlas*, had been available online since 1986; Swiss-Prot, which originally built on PIR data, also became available in 1986; and TrEMBL had been released in 1996. The ongoing maintenance of these disparate resources over so many years had posed major funding headaches. For PIR, some of the difficulties were mitigated, at least in the early years, by charging for copies of their databases and for online access to their software; later, the international collaboration with MIPS and JIPID, supported by NSF and European grants, no doubt helped to sustain the resource.

Swiss-Prot, meanwhile, had had a rocky ride and had had to be rescued from the brink of closure, following a procedural 'catch-22' catastrophe: viewing Swiss-Prot as an international resource, the Swiss government declined to provide further support unless the database also gained a financial injection from a European Union (EU) grant; a joint proposal with the EBI for an EU infrastructure grant, however, was declined because Swiss-Prot was not being supported by the Swiss government! In May 1996, with only 2 months of salary remaining for the Swiss-Prot entourage, an Internet appeal was launched announcing the forthcoming closure, on 30 June, of Swiss-Prot and its associated databases and software tools, owing to lack of funding. This appeal stimulated a storm of protest on the Internet, in high-profile academic journals, and in the media. Such was the barrage that the Swiss government stepped in, offering interim funding until the end of the year. In the negotiations that followed, the need to create a stable vehicle for long-term funding both of Swiss-Prot and of the Swiss EMBnet Node was discussed, and resulted in the drafting of outline plans to establish a Swiss Institute of Bioinformatics (Bairoch, 2000).

Against this background, in 2002, with multinational funding from NIH, the NSF, the Swiss federal government and the EU, Swiss-Prot, TrEMBL and the PIR-PSD joined forces as the UniProt consortium. In forming the consortium, the idea was to build on the partners' many years of foundational work, by providing a stable, high-quality, unified database. This would serve as the world's most comprehensive protein sequence knowledgebase, replete with accurate annotations and extensive cross-references, and accompanied by freely available, easy-to-use querying interfaces.

Under its hood, UniProt initially consisted of 3 separate database layers: the UniProt Archive (UniParc), to provide a complete, non-redundant collection of all publicly available protein sequence data; the UniProt Knowledgebase (UniProt), consisting of Swiss-Prot and TrEMBL, to act as the central database of protein sequences, with accurate, consistent and rich sequence and functional annotation; and the UniProt NREF databases (UniRef), to provide non-redundant subsets of the UniProt Knowledgebase, for efficient database searching (Apweiler *et al*., 2004). By 2011, UniProt also included a Metagenomic and Environmental Sequence component, termed UniMES (The UniProt Consortium, 2011); by this time, UniProtKB:Swiss-Prot contained 525,207 entries, accompanied by UniProtKB:TrEMBL, with a staggering 13,499,622 entries.

#### **3.13 The Swiss Institute of Bioinformatics (SIB)**

Like the EBI, the need for which largely grew out of high-level negotiations to try to put the EMBL data library on a more stable financial footing, the Swiss Institute of Bioinformatics (SIB) grew out of similar high-level negotiations to establish long-term financial support for Swiss-Prot. At the time of the Swiss-Prot funding crisis, Bairoch was aware that the Swiss scientific authorities had been emphasising the need to establish centres of excellence in economically important, interdisciplinary areas that would be crucial for 'tomorrow's society'. Seizing upon this, together with Ron Appel, Philipp Bucher, Victor Jongeneel and Manuel Peitsch, he submitted a proposal to create a Swiss bioinformatics institute, whose goals were to:


• sustain high-quality bioinformatics research;

• promote the development of bioinformatics software tools and databases;

• collaborate with academic partners to provide a curriculum to train research scientists in the field of bioinformatics; and

• offer services to its user community through the Swiss Node of EMBnet.

After a lengthy period of consultation, the SIB was finally created as a non-profit foundation in March 1998, with Victor Jongeneel as the first director. The founders then went on to win funds for some of the SIB's activities from the Swiss Federal government: by law, only 50% of the Institute's work could be funded in this way – the rest had to come from other sources, preferably by commercial exploitation of its research.

Partly in response to this stipulation, but partly also because it had become clear that Swiss-Prot could not be reliably sustained solely with public funding, the decision was made to ask commercial users of the database to pay a licence fee. Various models for achieving this were tested; in the end, in 1997, Bairoch, Appel and Denis Hochstrasser decided that the best way forward was to set up a new company – this was Geneva Bioinformatics SA (GeneBio). Up to three quarters of the revenues now generated by GeneBio from sales of annual database and software licences are returned to SIB, thereby helping to bolster the work of the Swiss-Prot groups (Bairoch, 2000).

Today, the SIB leads and coordinates the field of bioinformatics in Switzerland: its vision, to help shape the future of the life sciences through excellence in bioinformatics services, research and education; its mission, to provide world-class core bioinformatics resources to both national and international research communities in fields spanning genomics, proteomics and systems biology. Many of its core activities, including maintenance of databases such as UniProt and InterPro, are carried out in close collaboration with the EBI.

#### **3.14 The European Nucleotide Archive (ENA)**

Meanwhile, with the advent of large-scale sequencing projects and the dawn of Next Generation Sequencing (NGS) technologies, a mounting tsunami of nucleotide sequence data was gathering force across the globe; a number of important developments were to take place in its wake. By 2003, it was clear that there was a need to provide access not only to the most recent versions of sequences, but also to their historical artifacts – following the rush to patent genetic information, issues of priority became increasingly important, and it was vital to be able to see sequence entries exactly as they appeared in the past. Accordingly, the EBI established a Sequence Version Archive (Leinonen *et al*., 2003), to store both current and earlier versions of entries in the EMBL data library (which, by then, had been dubbed EMBL-Bank).

By September 2004, EMBL-Bank had grown prodigiously, with more than 42 million entries (Kanz *et al*., 2005) and, by 2007, was accompanied by the Ensembl Trace Archive (ETA) – the ETA was set up to provide a permanent archive for single-pass DNA sequencing reads (from whole-genome shotgun, EST and other large-scale sequencing projects) and associated traces and quality values. Together, EMBL-Bank and the ETA became known as ENA, the European Nucleotide Archive, Europe's primary nucleotide-sequence repository (Cochrane *et al*., 2008). Throughout 2007, ENA continued to grow in terms both of its volume and of the nature of data it contained such that, by October of that year, it included more than 1.7 billion records (comprising ~1.7 trillion (1.7 × 10¹²) base pairs of sequence) (Cochrane *et al*., 2008). By 2010, ENA had embraced a third component – the Sequence Read Archive (SRA) – and now contained ~500 billion raw and assembled sequences, comprising 50 × 10¹² base pairs; this is a phenomenal growth in just 3 years! During this period, NGS reads held in the SRA had become the largest and fastest growing source of new data, and accounted for ~95% of all base pairs made available by ENA (Leinonen *et al.*, 2011). Contributing to this mass of data were the completed genomes of more than 1,400 cellular organisms, and 3,000 viruses and phages.

But such enormous progress comes at a cost, challenging current IT infrastructures to the limit. Some of the oldest data in ENA date back to the early '80s, with the inception of the EMBL data library. As an aside, it is somewhat ironic that, even in those days, there were distribution headaches. Bairoch, for example, relates how difficult it was to transfer version 2 of the EMBL data library from computer tape to a mainframe computer and thence to his microcomputer, because the mainframe had no communication protocol to talk to a microcomputer – he therefore had to spend the night transferring the data, screen by screen, using a 300 baud acoustic modem (Bairoch, 2000). To put this in perspective, this version of EMBL-Bank contained 811 nucleotide sequences (with more than 1 million base pairs) – this is about the same amount of data that currently enters ENA every 2 seconds.
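
The scale of this contrast can be checked with some back-of-envelope arithmetic. The sketch below is only illustrative: it assumes one character per base, 10 bits per character on the serial line (8 data bits plus start and stop bits), one bit per symbol, and no annotation or protocol overhead, none of which is stated in the account above. The function names are hypothetical.

```python
def modem_transfer_hours(n_chars, baud=300, bits_per_char=10):
    """Hours needed to send n_chars over a serial line.

    Assumes one bit per symbol (so baud equals bits per second) and
    10 bits per character (8 data bits plus start and stop bits).
    """
    chars_per_second = baud / bits_per_char
    return n_chars / chars_per_second / 3600


def ingest_rate_bp_per_second(base_pairs, seconds):
    """Average intake rate when base_pairs arrive every `seconds` seconds."""
    return base_pairs / seconds


# "more than 1 million base pairs": used here as a lower bound
RELEASE_2_BASE_PAIRS = 1_000_000

if __name__ == "__main__":
    hours = modem_transfer_hours(RELEASE_2_BASE_PAIRS)
    rate = ingest_rate_bp_per_second(RELEASE_2_BASE_PAIRS, 2)
    print(f"300 baud transfer of 1 Mbp: about {hours:.1f} hours")
    print(f"ENA intake at the quoted rate: {rate:,.0f} bp per second")
```

Under these assumptions the sequence characters alone would occupy a 300 baud line for roughly nine hours, which is consistent with a night spent transferring the data, whereas the quoted ENA intake corresponds to about half a million base pairs per second.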

Today, ENA holds more than 20 terabases of nucleotide sequence data, which, combined with its annotation information, and so on, occupies more than 230 terabytes of disk space. The infrastructure required to store, maintain and service such a vast archive, and the cost of doing so, is beyond anything that either the originators of the first databases, or the developers of the new sequencing technologies could have conceived. Interestingly, in February 2011, the NCBI announced that it would be discontinuing its Sequence Read and Trace Archives for high-throughput sequence data, owing to budget constraints. The closure of the databases is to be phased, and completed within 12 months. The NCBI is still committed to supporting and developing information resources for biological data derived from NGS technologies (genotypes, variations, assemblies, gene expression data, and so on), but will need to find new funding strategies for access to and storage of the existing data.

#### **3.15 ELIXIR**

The opportunities NGS technologies present for advancing life science research (especially in areas such as healthcare, food security, energy diversification and environmental protection) are incredibly exciting; but these opportunities will be lost if they are not underpinned by a robust, effective and sustainable information infrastructure. The best estimates today suggest that, by 2020, NGS technologies will be producing data at up to a million times the current rate. Development of an appropriate infrastructure to manage the data deluge is therefore paramount.

The ELIXIR project is the realisation of this urgent need. Recognising that the task is of such magnitude that it cannot be tackled by a single organisation, it is a call to arms for international cooperation in building a pan-European infrastructure to help extract the maximum value from the investments that have already been made, and from those that will be made in future, in this area. The plan is for the ELIXIR infrastructure to be distributed across a variety of 'Nodes' hosted by centres of excellence across Europe, and for each of these to be connected to the EBI central 'Hub'. It is expected that some of the Nodes will act as national coordination centres to expedite interactions both with the Hub and with local funders; Nodes that perform similar functions will be expected to collaborate to form ELIXIR service networks, providing data or compute resources, or training, according to their speciality, as depicted in Figure 6.

Fig. 6. Proposed topology of the ELIXIR Hub and Nodes. In an arrangement reminiscent of EMBnet 23 years before it, some of the Nodes are expected to serve as national bioinformatics centres; others, with similar functions, will collaborate as service networks, for example to provide data or compute resources, or training.

Initially, the number of Nodes is expected to be small, growing to ~20 during the first 5 years of the initiative (during the preparatory phase, more than 50 institutions submitted expressions of interest in becoming ELIXIR Nodes), at a cost of several hundred million euro. To garner support for the business case, governments of the European Member States have been invited to sign a non-binding Memorandum of Understanding (MoU) in order to initiate negotiations to construct ELIXIR; the MoU will become effective once 5 countries and the EMBL have signed. Europe's databases (estimated to number around 500), especially those hosted by the EBI, will become the foundation of the new ELIXIR infrastructure as part of its mission, *"to construct and operate a sustainable infrastructure for biological information in Europe to support life science research and its translation to medicine and the environment, the bio-industries and society"* (Thornton, 2011).

#### **4. The development and spread of tools to keep pace with the new technologies**

With the sequencing of biopolymers and subsequent organisation of the growing mass of biosequences in databases, visual comparison techniques became tedious, not least because "*the determination of the significance of a given result usually is left to intuitive rationalization*" (Needleman & Wunsch 1971). To reduce reliance on manual (often subjective) interpretation and put sequence analysis on a more systematic footing, algorithms to analyse and compare sequences began to emerge. As early as 1966, Fitch proposed computational analysis to study evolutionary homology, using mutation values to indicate how many nucleotides in the genomic code must change in order to introduce change (mutation) at the amino acid level. In 1970, Needleman and Wunsch described the first algorithm to quantify the similarity between two protein sequences (so-called global alignment) – today, this algorithm is still used to identify similarities between two sequences and infer likely ancestry. Years later, Smith and Waterman (1981) presented an algorithm to find local similarities: "*to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity*". In time, more efficient methods were required to compare newly sequenced proteins against the rapidly expanding databases. FASTP was the first 'fast' algorithm (Lipman & Pearson 1985).

Search algorithms like this afforded many of the earliest and most exciting discoveries attributable to 'bioinformatics'. For example, one of the first observations that gave a clue to the molecular mechanism of neoplastic transformation was provided by the finding of a near identity in amino acid sequence between the platelet-derived growth factor (PDGF) B-chain and a region in the transforming protein, p28sis, of simian sarcoma virus (SSV), an agent that causes sarcomas and gliomas in experimental animals (Waterfield *et al*., 1983). This finding arose from computer searches using the Wilbur and Lipman algorithm on the NEWAT protein database created by Doolittle *et al*., which was the database available at the time (1983). This first success story, where simple sequence comparison led to the completely new concept of gene-oncogene, showed the medical community the enormous potential of computer techniques for sequence comparison and analysis.
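
To make the distinction between global and local alignment concrete, the following is a minimal sketch of the two scoring recurrences in their modern textbook dynamic-programming form. It is not the original 1970 or 1981 formulation, nor the heuristic used by FASTP or by Wilbur and Lipman; the scoring values (match +1, mismatch -1, linear gap -1) and the function names are assumptions chosen purely for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best end-to-end alignment score (Needleman-Wunsch style)."""
    # Row 0: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a aligned against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score over all pairs of segments (Smith-Waterman style)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at zero lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(local_alignment_score("XXXGATTACAYYY", "ZZGATTACAZZ"))
```

The only structural difference between the two functions is the zero floor and the running maximum in the local version. These are what allow it to report the best-matching pair of segments regardless of how dissimilar the flanking regions are. Both fill a full table of size len(a) × len(b), which is why faster heuristic methods became necessary as the databases grew.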

In a similar way, once DNA sequencing had been revolutionised by Sanger and by subsequent improvements of his technique, and nucleotide sequences were accumulating in data repositories like the EMBL data library and GenBank, algorithms to search these databases became a necessity. FASTA was a more sensitive modification of FASTP, and had the advantage of being able to search nucleotide sequence databases with either a nucleic acid or protein sequence by translating the DNA database during the search (Pearson & Lipman 1988). Later, somewhat overshadowing these developments, came the Basic Local Alignment Search Tool, BLAST (Altschul *et al*., 1990); this offered an extended tool-set for performing any kind of sequence database search, and is still the most widely used tool in bioinformatics. The success of BLAST spawned a number of more specialised sequence search methods, such as PSI-BLAST, PHI-BLAST, BLAT, and so on, and BLAST is itself still in continuous development (Camacho *et al*., 2009).
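
The idea of searching a nucleotide database with a protein query by translating the DNA on the fly can be illustrated with a six-frame translation. This is a conceptual sketch only, not FASTA's own implementation: it assumes the standard genetic code and unambiguous upper- or lower-case A/C/G/T input, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to their protein translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Each translated frame could then be compared with the protein query using any protein-level comparison method, so that a coding region is found whichever strand and reading frame it occupies.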

Aside from these very popular database search tools, many other sequence, annotation and expression analysis tools were developed for a broad range of applications: *e.g*., for pattern recognition, for protein and RNA secondary structure prediction, for microarray data analysis, for proteome and genome annotation, and so on. In the early '90s, building on the existing University of Wisconsin Genetics Computer Group (UWGCG, or simply GCG) package, several such algorithms were collected at the EMBL and packaged as 'GCGEMBL Utilities', later known as 'Extended GCG'. However, GCG was then commercialised and its distribution policy changed. Reacting against the new policies, in 1998 several software developers founded EMBOSS, the European Molecular Biology Open Software Suite. Their aim was to develop new sequence analysis tools, by "*replacing popular but obsolete EGCG applications*," and integrating with SRS, ACEDB, and a range of other publicly available software interfaces and tools. The idea was to encourage other developers to use the EMBOSS software libraries, and especially to harness the expertise and potential additional manpower at EMBnet Nodes (*e.g*., in Germany, Italy, France, The Netherlands, Austria, Russia, Switzerland, Israel, Spain, Norway, and so on). Target users of the resource included those at the Sanger Centre, those served by EMBnet, and those in academic and pharmaceutical settings. Funded by the Wellcome Trust for 3 years, the project was a collaborative effort of the Sanger Centre, EMBnet UK (SEQNET), the EBI and CNRS Montpellier.

With the pivotal support of EMBnet, EMBOSS quickly became a comprehensive bioinformatics resource (Rice *et al*., 2000). There are now several incarnations of the suite with different GUIs, including the EMBOSS team's Java-based interface, jEMBOSS; the Belgian and Argentinian EMBnet Nodes' wEMBOSS; and the EMBOSS GUI from the National Research Council of Canada. Today, EMBOSS is still being developed, adopting new specific file formats and algorithms in order to embrace the world of NGS data analysis.

Another important development driven by the EMBL was the Sequence Retrieval System (SRS), an information indexing system applied to flat-file databases, such as the EMBL data library, Swiss-Prot and PROSITE (Etzold and Argos, 1993). SRS became the most widely used data-retrieval system for flat-file systems, with an extended GUI to extract not only sequences but all related information, via an exhaustive sequence query and export system (Zdobnov *et al.*, 2002).

Europe-wide, there are vast numbers of other specialised biological data-analysis, data-visualisation and data-retrieval tools available: many of these are provided by the EBI; others by the SIB's ExPASy Proteomics Server; some are offered via the National and Specialist Nodes of EMBnet; others are available as Web services collected in the BioCatalogue (Bhagat *et al*., 2010). The BioCatalogue evolved from the EMBRACE registry (Pettifer *et al.*, 2010), one of the end products of the EMBRACE project (European Model for Bioinformatics Research and Community Education) – this was a 5-year FP7 Network of Excellence, whose main goal was to orchestrate highly integrated access to a broad range of bio-molecular data and software packages. Achieving this required standardised access to tools and databases; to this end, the decision was to use Web services. In consequence, many of the project partners adapted their tools and database-access protocols, and logged their Web services in a common registry. At the end of EMBRACE, in 2010, the registry was handed over to the BioCatalogue, which is now being maintained in collaboration with myExperiment, myGrid, seekda and BioMoby, and hosts 2,053 services from 147 service providers and 505 members.

#### **5. The central place of bioinformatics in modern biology**

Clearly, we have travelled a very long way since Jensen and Evans positioned a single amino acid (a terminal phenylalanine) in insulin (Jensen & Evans, 1935; Sanger, 1945; Sanger, 1988) and Sanger elucidated its complete sequence, the first of any protein (recall Table 2). In a story spanning something like 70 years, bioinformatics has given us the first 'complete' catalogues of DNA and protein sequences, including the genomes and proteomes of organisms across the entire Tree of Life; it has furnished the requisite software to help analyse biological data on an unprecedented scale; it has hence yielded the possibilities to understand more about evolutionary processes in general, our place in the Tree of Life in particular, and ultimately, a great deal more about health, disease and disease processes. Figure 7 offers a summary of some of the most important landmarks that have charted the development of bioinformatics in Europe and helped to place it at the heart of 21st century biology.

Fig. 7. Historical milestones that have placed bioinformatics at the heart of 21st century biology, from the determination of the first amino acid sequence, to the development of an archive of 500 billion nucleotide sequences. Some major milestones are denoted in black; key computing innovations are indicated in purple; example databases are indicated in blue; organisations and institutions in green; numbers of sequences in red, the growing mass of which is highlighted both in the red curve and the background gradient – the impact of genomic sequencing in the mid '90s is clear.

#### **6. Conclusion – European bioinformatics goes global**

The history of bioinformatics has clearly been a convoluted interplay between events in Europe, the USA, Japan and across the globe. Here, we have attempted to recount the story primarily from a European perspective as it unfolded largely from the point of view of sequence data: in terms of the technological innovations that spawned their extraordinary growth and dissemination, of the databases that grew up to manage and analyse them, and of the institutions and infrastructural initiatives that arose to try to give those databases some measure of financial stability. In so doing, we accept that we've only scratched the surface, and we regret any shortcomings that may have arisen from the necessary omission of so many of the other important details and perspectives.

Clearly, the evolution and impact of bioinformatics reaches far beyond Europe, and there are now many organisations world-wide with missions to bring life science data to their local communities, to make freely available easy-to-use software tools with which to analyse the data, and to provide training, both to users of bioinformatics databases and software, and to new generations of bioinformatics trainers (Schneider *et al.*, 2010). In this context, EMBnet, for example, which began life as the European Molecular Biology Network, is now a global bioinformatics network, maintaining fruitful cooperations with the Iberoamerican (SoIBio) and Asia Pacific (APBioNet) bioinformatics networks, as well as with the USA-based International Society for Computational Biology (ISCB); it has also established close ties with the African Society for Bioinformatics and Computational Biology (ASBCB), and synergies with other relevant groups in northern Africa are now developing. Interestingly, 33 years ago, Joshua Lederberg observed that, "*the claim of science to universal validity is supportable only by virtue of a strenuous commitment to global communication*" (Lederberg, 1978). Today, this is a commitment that EMBnet vigorously pursues; in a similar spirit, we can be quite sure that the contribution of Europe to the future evolution of bioinformatics will continue in a global arena.

#### **7. Acknowledgement**

We would like to thank Vicky Schneider for providing the inspiration (and the title) for this chapter.

#### **8. References**

Adams, M.J.; Blundell, T.L., Dodson, E.J., Dodson, G.G., Vijayan, M., Baker, E.N., Harding, M.M., Hodgkin, D.C., Rimmer, B. & Sheat, S. (1969) Structure of Rhombohedral 2 Zinc Insulin Crystals. *Nature*, 224, 491-495.

Adams, M.D.; Celniker, S.E., Holt, R.A., Evans, C.A., Gocayne, J.D. *et al*. (2000) The genome sequence of *Drosophila melanogaster*. *Science*, 287, 2185-2195.

Akrigg, D.A.; Attwood, T.K., Bleasby, A.J., Findlay, J.B.C., North, A.C.T., Parry-Smith, D.J., Perkins, D.N. & Wootton, J.C. (1992) SERPENT - An information storage and analysis resource for protein sequences. *CABIOS*, 8(3), 295-296.

Allen, F.H.; Davies, J.E., Galloy, J.J., Johnson, O., Kennard, O., Macrae, C.F., Mitchell, E.M., Mitchell, G.F., Smith, J.M. & Watson, D.G. (1991) The Development of Versions 3 and 4 of the Cambridge Structural Database System. *J. Chem. Inf. Comput. Sci.*, 31, 187-204.

Altschul, S.F.; Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) Basic local alignment search tool. *J. Mol. Biol.*, 215, 403-410.

Anderson, S.; Bankier, A.T., Barrell, B.G., de Bruijn, M.H., Coulson, A.R., Drouin, J., Eperon, I.C., Nierlich, D.P., Roe, B.A., Sanger, F., Schreier, P.H., Smith, A.J., Staden, R. & Young, I.G. (1981) Sequence and organization of the human mitochondrial genome. *Nature*, 290, 457-465.



Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Hermjakob, H., Hulo, N., Jonassen, I., Kahn, D., Kanapin, A., Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N.J., Oinn, T.M., Pagni, M., Servant, F., Sigrist, C.J. & Zdobnov, E.M. (2001) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. *Nucleic Acids Res*., 29(1), 37-40.

Apweiler, R.; Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D.A., O'Donovan, C., Redaschi, N. & Yeh, L.S. (2004) UniProt: the Universal Protein knowledgebase. *Nucleic Acids Res*., 32(Database issue), D115-119.

Ashburner, M. (1996) Won for all: how the Drosophila genome was sequenced. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, USA.

Ashburner, M. & Drysdale, R. (1994) FlyBase – the *Drosophila* genetic database. *Development*, 120(7), 2077-2079.

Bairoch, A. (1982) Suggestion to research groups working on protein and peptide sequence. *Biochem.J*., 203(2), 527-528.

Bairoch A. (1991) PROSITE: a dictionary of sites and patterns in proteins. *Nucleic Acids Res*., 19 Suppl., 2241-2245.

Bairoch, A. (2000) Serendipity in bioinformatics, the tribulations of a Swiss bioinformatician through exciting times! *Bioinformatics*, 16(1), 48-64.

Bairoch, A. & Apweiler, R. (1996) The SWISS-PROT protein sequence data bank and its new supplement TREMBL. *Nucleic Acids Res.*, 24(1), 21-25.

Bairoch, A. & Boeckmann, B. (1991) The SWISS-PROT protein sequence data bank. *Nucleic Acids Res*., 19 Suppl., 2247-2249.

Bairoch, A. & Bucher, P. (1994) PROSITE: recent developments. *Nucleic Acids Res.*, 22(17), 3583-3589.

Barker, W.C.; George, D.G., Mewes, H.W. & Tsugita, A. (1992) The PIR-International Protein Sequence Database. *Nucleic Acids Res*., 20 Suppl., 2023-2026.

Benson, D.; Boguski, M., Lipman, D.J. & Ostell, J. (1990) The National Center for Biotechnology Information. *Genomics*, 6, 389-391.

Benson, D.; Lipman, D.J. & Ostell, J. (1993) GenBank. *Nucleic Acids Res*., 21(13), 2963-2965.

Berman, H. (2008) The Protein Data Bank: A historical perspective. *Foundations of Crystallography*, 64(1), 88-95.

Berman, H.; Henrick, K. & Nakamura, H. (2003) Announcing the worldwide Protein Data Bank. *Nature Structural Biology,* 10, 980.

Berman, H.M.; Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. & Bourne, P.E. (2000) The Protein Data Bank. *Nucleic Acids Res.*, 28(1), 235-242.

Bernstein, F.C.; Koetzle, T.F., Williams, G.J., Meyer, E.F. Jr, Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977) The Protein Data Bank. A computer-based archival file for macromolecular structures. *J.Mol.Biol*., 112(3), 535-542. Reprinted in *Eur. J. Biochem*., 80(2), 319-24 (1977); and *Archives of Biochemistry and Biophysics*, 185(2), 584-591 (1978).

Bhagat, J., Tanoh, F., Nzuobontane, E., Laurent, T., Orlowski, J., Roos, M., Wolstencroft, K., Aleksejevs, S., Stevens, R., Pettifer, S., Lopez, R. & Goble, C.A. (2010) BioCatalogue: a universal catalogue of web services for the life sciences. *Nucleic Acids Res*., 38, W689-694.

Boutselakis, H.; Dimitropoulos, D., Fillon, J., Golovin, A., Henrick, K., Hussain, A., Ionides, J., John, M., Keller, P. A., Krissinel, E., McNeil, P., Naim, A., Newman, R., Oldfield, T., Pineda, J., Rachedi, A., Copeland, J., Sitnov, A., Sobhany, S., Suarez-Uruena, A., Swaminathan, J., Tagari, M., Tate, J., Tromm, S., Velankar, S. & Vranken, W. (2003) E-MSD: the European Bioinformatics Institute Macromolecular Structure Database. *Nucleic Acids Res*., 31(1), 458-462.

Brown, H.; Sanger, F. & Kitai, R. (1955) The structure of pig and sheep insulins. *Biochemical Journal,* 60(4), 556–565.

Burley, S.K.; Almo, S.C., Bonanno, J.B., Capel, M., Chance, M.R., Gaasterland, T., Lin, D., Sali, A., Studier, F.W. & Swaminathan, S. (1999) Structural genomics: beyond the human genome project. *Nat. Genet.*, 23(2), 151-157.

Butler, D. (1999) Life science facilities in crisis as Brussels switches off funding. *Nature*, 402, 3-4.

*C. elegans* Sequencing Consortium. (1998) Genome sequence of the nematode *C. elegans*: a platform for investigating biology. *Science*, 282, 2012-2018.

Camacho, C.; Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K. & Madden, T.L. (2009) BLAST+: architecture and applications. *BMC Bioinformatics*, 10, 421.

Cherry, J.M.; Adler, C., Ball, C., Chervitz, S.A., Dwight, S.S., Hester, E.T., Jia, Y., Juvik, G., Roe, T., Schroeder, M., Weng, S. & Botstein, D. (1998) SGD: Saccharomyces Genome Database. *Nucleic Acids Res*., 26(1), 73-79.

Cochrane, G.; Akhtar, R., Aldebert, P., Althorpe, N., Baldwin, A., Bates, K., Bhattacharyya, S., Bonfield, J., Bower, L., Browne, P., Castro, M., Cox, T., Demiralp, F., Eberhardt, R., Faruque, N., Hoad, G., Jang, M., Kulikova, T., Labarga, A., Leinonen, R., Leonard, S., Lin, Q., Lopez, R., Lorenc, D., McWilliam, H., Mukherjee, G., Nardone, F., Plaister, S., Robinson, S., Sobhany, S., Vaughan, R., Wu, D., Zhu, W., Apweiler, R., Hubbard, T. & Birney, E. (2008) Priorities for nucleotide trace, sequence and annotation data capture at the Ensembl Trace Archive and the EMBL Nucleotide Sequence Database. *Nucleic Acids Res*., 36(Database issue), D5-D12.

Dayhoff, M.O.; Eck, R.V., Chang, M.A. & Sochard, M.R. (Eds.) (1965) *Atlas of protein sequence and structure*. National Biomedical Research Foundation, Silver Spring, Maryland, USA.

Dayhoff, M.O. to Berkley, C. (1967) Margaret O. Dayhoff Papers, Archives of the National Biomedical Research Foundation, Washington, D.C., USA.

Dayhoff, M.O.; Schwartz, R.M., Chen, H.R., Barker, W.C., Hunt, L.T. & Orcutt, B.C. (1981) a) Nucleic Acid Sequence Database. *DNA*, 1, 51-58; b) Dayhoff, M.O., Schwartz, R.M., Chen, H.R., Hunt, L.T., Barker, W.C. & Orcutt, B.C. (1981) Data Bank. *Nature*, 290, 8.

Dickson, D. & Abbott, A. (1993) Cambridge and Heidelberg compete for new European gene database. *Nature*, 361, 383.

Dodson, G. (2005) Fred Sanger: sequencing pioneer. *Biochem. J*., doi:10.1042/BJ2005c013.

Doelz, R. (1994) Biocomputing on a Server Network. *EMBnet.news*, 1(2), 6-8.

Eck, R.V. & Dayhoff, M.O. (1966) *Atlas of Protein Sequence and Structure*. National Biomedical Research Foundation, Silver Spring, Maryland, USA.

EMBL (1992) The European Bioinformatics Institute (EBI): A Proposal


Doolittle, R.F. (1986) Of Urfs and Orfs: a primer on how to analyze derived amino acid sequences. University Science Books, 20 Edgehill Road, Mill Valley, CA 94941, USA. ISBN 0-935702-54-7

Eeckman, F.H. & Durbin, R. (1995) ACeDB and macace. *Methods Cell Biol*., 48, 583-605.

Etzold, T. & Argos, P. (1993) SRS – an indexing and retrieval tool for flat file data libraries. *Comput. Appl. Biosci.*, 9, 49-57.

Fitch, W.M. (1966) An improved method of testing for evolutionary homology. *J.Mol.Biol.*, 16, 9-16.

Fleischmann, R.D.; Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F. *et al*. (1995) Whole-genome random sequencing and assembly of *Haemophilus influenzae Rd*. *Science*, 269, 496-512.

Franklin, R.E. & Gosling, R.G. (1953) a) The structure of sodium thymonucleate fibres. I. The influence of water content. *Acta Cryst.*, 6, 673-677; b) The structure of sodium thymonucleate fibres. II. The cylindrically symmetrical Patterson function. *Ibid.*, 678-685; c) Molecular configuration in sodium thymonucleate. *Nature*, 171, 740-741.

Fraser, C.M.; Gocayne, J.D., White, O., Adams, M.D., Clayton, R.A. *et al*. (1995) The minimal gene complement of *Mycoplasma genitalium*. *Science*, 270, 397-403.

George, D.G.; Barker, W.C. & Hunt, L.T. (1986) The protein identification resource (PIR). *Nucleic Acids Res.,* 14(1), 11-15.

George, D.G.; Dodson, R.J., Garavelli, J.S., Haft, D.H., Hunt, L.T., Marzec, C.R., Orcutt, B.C., Sidman, K.E., Srinivasarao, G.Y., Yeh, L.S., Arminski, L.M., Ledley, R.S., Tsugita, A. & Barker, W.C. (1997) The Protein Information Resource (PIR) and the PIR-International Protein Sequence Database. *Nucleic Acids Res*., 25(1), 24-28.

Gingeras, T.R. & Roberts, R.J. (1980) Steps towards computer analysis of nucleotide sequences. *Science*, 209, 1322-1328.

Goffeau, A.; Barrell, B.G., Bussey, H., Davis, R.W., Dujon, B. *et al*. (1996) Life with 6000 genes. *Science*, 274, 546-567.

Goodner, B.; Hinkle, G., Gattung, S., Miller, N., Blanchard, M. *et al*. (2001). Genome Sequence of the Plant Pathogen and Biotechnology Agent *Agrobacterium tumefaciens* C58. *Science*, 294, 2323-2328.

Hamm, G.H. & Cameron, G.N. (1986) The EMBL data library. *Nucleic Acids Res*., 14(1), 5–9.

Harvey, M. & McMeekin, A. (2004) Public-private collaborations and the race to sequence *Agrobacterium tumefaciens*. *Nat. Biotechnol.*, 22(7), 807-810.

Henikoff, S. & Henikoff, J.G. (1991) Automated assembly of protein blocks for database searching. *Nucleic Acids Res*., 19(23), 6565-6572.

Hirs, C.H.W.; Moore, S. & Stein, W.H. (1960) The Sequence of the Amino Acid Residues in Performic Acid-oxidized Ribonuclease. *J.Biol.Chem*., 235, 633-647.

Hobohm, U.; Scharf, M., Schneider, R. & Sander, C. (1992) Selection of representative protein data sets. *Protein Sci.*, 1(3), 409-417.

Hogeweg, P. (1978) Simulating the growth of cellular forms. *Simulation*, 31, 90-96.

Hogeweg, P. & Hesper, B. (1978) Interactive instruction on population interactions. *Comput. Biol. Med.,* 8, 319-327.

Huala, E.; Dickerman, A.W., Garcia-Hernandez, M., Weems, D., Reiser, L., LaFond, F., Hanley, D., Kiphart, D., Zhuang, M., Huang, W., Mueller, L.A., Bhattacharyya, D., Bhaya, D., Sobral, B.W., Beavis, W., Meinke, D.W., Town, C.D., Somerville, C. & Rhee, S.Y. (2001) The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. *Nucleic Acids Res*., 29(1), 102-105.

Hubbard, T.; Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., Durbin, R., Eyras, E., Gilbert, J., Hammond, M., Huminiecki, L., Kasprzyk, A., Lehvaslaiho, H., Lijnzaad, P., Melsopp, C., Mongin, E., Pettett, R., Pocock, M., Potter, S., Rust, A., Schmidt, E., Searle, S., Slater, G., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Stupka, E., Ureta-Vidal, A., Vastrik, I. & Clamp, M. (2002) The Ensembl genome database project. *Nucleic Acids Res.*, 30(1), 38-41.

Hunter, D.J. (2006) Genomics and proteomics in epidemiology: treasure trove or "high-tech stamp collecting"? *Epidemiology*, 17(5), 487-489.

Hunter, S.; Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A. *et al.* (2009) InterPro: the integrative protein signature database. *Nucleic Acids Res*., 37(Database issue), D211-D215.

International Human Genome Sequencing Consortium. (2004) Finishing the euchromatic sequence of the human genome. *Nature*, 431, 931-945.

Jensen, H. & Evans Jr., E.A. (1935) Studies on crystalline insulin. XVIII. The nature of the free amino groups in insulin and the isolation of phenylalanine and proline from crystalline insulin. *J.Biol.Chem.*, 108, 1-12.

Kanehisa, M.; Fickett, J.W. & Goad, W.B. (1984) A relational database system for the maintenance and verification of the Los Alamos sequence library. *Nucleic Acids Res.*, 12(1), 149-158.

Kanz, C.; Aldebert, P., Althorpe, N., Baker, W., Baldwin, A., Bates, K., Browne, P., van den Broek, A., Castro, M., Cochrane, G., Duggan, K., Eberhardt, R., Faruque, N., Gamble, J., Diez, F.G., Harte, N., Kulikova, T., Lin, Q., Lombard, V., Lopez, R., Mancuso, R., McHale, M., Nardone, F., Silventoinen, V., Sobhany, S., Stoehr, P., Tuli, M.A., Tzouvara, K., Vaughan, R., Wu, D., Zhu, W. & Apweiler, R. (2005) The EMBL Nucleotide Sequence Database. *Nucleic Acids Res*., 33(Database issue), D29-D33.

Kendrew, J.C.; Bodo, G., Dintzis, H.M., Parrish, R. G., Wyckoff, H. & Phillips, D. C. (1958) A three-dimensional model of the myoglobin molecule obtained by x-ray analysis. *Nature*, 181, 662-666.

Kennard, O.; Watson, D. G. & Town, W. G. (1972) Cambridge Crystallographic Data Centre. I. Bibliographic File. *J. Chem. Doc*., 12(1), 14-19.

Kennard, O. (1997) From private data to public knowledge. In *The Impact of Electronic Publishing on the Academic Community*, an International Workshop organised by the Academia Europaea and the Wenner-Gren Foundation, Wenner-Gren Center, Stockholm, 16-20 April, 1997. Ian Butterworth, Ed. Published by Portland Press Ltd., London, UK. ISBN 1 85578 122 0

Kneale, G.G. & Kennard, O. (1984) The EMBL nucleotide sequence data library. *Biochem. Soc. Trans.*, 12, 1011–1014.

Kreppel, L.; Fey, P., Gaudet, P., Just, E., Kibbe, W.A., Chisholm, R.L. & Kimmel, A.R. (2004) dictyBase: a new Dictyostelium discoideum genome database. *Nucleic Acids Res*., 32(Database issue), D332-D333.

Lander, E.S.; Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C. *et al*. (2001) Initial sequencing and analysis of the human genome. *Nature*, 409, 860-921.


Lederberg, J. (1978) Digital Communications and the Conduct of Science; the New Literacy. *Proceedings of the IEEE*, 66, 1314-1319.

Leinonen, R.; Nardone, F., Oyewole, O., Redaschi, N. & Stoehr, P. (2003) The EMBL sequence version archive. *Bioinformatics*, 19(14), 1861-1862.

Leinonen, R.; Akhtar, R., Birney, E., Bower, L., Cerdeno-Tárraga, A., Cheng, Y., Cleland, I., Faruque, N., Goodgame, N., Gibson, R., Hoad, G., Jang, M., Pakseresht, N., Plaister, S., Radhakrishnan, R., Reddy, K., Sobhany, S., Ten Hoopen, P., Vaughan, R., Zalunin, V. & Cochrane, G. (2011) The European Nucleotide Archive. *Nucleic Acids Res*., 39(Database issue), D28-31.

Lipman, D.J. & Pearson, W.R. (1985) Rapid and sensitive protein similarity searches. *Science*, 227, 1435-1441.

Meyer, E.F. (1997) The first years of the Protein Data Bank. *Protein Science,* 6, 1591-1597.

Muirhead, H. & Perutz, M. (1963) Structure of hemoglobin. A three-dimensional fourier synthesis of reduced human hemoglobin at 5.5 Å resolution. *Nature,* 199, 633–38.

Nakamura, H.; Ito. N. & Kusunoki, M. (2002) Development of PDBj: Advanced database for protein structures. *Tanpakushitsu Kakusan Koso.,* 47(8 Suppl), 1097-1101.

Nature Editorial. (1999) Vacuum at the heart of Europe. *Nature*, 402, 1.

Needleman, S.B. & Wunsch, C.D. (1971) A general method applicable to the search for similarities in the amino acid sequence of two proteins. *J.Mol.Biol.*, 48, 443-453.

Pearson, W.R. & Lipman, D.J. (1988) Improved tools for biological sequence comparison. *Proc. Natl. Acad. Sci. USA*, 85, 2444-2448.

Pettifer, S., Ison, J., Kalas, M., Thorne, D., McDermott, P., Jonassen, I., Liaquat, A., Fernandez, J.M., Rodriguez, J.M., Partners, I., Pisano, D.G., Blanchet, C., Uludag, M., Rice, P., Bartaseviciute, E., Rapacki, K., Hekkelman, M., Sand, O., Stockinger, H., Clegg, A.B., Bongcam-Rudloff, E., Salzemann, J., Breton, V., Attwood, T.K., Cameron, G. & Vriend, G. (2010) The EMBRACE web service collection. *Nucleic Acids Res*., 38, Suppl. W683-688.

Philipson, L. (1992) Letter to EMBL Council Delegates, with annexes

Protein Data Bank (1971) *Nature New Biology*, 233, 223.

Protein Data Bank (1973) *Acta Crystallogr. sect. B,* 29, 1746.

Rice, P., Longden, I. & Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite. *Trends Genet.*, 16, 276-277.

Ryle, A.P.; Sanger, F., Smith, L.F. & Kitai, R. (1955) The disulphide bonds of insulin. *Biochem. J.*, 60(4), 541–556.

Sanger, F. (1945) The free amino groups of insulin. *Biochem. J*., 39, 507-515.

Sanger, F. & Tuppy, H. (1951) a) The amino-acid sequence in the phenylalanyl chain of insulin. 1. The identification of lower peptides from partial hydrolysates. *Biochem. J*., 49, 463–481; b) The amino-acid sequence in the phenylalanyl chain of insulin. 2. The investigation of peptides from enzymic hydrolysates. *Ibid.*, 481–490.

Sanger, F. & Thompson, E.O.P. (1953) a) The amino-acid sequence in the glycyl chain of insulin. 1. The identification of lower peptides from partial hydrolysates. *Biochem. J*., 53, 353–366; b) The amino-acid sequence in the glycyl chain of insulin. 2. The investigation of peptides from enzymic hydrolysates. *Ibid*., 366–374.

Sanger, F.; Thompson, E.O.P. & Kitai, R. (1955) The amide groups of insulin. *Biochem. J.*, 59(3), 509–518.

Sanger, F.; Coulson, A.R., Friedmann, T., Air, G.M., Barrell, B.G., Brown, N.L., Fiddes, J.C., Hutchison, C.A. 3rd, Slocombe, P.M. & Smith, M. (1978) The nucleotide sequence of bacteriophage phiX174. *J. Mol. Biol*., 125(2), 225-246.

Sanger, F.; Coulson, A.R., Hong, G.F., Hill, D.F. & Petersen, G.B. (1982) Nucleotide sequence of bacteriophage lambda DNA. *J.Mol.Biol.*, 162(4), 729-773.

Sanger, F. (1988) Sequences, sequences, and sequences. *Ann.Rev.Biochem*., 57, 1-28.

Schneider, M.V.; Watson, J., Attwood, T., Rother, K., Budd, A., McDowall, J., Via, A., Fernandes, P., Nyronen, T., Blicher, T., Jones, P., Blatter, M.C., De Las Rivas, J., Judge, D.P., van der Gool, W. & Brooksbank, C. (2010) Bioinformatics training: a review of challenges, actions and support requirements. *Brief Bioinform.*, 11(6), 544-551.

Sidman, K.E.; George, D.G., Barker, W.C. & Hunt, L.T. (1988) The protein identification resource (PIR). *Nucleic Acids Res*., 16(5), 1869-1871.

Smith, T.F. (1990) The history of the genetic sequence databases. *Genomics*, 6, 701-707.

Smith, T.F. & Waterman, M.S. (1981) Identification of common molecular subsequences. *J.Mol.Biol.*, 147, 195-197.

Smyth, D.G.; Stein, W.H. & Moore, S. (1963) The Sequence of Amino Acid Residues in Bovine Pancreatic Ribonuclease: Revisions and Confirmations. *J.Biol.Chem*., 238, 227-234.

Sonnhammer, E.L. & Kahn, D. (1994) Modular arrangement of proteins as inferred from analysis of homology. *Protein Sci.*, 3(3), 482-492.

Sonnhammer, E.L.; Eddy, S.R. & Durbin, R. (1997) Pfam: a comprehensive database of protein domain families based on seed alignments. *Proteins*, 28(3), 405-420.

Strasser, B. (2008) GenBank – Natural history in the 21st century? *Science*, 322, 537-538.

Thornton, J. (2011) European Life Sciences Infrastructure for Biological Information, ELIXIR Business Case. European Bioinformatics Institute, Hinxton, Cambridge, UK.

UniProt Consortium. (2011) Ongoing and future developments at the Universal Protein Resource. *Nucleic Acids Res*., 39(Database issue), D214-D219.

Velankar, S.; Best, C., Beuth, B., Boutselakis, C. H., Cobley, N., Sousa Da Silva, A.W., Dimitropoulos, D., Golovin, A., Hirshberg, M., John, M., Krissinel, E.B., Newman, R., Oldfield, T., Pajon, A., Penkett, C. J., Pineda-Castillo, J., Sahni, G., Sen, S., Slowley, R., Suarez-Uruena, A., Swaminathan, J., van Ginkel, G., Vranken, W. F., Henrick, K. & Kleywegt, G. J. (2010) PDBe: Protein Data Bank in Europe. *Nucleic Acids Res*., 38, D308–D317.

Venter, J.C.; Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J. *et al.* (2001) The sequence of the human genome. *Science*, 291, 1304-1351.

Waterfield, M.D.; Scrace, G.T., Whittle, N., Stroobant, P., Johnsson, A., Wasteson, A., Westermark, B., Heldin, C.H., Huang, J.S. & Deuel, T.F. (1983) Platelet-derived growth factor is structurally related to the putative transforming protein p28sis of simian sarcoma virus. *Nature*, 304(5921), 35-39.

Watson, J.D. & Crick, F.H.C. (1953) Molecular structure of nucleic acids. *Nature*, 171, 737-738.

Wilbur, W.J. & Lipman, D.J. (1983) Rapid similarity searches of nucleic acid and protein data banks. *Proc. Natl. Acad. Sci. USA*., 80(3), 726-730.

Wood, D.W.; Setubal, J.C., Kaul, R., Monks, D.E., Kitajima, J.P. *et al*. (2001). The Genome of the Natural Genetic Engineer *Agrobacterium tumefaciens* C58. *Science*, 294, 2317-2323.


Wu, C.H.; Yeh, L.S., Huang, H., Arminski, L., Castro-Alvear, J., Chen, Y., Hu, Z., Kourtesis, P., Ledley, R.S., Suzek, B.E., Vinayaka, C.R., Zhang, J. & Barker, W.C. (2003) The Protein Information Resource. *Nucleic Acids Res*., 31(1), 345-347.

Wyckoff, H.W.; Hardman, K.D., Allewell, N.M., Inagami, T., Johnson, L.N. & Richards, F.M. (1967). The structure of ribonuclease-S at 3.5 Å resolution. *J. Biol. Chem.*, 242, 3984–3988.

Zdobnov, E.M., Lopez, R., Apweiler, R. & Etzold, T. (2002) The EBI SRS server – new features. *Bioinformatics*, 18, 1149-1150.

## **Part 2**

**Data Integration**

**2** 


### **Data Integration in Bioinformatics: Current Efforts and Challenges**

Zhang Zhang¹, Vladimir B. Bajic¹, Jun Yu², Kei-Hoi Cheung³,⁴,⁵,⁶ and Jeffrey P. Townsend⁶,⁷

*¹Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal*

*²CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing*

*³Center for Medical Informatics, ⁴Department of Computer Science, ⁵Department of Genetics, ⁶Program in Computational Biology and Bioinformatics, ⁷Department of Ecology and Evolutionary Biology, Yale University, New Haven, Connecticut*

*¹Kingdom of Saudi Arabia, ²China, ³,⁴,⁵,⁶,⁷United States of America*

#### **1. Introduction**

With the rapid advancements in next-generation sequencing (NGS) technologies and the consequently fast-growing volume of biological data, a diversity of data sources (databases and web servers) have been created to facilitate data management, accessibility, and analysis. A prerequisite of bioinformatics research has been the ability to find, maneuver and access data deposited in various data sources. For a given bioinformatic task, researchers often need to be skillful in interrogating these data sources, and in the use of extracted information for further data analysis/information search. For example, one must obtain data from one data source, reformat the data and submit to another data source for analysis, parse the analyzed result, and then combine the result with data obtained from the third data source, etc. Undisputedly, data integration becomes tedious and time-consuming, especially regarding the import and export of enormous files of modern NGS and other data. Thus, integration of data from distributed, heterogeneous and voluminous data sources turns out to be a significant obstacle to fully exploit the wealth of big biological data (Davidson, et al., 1995; Stein, 2002). The importance of the integration component of research stemming from studies based on high-throughput technologies (such as NGS), is twofold: (1) due to the great level of automation of the actual experimental procedures, the effort of obtaining the experimental data takes only about 20% or less of the overall research effort in an NGS project; approximately four fifths of the effort goes to the integration and analysis of a collection of the experimental data (Mardis, 2010); (2) the answers to the most important, complex biological questions today are rarely provided directly through the experimental

Data Integration in Bioinformatics: Current Efforts and Challenges 43

sources, transforming the data and importing it into the data warehouse. Representative

 Atlas (Shah, et al., 2005) is a biological data warehouse that locally stores and integrates biological sequences, molecular interactions, homology information, functional annotations of genes, and biological ontologies. It includes data from BIND, DIP, Entrez Gene (Maglott, et al., 2011), GO, GenBank, HomoloGene, HPRD (Human Protein Reference Database) (Keshava Prasad, et al., 2009), IntAct, LocusLink (Pruitt and Maglott, 2001), MINT, RefSeq, OMIM (Online Mendelian Inheritance in Man) (Amberger, et al., 2009), Taxonomy, and UniProt (The UniProt Consortium, 2011). BioWarehouse (Lee, et al., 2006) is an open source toolkit for constructing data warehouses. It incorporates data from BioCyc (Karp, et al., 2005), CMR, ENZYME (Bairoch, 2000), GenBank, GO, KEGG, Taxonomy, and UniProt and integrates its component databases into a common representational framework within a single

 BIOZON (Birkland and Yona, 2006) is a unified biological resource on DNA sequences, proteins, complexes and cellular pathways. It relies on an extensive database schema that integrates information at the macro-molecular level as well as at the cellular level from a variety of data sources, including BIND, DIP, Genbank, InterPro (Hunter, et al., 2009), KEGG, PDB, RefSeq, Swiss-Prot (Bairoch, et al., 2004), UniGene (Sayers, et al.,

 COLUMBA (Trissl, et al., 2005) is an integrated database of information on proteins, structures and annotations. It integrates twelve different databases, including CATH,

 VINEdb (Hariharaputran, et al., 2007) is a data warehouse for integration and interactive exploration of life science data. It manages diverse data from GO, IntAct, KEGG, OMIM, and UniProt and emphasizes the visualization of the integrated data in a

The data warehouse approach has several advantages. (1) The user does not need to access many web sites for multiple data sources. Data warehouses provide one single access point to conveniently manipulate a large variety of data. (2) All queries requested by users are executed within the data warehouse (rather than on distributed data sources) and therefore, data warehousing eliminates network bottlenecks and obtains high performance with fast response. (3) Due to data storage at a single managed point, data warehousing obtains

Despite its advantages, the data warehouse approach has a major problem; it requires continuous and often human-guided updates to keep the data comprehensive of the evolution of data sources, resulting in high costs for maintenance. In general, there are two kinds of changes. (1) Changes in data volume or revisions of data. Whenever extant data is revised or the volume of data in any data source is changed, the data warehouse must monitor for such remote changes and update the warehouse to store the new data. (2) Changes in data structure, including adding new data types and tables, changing database tables and their relationships, and changing output formats. Many biological data sources change their data structures roughly twice a year (Stein, 2003). Whenever the data sources change their data structures, consequent data translation into the data warehouse must be updated in response. Usually, modification of data translation is labor-intensive and

ENZYME, GO, KEGG, PDB, SCOP (Andreeva, et al., 2008), and Swiss-Prot.

benefits in data control, yielding easy customization to meet users' needs.

examples of data warehousing include:

database management system.

2011), and UniProt.

comprehensible manner.

expensive.

results; to bring potential answers to the surface, downstream bioinformatics analysis often involves the integration of diverse data from multiple data sources.

The objective of data integration in bioinformatics is to establish automated and efficient ways to integrate large, heterogeneous biological datasets from multiple sources. However, this objective is challenged by data sources that are geographically distributed and heterogeneous in terms of their functions, structures, data access methods and dissemination formats. According to the 2010 update on the Bioinformatics Links Directory (Brazas, et al., 2010), there are almost 1500 unique publicly available data sources. Based on their functions, data sources can be classified into diverse categories: (1) sequence databases, e.g., GenBank (Benson, et al., 2006), RefSeq (Pruitt, et al., 2009), CMR (Comprehensive Microbial Resource) (Davidsen, et al., 2010); (2) functional genomics databases, e.g., ArrayExpress (Parkinson, et al., 2011), FFGED (Filamentous Fungal Gene Expression Database) (Zhang and Townsend, 2010), GEO (Gene Expression Omnibus) (Barrett, et al., 2011); (3) protein-protein interaction databases, e.g., BIND (Biomolecular Interaction Network Database) (Bader, et al., 2003), DIP (Database of Interacting Proteins) (Salwinski, et al., 2004), IntAct (Aranda, et al., 2010), MINT (Molecular Interactions Database) (Ceol, et al., 2010); (4) pathway databases, e.g., KEGG (Kyoto Encyclopedia of Genes and Genomes) (Kanehisa, et al., 2010); (5) structure databases, e.g., CATH (Greene, et al., 2007), PDB (Protein Data Bank) (Rose, et al., 2011); (6) annotation databases, e.g., GO (Gene Ontology) (Ashburner, et al., 2000), NCBI Taxonomy (Sayers, et al., 2011). Moreover, data sources differ in data accessibility and dissemination. That is, data source managers make different levels of provision for human reading, computer reading, or both. Data sources can also be classified by species of interest, such as filamentous fungi (Zhang and Townsend, 2010), fly (Gilbert, 2007), mouse (Blake, et al., 2011), and yeast (Engel, et al., 2010).

Despite the challenges, the promise of data integration is high: heterogeneous data sources provide biological data encompassing a wide range of research fields. Therefore, data integration has the potential to facilitate a better and more comprehensive scope of inference for biological studies. Although efforts have been devoted to biological data integration over the past two decades, it remains challenging and laborious. Here we review current efforts and illustrate several approaches used for data integration. With specific consideration of the exponentially growing NGS data, we also describe challenges in this context and discuss potential trends.

#### **2. Current efforts of data integration in bioinformatics**

Several major approaches have been proposed for data integration, which can be roughly classified into five groups (Goble and Stevens, 2008; Zhang, et al., 2009): data warehousing, federated databasing, service-oriented integration, semantic integration and wiki-based integration. Across all of these groups, to a significant extent, an increasingly important component of data integration is the community effort in developing a variety of biomedical ontologies (see Section 3.2), to deal in a more specific manner with the technicality and globality of descriptors and identifiers of information that has to be shared and integrated across various resources (Antezana, et al., 2009; Maojo, et al., 2011; Rubin, et al., 2008).

#### **2.1 Data warehousing**

The data warehouse approach offers a "one-stop shop" solution to ease access and management of a large variety of biological data from different data sources. Data warehouses focus on data translation, fetching all accessible data from many disparate data sources, transforming the data and importing it into the data warehouse. Representative examples of data warehousing include:


The data warehouse approach has several advantages. (1) The user does not need to access many web sites for multiple data sources. Data warehouses provide one single access point to conveniently manipulate a large variety of data. (2) All user queries are executed within the data warehouse (rather than on distributed data sources); data warehousing therefore avoids network bottlenecks at query time and responds quickly. (3) Because data is stored at a single managed point, data warehousing simplifies data control and customization to meet users' needs.

Despite its advantages, the data warehouse approach has a major problem: it requires continuous and often human-guided updates to keep pace with the evolution of the data sources, resulting in high maintenance costs. In general, there are two kinds of changes. (1) Changes in data volume or revisions of data. Whenever extant data is revised or the volume of data in any data source is changed, the data warehouse must monitor for such remote changes and update the warehouse to store the new data. (2) Changes in data structure, including adding new data types and tables, changing database tables and their relationships, and changing output formats. Many biological data sources change their data structures roughly twice a year (Stein, 2003). Whenever the data sources change their data structures, the corresponding data translation into the data warehouse must be updated in response. Usually, modification of data translation is labor-intensive and expensive.
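The translate-and-load cycle and the need to monitor sources for changes can be illustrated with a minimal sketch. It is not the design of any warehouse named in this chapter: the two source dumps, the table layout, and the function names (`translate_a`, `translate_b`, `load`) are hypothetical, and an in-memory SQLite database stands in for the warehouse. A checksum per source is one simple way to detect that a remote dump has changed.

```python
import csv
import hashlib
import io
import sqlite3

# Hypothetical dumps from two sources with incompatible layouts (assumption).
SOURCE_A = "gene_id\tsymbol\nG1\thxk1\nG2\tpgi1\n"            # tab-delimited
SOURCE_B = "id,name,organism\nG1,HXK1,yeast\nG3,PFK1,yeast\n"  # comma-separated


def translate_a(text):
    """Translate source A rows into the warehouse schema (gene_id, symbol)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [(row["gene_id"], row["symbol"].upper()) for row in reader]


def translate_b(text):
    """Translate source B rows into the warehouse schema (gene_id, symbol)."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["id"], row["name"].upper()) for row in reader]


def build_warehouse():
    """Create an empty in-memory warehouse with a table tracking source state."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (gene_id TEXT, symbol TEXT, source TEXT)")
    conn.execute("CREATE TABLE source_state (name TEXT PRIMARY KEY, digest TEXT)")
    return conn


def load(conn, name, text, translate):
    """Reload one source only if its dump changed; return True if reloaded."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    known = conn.execute(
        "SELECT digest FROM source_state WHERE name = ?", (name,)
    ).fetchone()
    if known is not None and known[0] == digest:
        return False  # nothing changed at the source, keep the stored copy
    rows = [(gene_id, symbol, name) for gene_id, symbol in translate(text)]
    with conn:  # one transaction: replace old rows and record the new digest
        conn.execute("DELETE FROM gene WHERE source = ?", (name,))
        conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", rows)
        conn.execute(
            "INSERT OR REPLACE INTO source_state VALUES (?, ?)", (name, digest)
        )
    return True


warehouse = build_warehouse()
load(warehouse, "source_a", SOURCE_A, translate_a)
load(warehouse, "source_b", SOURCE_B, translate_b)

# The query runs entirely inside the warehouse, not on the remote sources.
rows = warehouse.execute(
    "SELECT gene_id, symbol, source FROM gene WHERE symbol = 'HXK1' ORDER BY source"
).fetchall()
print(rows)
```

Note that each source needs its own translation function. When a source changes its structure, that function is the part that must be rewritten, which is the maintenance cost described above.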

#### **2.2 Federated databasing**

Unlike data warehousing (with its focus on data translation), federated databasing focuses on query translation. The federated databasing approach executes all queries on the distributed sources by translating a query against the federated database into a query against many data sources. The federated database fetches the data from disparate data sources and then displays the fetched data for its user base. Representative examples of federated databasing include:

- BioMart (Haider, et al., 2009) is a query-oriented data integration system developed jointly by the Ontario Institute for Cancer Research (OICR) and the European Bioinformatics Institute (EBI). It provides a user-friendly and unified way to retrieve data from one or multiple data sources located at diverse geographical locations, including Ensembl (Flicek, et al., 2011), HGNC, Uniprot, Reactome (Croft, et al., 2011), Wormbase, and PRIDE (Jones, et al., 2008).
- DiscoveryLink (Haas, et al., 2001), developed by IBM, is a system for integrated access to life sciences data from heterogeneous data sources, including GenBank, MedLine and Swiss-Prot. It features query optimization and cross-source queries that access relational databases and retrieve the data from diverse data sources.
- K2/Kleisli (Chung and Wong, 1999; Davidson, et al., 2001) is a federated database system, integrating data from EcoCyc (Keseler, et al., 2011), GenBank, GSDB (Harger, et al., 1998), dbEST (Boguski, et al., 1993), GDB (Letovsky, et al., 1998), KEGG and SRS-indexed databases. Kleisli uses a high-level query language called Collection Programming Language (CPL), which was developed specifically for parsing, optimizing and executing queries. K2 is the newer version of Kleisli and replaces CPL with a powerful and easy-to-use SQL-like query language, Object Query Language (OQL).
- MRS (Hekkelman and Vriend, 2005) allows for very rapid queries in a large number of flat-file data banks, including EMBL, UniProt, OMIM, dbEST, PDB, KEGG. It combines a fast and reliable backend with a very user-friendly implementation of all the commonly used information retrieval facilities.
- QIS (Query Integrator System) is based on a set of distributed network-based servers, data source servers, integration servers, and ontology servers and relies on a combination of SQL-like syntax and XML (eXtensible Markup Language; a widely used standard for data description and exchange) to formulate a query (Marenco, et al., 2004). It stores diverse queries for data integration from continuously changing heterogeneous data sources in the biosciences, including CellPropDB (Crasto and Shepherd, 2007), Brain Architecture Management System (Bota and Swanson, 2010), Yale Microarray Database (Cheung, et al., 2002), a local Gene Annotation Database and GO.
- SRS (Sequence Retrieval System) is an index-based integration system and combines some features of data warehousing and federated databasing (Zdobnov, et al., 2002). SRS uses a keyword-based indexing language, ICARUS, to describe each integrated data source and locally creates a full-text index over all data sources. Meanwhile, it allows a single query to execute on multiple data sources based on local indexed entries. SRS contains a number of biological databases (see details in http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+databanks+-noSession).
- TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources) is an integration application to perform bioinformatics tasks over multiple data sources by using an ontology of biological concepts (Stevens, et al., 2000). The prototype version of TAMBIS contains five data sources, viz., BLAST, CATH, ENZYME, PROSITE (Sigrist, et al., 2010), and Swiss-Prot.

Queries in federated databases are executed within remote data sources and results displayed in federated databases are extracted remotely from the data sources. Due to this capability, federated databasing has two major advantages. (1) Federated databases can be regarded as an on-demand approach to provide immediate access to up-to-date data deposited in multiple data sources. (2) Compared with data warehousing, federated databasing does not replicate data in data sources; therefore, its costs for storage and curation are relatively low. However, federated databasing still has to update its query translation to keep pace with data access methods at diverse remote data sources. In addition, since data is retrieved from remote data sources, the performance of federated databasing depends heavily on network connectivity and query complexity, which may lead to low efficiency and speed in data retrieval.

#### **2.3 Service-oriented integration**

Data warehousing and federated databasing both focus on centralizing data access, through data translation and query translation, respectively. They confront some similar problems stemming from data storage and curation, frequent updates, and high costs for data exchange and/or maintenance. In part to evade these issues, a decentralized approach has also been advanced, in which individual data sources agree to open their data via Web Services (WS). WS are designed for communication between computers over the Web and described by the Web Services Description Language (WSDL). There are several different approaches to WS, e.g., SOAP (Simple Object Access Protocol; a protocol for exchanging XML-based messages over computer networks) and REST (REpresentational State Transfer; a lightweight architectural style implemented using HTTP methods). WS support computer-to-computer interaction through Web Application Programming Interface (Web API) (Shi, 2007) and can perform a database query or computation. In the context of data integration, data can be programmatically accessed via WS and data sources serve as service providers. Therefore, this approach can be seen as a service-oriented approach. The service-oriented approach enables data integration from multiple heterogeneous data sources through computer interoperability. Several representative examples of service-oriented integration include:

- BioMOBY (Kawas, et al., 2006; Wilkinson and Links, 2002; Wilkinson, et al., 2008) is an open source ontology-based integration system for accessing distributed and heterogeneous data sources via WS. It implements a WS registry and uses standard ontology terms to annotate WS. BioMOBY adopts SOAP for data exchange and allows interoperability among different data sources to achieve automated data integration and sharing (Neerincx and Leunissen, 2005).
- DAS (Distributed Annotation System) is a client-server system to provide access to complete distributed genome annotations using SOAP-based WS (Dowell, et al., 2001; Katayama, et al., 2010; Olason, 2005). It allows a single machine to collect all annotations from multiple distributed data sources and display them to the user in a single view. DAS is widely used in the genome annotation community (http://en.wikipedia.org/wiki/Distributed\_Annotation\_System) and adopted by several systems, including Ensembl, WormBase, and the Berkeley Drosophila Genome Project (Jenkinson, et al., 2008; Messina and Sonnhammer, 2009; Olason, 2005).

- Taverna (Oinn, et al., 2004), a part of MyGrid (Stevens, et al., 2003), is a graphical workflow workbench application, aiming to integrate the growing number of molecular biology tools and databases (Hull, et al., 2006). Workflows in Taverna, written in a custom XML-based language called Simple Conceptual Unified Flow Language (SCUFL), can automatically record all data involved, provenance metadata, and results, facilitating complex data processing in a dynamic distributed environment.

The service-oriented approach features data integration through computer-to-computer communication via Web API and up-to-date data retrieval from diverse data sources. Thus, it fits well with the dynamic nature of bioinformatics. However, it remains challenging, primarily because its success in heterogeneous data integration requires that many data sources become service providers by opening their data via WS and by standardizing data identities and nomenclature to ease data exchange and analysis. In addition, a unified WS registry is also needed, not only to establish standards for WS registration, but also to formulate standards for service-oriented workflows or pipelines (Zhang, et al., 2009).

#### **2.4 Semantic integration**

Most web pages in biological data sources are designed for human reading (e.g., HTML). The Semantic Web (Dibernardo, et al., 2008; Good and Wilkinson, 2006; Hendler, 2003; Lord, et al., 2004) aims to describe data in a way that computers can understand and to build an interconnected network that computers can easily and unambiguously process. According to the statement of definition from the World Wide Web Consortium (W3C), the purpose of the Semantic Web is to create a universal medium for the exchange of data using several standards, including Resource Description Framework (RDF; http://www.w3.org/RDF), RDF Schema (RDFS; RDF Vocabulary Description Language; http://www.w3.org/TR/rdfschema), Web Ontology Language (OWL; http://www.w3.org/owl), and SPARQL, the standard query language for RDF (http://www.w3.org/TR/rdf-sparql-query). RDF provides standard formats (e.g., XML format) for data interchange and describes data as simple statements, each of which is a triple consisting of a *subject*, a *predicate* and an *object*. Any two statements can be linked by an identical *subject* or *object*. OWL builds on RDF and Uniform Resource Identifier (URI) and describes data structure and meaning based on ontology, which enables automated data reasoning and inferences by computers. The Semantic Web provides a machine-readable way for data representation and interoperability (Antezana, et al., 2009). Several studies have applied the Semantic Web technologies in data integration and representative examples of semantic integration are described below.

- Bio2RDF (Belleau, et al., 2008) is a mashup system that creates an integrated space of RDF documents linked together with normalized URIs. Bio2RDF applies the Semantic Web technologies to multiple data sources, such as Entrez Gene, HGNC, KEGG, MGI, OMIM, PDB, PubMed and UniProt, and converts data into RDF format based on RDFizer (a set of tools for converting various data formats into RDF; http://simile.mit.edu/wiki/RDFizers), Sesame (an open source framework for storage, inference and querying of RDF data; http://www.openrdf.org) and OWL ontology. In Bio2RDF, each RDF document is identified by a URI. When Bio2RDF is queried for a given URI, for example, http://bio2rdf.org/go:0004396, the URI identifies RDF triples containing the GO term of Hexokinase (GO:0004396). Bio2RDF supports query via SPARQL.
- HCLS (The Health Care and Life Sciences Interest Group; http://www.w3.org/2001/sw/hcls/), established by W3C, aims to explore the potential benefits of the Semantic Web in the health care and life sciences domains (Cheung, et al., 2008) and advocates the application of the Semantic Web for advancing translational research (Ruttenberg, et al., 2007). The HCLS Knowledge Base (HCLS-KB; http://www.w3.org/TR/hcls-kb) is a Semantic Web system that imports data from many data sources in multiple domains of life sciences, including not only general sources, e.g., Entrez Gene, GO, HomoloGene, but also domain-specific sources, e.g., Allen Brain Atlas (an interactive, genome-wide image database of gene expression in the mouse brain; http://www.brain-map.org) (Lein, et al., 2007), SenseLab (a collection of neuroscience data; http://neuroweb.med.yale.edu/senselab) (Crasto, et al., 2007) and SWAN (Semantic Web Applications in Neuromedicine; aiming to organize and annotate scientific knowledge about Alzheimer disease and other neurodegenerative disorders) (Ciccarese, et al., 2008; Clark and Kinoshita, 2007; Kinoshita and Clark, 2007).
- YeastHub (Cheung, et al., 2005) is an integrated database in RDF format for the yeast community. It creates an RDF repository for RDF storage and provides a utility to convert tabular format into RDF format. YeastHub integrates different types of yeast data provided by different data sources (SGD, YGDP, MIPS, BIND, GO and TRIPLES) and supports RDF-based queries to retrieve the data.
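The triple model and the way statements link through a shared *subject* or *object* can be sketched without any RDF library. The code below is only an illustration under stated assumptions: the GO URI is the one quoted in the text, `rdfs:label` is the standard RDFS label property, while the gene URI, the `hasFunction` predicate and the helper names `match` and `follow` are hypothetical. The pattern matching is a much simplified stand-in for what a SPARQL query over a real RDF store does.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

GO_TERM = "http://bio2rdf.org/go:0004396"               # URI quoted in the text
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"    # standard rdfs:label
GENE = "http://example.org/gene/HXK1"                   # hypothetical URI
HAS_FUNCTION = "http://example.org/vocab/hasFunction"   # hypothetical predicate

# Illustrative statements only; they are not taken from any real data source.
TRIPLES: List[Triple] = [
    (GENE, HAS_FUNCTION, GO_TERM),
    (GO_TERM, LABEL, "hexokinase activity"),
    (GENE, LABEL, "HXK1"),
]


def match(triples: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def follow(triples: List[Triple], start: str,
           first_predicate: str, second_predicate: str) -> List[str]:
    """Join two statements: the object of the first is the subject of the second."""
    results = []
    for _, _, middle in match(triples, s=start, p=first_predicate):
        for _, _, end in match(triples, s=middle, p=second_predicate):
            results.append(end)
    return results


function_labels = follow(TRIPLES, GENE, HAS_FUNCTION, LABEL)
print(function_labels)  # ['hexokinase activity']
```

Because the link between the two statements is nothing more than an identical URI, statements published by different data sources can be joined as soon as they use the same identifier, which is why normalized URIs matter for this approach.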


Application of the Semantic Web technologies to biological data integration is a significant advancement for bioinformatics, enabling automated data processing and reasoning. Semantic integration uses ontologies for data description and thus represents ontology-based integration (Noy, 2004). However, the Semantic Web continues to evolve and its application in biological data integration has several limitations. Semantic integration locally stores a large collection of RDF documents, by copying data from multiple data sources and converting data into RDF format. From this view, semantic integration can be regarded as a special data warehouse with data in RDF format. As a consequence, it inherits the pros and cons of data warehousing and is vulnerable to updates in data sources. To keep the RDF documents up-to-date, it requires tedious and periodic data retrieval and RDF conversion. In addition, once any data source changes its data structure, the RDF conversion scripts must be updated accordingly.

Currently, there is an ongoing project, the World Wide Web Consortium's SWEO (Semantic Web Education and Outreach) Linking Open Data Project (Bizer, 2009; Zhao, et al., 2009), that uses the Semantic Web technologies to connect related distributed data across the Web. Technically, linked data rely on RDF to create typed links between data from different data sources. Linked data are machine-readable, explicitly defined, and inter-linked to other data, promising to facilitate data integration, exposure, sharing, and connecting.

#### **2.5 Wiki-based integration**

A weakness common to all the above approaches is that too few users participate in the process. As the volume of biological data grows, data integration will inevitably require the participation of a large number of users. A successful example of harnessing collective intelligence for data aggregation and knowledge collection is Wikipedia, an online encyclopedia (http://www.wikipedia.org) that allows any user to create and edit content. Wikipedia features collaborative integration, continuous and frequent updates, up-to-date content, broad coverage and low maintenance cost (McLean, et al., 2007). Although there are fears of inconsistency and inaccuracy, since users can freely and anonymously change or add content in the wiki (Arita, 2009; Bidartondo, 2008), a comparison of science entries found Wikipedia to be close to a traditional encyclopedia in accuracy (Giles, 2005).

Given the success of Wikipedia, wiki-based approaches to storing, managing and organizing biological data have been proposed (Giles, 2007; Salzberg, 2007; Waldrop, 2008; Yager, 2006). Wiki-based integration makes full use of collective intelligence and effort for biological data integration. Representative examples include WikiGenes (a wiki system that combines gene annotation with explicit authorship; Hoffmann, 2008), WikiProteins (a wiki-based system for protein annotation; Mons, et al., 2008), BOWiki (an ontology-based wiki for data annotation and knowledge integration; Hoehndorf, et al., 2009), Gene Wiki (a wiki for human gene annotation; Huss, et al., 2010; Huss, et al., 2008) and PDBWiki (a scientific wiki for the community annotation of protein structures; Stehr, et al., 2010). However, wiki-based integration has its own shortcomings, including the unstructured nature of the data generated, the lack of a standard format for data exchange, the lack of credit for authorship and vulnerability to malicious editing (Lee, 2008; Potthast, et al., 2008).

#### **3. Challenges ahead**

Although considerable effort has been devoted to data integration, no single approach has yet achieved a pre-eminent impact on the field. Because NGS data are growing at an exponential rate, the need for data integration is becoming ever more pressing and the associated challenges are increasing accordingly.

#### **3.1 Data as a service**

Low-cost, high-throughput NGS technologies can generate huge amounts of data in a relatively short period. To keep pace with the revolution in sequencing technologies, genome sequencing projects have moved from classical model organisms (e.g., fly, mouse, yeast) to other organisms (e.g., camel, dog, panda) and eventually to sequencing individuals within populations, as exemplified by the 1000 Genomes Project, a collection of the genomes of 1,000 humans (http://www.1000genomes.org), and the Genome 10K Project, a genomic zoo of genome sequences of 10,000 vertebrate species (http://www.genome10k.org). The era of \$1000 personal genome sequencing is expected to arrive within the next few years and will produce data on an unparalleled scale, presenting considerable challenges for data integration.

It is infeasible to integrate such large amounts of data at a single point (such as a data warehouse). Data sources are developed for different purposes and fulfill different functions. A more promising route is therefore to establish an efficient way of exchanging data among these distributed and heterogeneous data sources. However, a number of data sources are designed merely for data storage, not for data exchange. The growing volume of biological data also requires "computer-readable" approaches to data integration. To ease data integration, data sources need to become service providers. In other words, data sources should not only provide data for human reading through web interfaces (e.g., HTML), but also provide data for computer interoperability via WS. Service providers supply data as a WS, facilitating computer-to-computer interactions and thus enabling automated data integration from multiple data sources (Hansen, et al., 2003). As mentioned, several different protocols can be used for creating WS. Among them, SOAP and REST have been widely adopted (Figure 1). SOAP is a well-defined standard with XML-structured messaging for request and response, whereas REST is relatively lightweight, relying on HTTP methods (viz., POST, GET, PUT or DELETE). Most commercial applications expose their services as RESTful Web APIs (Figure 1), largely because of their simplicity and ease of implementation.
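To make the contrast between the two protocols concrete, the following is a minimal sketch in Python of how the same logical request (retrieve one sequence record) might be expressed in each style. The host `example.org`, the resource paths, the `getSequence` operation and the `urn:example:seq` namespace are hypothetical placeholders, not the API of any real data source, and nothing is sent over the network.

```python
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "http://example.org/api"  # hypothetical service root

def rest_request(accession):
    """REST style: the resource is named by the URL, the action by the HTTP method."""
    url = f"{BASE_URL}/sequences/{accession}"
    return urllib.request.Request(url, method="GET",
                                  headers={"Accept": "application/xml"})

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:seq"  # hypothetical service namespace

def soap_request(accession):
    """SOAP style: the operation and its arguments travel in an XML envelope sent by POST."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    operation = ET.SubElement(body, f"{{{SERVICE_NS}}}getSequence")
    ET.SubElement(operation, f"{{{SERVICE_NS}}}accession").text = accession
    payload = ET.tostring(envelope, encoding="utf-8")
    return urllib.request.Request(f"{BASE_URL}/soap", data=payload, method="POST",
                                  headers={"Content-Type": "text/xml; charset=utf-8"})

# Build (but do not send) both requests for the same made-up accession.
rest = rest_request("ABC123")
soap = soap_request("ABC123")
print(rest.get_method(), rest.full_url)
print(soap.get_method(), soap.full_url, len(soap.data), "bytes of XML")
```

In the REST form the whole request is visible in the method and URL, whereas the SOAP form needs an XML payload even for a simple lookup, which is one way to read the remark about simplicity above.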

Fig. 1. Statistics of Web API protocols (obtained from http://www.programmableweb.com/apis, which collects more than 3,000 Web APIs; last access: February 27, 2011).

#### **3.2 Standards for biological data**

Owing to the complex nature of biology, there is a wide variety of biological data types, e.g., sequence data, gene expression data, protein-protein interaction data and pathway data (Karasavvas, et al., 2004). Data sources store different data types in different formats (Li, 2006): flat files (e.g., tab-delimited files), sequence files (e.g., FASTA), structure files (e.g., PSF, Protein Structure File) and XML files (e.g., KGML, the KEGG Markup Language for describing graph objects). Data sources often adopt their own preferred formats; even for the same data type, the formats used by different sources are often incompatible. New data formats are also frequently invented as related technologies develop. Examples of such formats include SAM (Sequence Alignment/Map; a generic nucleotide alignment format that describes the alignment of query sequences or sequencing reads to a reference sequence or assembly; Li, et al., 2009) and GVF (Genome Variation Format; a simple tab-delimited format for describing genome variation data; Reese, et al., 2010). In addition, data sources output their data in diverse formats, such as HTML, raw file formats and XML-based file formats. Taken together, diverse and heterogeneous data formats complicate data exchange, posing challenges for data integration.
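As a small illustration of why incompatible formats complicate integration, the sketch below reads the same kind of information (an identifier and a sequence) from a FASTA record and from a tab-delimited line, and maps both onto one common record type. The `SeqRecord` type, the two-column layout assumed for the tab-delimited file and the sample data are assumptions made for illustration only.

```python
import csv
import io
from typing import Iterator, NamedTuple

class SeqRecord(NamedTuple):
    """Common in-memory representation shared by all parsers."""
    identifier: str
    sequence: str

def parse_fasta(text: str) -> Iterator[SeqRecord]:
    """Parse FASTA: a '>' header line followed by one or more sequence lines."""
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield SeqRecord(identifier, "".join(chunks))
            # The identifier is the first word after '>' (empty if the header is bare).
            words = line[1:].split()
            identifier, chunks = (words[0] if words else ""), []
        else:
            chunks.append(line)
    if identifier is not None:
        yield SeqRecord(identifier, "".join(chunks))

def parse_tab_delimited(text: str) -> Iterator[SeqRecord]:
    """Parse an assumed two-column layout: identifier <TAB> sequence."""
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) >= 2:
            yield SeqRecord(row[0], row[1])

fasta_text = ">seq1 example record\nACGT\nTTGA\n>seq2\nGGCC\n"
tab_text = "seq3\tATAT\nseq4\tCGCG\n"

# Once both sources are mapped to SeqRecord, they can be merged directly.
records = list(parse_fasta(fasta_text)) + list(parse_tab_delimited(tab_text))
for record in records:
    print(record.identifier, record.sequence)
```

Every additional format requires another parser of this kind, which is the maintenance burden that shared standards aim to remove.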

Standards for biological data formats can ease data exchange and integration. One successful attempt concerns the standardization of biological pathway data. Pathway-related data sources differed in their data representation, making data integration difficult and inefficient. For this reason, BioPAX (Demir, et al., 2010) was developed to deliver a common standard, facilitating the integration, exchange, visualization and analysis of biological pathway data. Another effort to cope with incompatibilities among bioinformatics repositories has addressed the standardization of data exchange formats and WS (Katayama, et al., 2010). In short, establishing standard formats for biological data enables efficient data exchange and integration. In turn, standard data formats facilitate subsequent data analysis and visualization as well as downstream software development.


Amberger, J.*, et al.* (2009) McKusick's Online Mendelian Inheritance in Man (OMIM), *Nucleic* 

Andreeva, A.*, et al.* (2008) Data growth and its impact on the SCOP database: new

Antezana, E.*, et al.* (2009) Biological knowledge management: the emerging role of the

Aranda, B.*, et al.* (2010) The IntAct molecular interaction database in 2010, *Nucleic Acids Res*,

Arita, M. (2009) A pitfall of wiki solution for biological databases, *Brief Bioinform*, 10, 295-

Ashburner, M.*, et al.* (2000) Gene ontology: tool for the unification of biology. The Gene

community of the people, by the people, and for the people.

developments, *Nucleic Acids Res*, 36, D419-425.

Ontology Consortium, *Nat Genet*, 25, 25-29.

Semantic Web technologies, *Brief Bioinform*, 10, 392-407.

*Acids Res*, 37, D793-796.

38, D525-531.

296.


Equally important, data integration also requires standardized nomenclature and ontologies for biological data (Rubin, et al., 2008). Suppose two data sources need to exchange gene annotations: they must share a standard for gene names, otherwise any ambiguity or inconsistency in nomenclature becomes a burden on data integration. Attention has been paid to standardizing nomenclature and ontologies for biological data, e.g., BioPortal (Noy, et al., 2009; Rubin, et al., 2006) for integrating and sharing biomedical ontologies at the National Center for Biomedical Ontology, GO (Ashburner, et al., 2000) for standardizing the representation of gene and gene product attributes, HGNC (Seal, et al., 2011) for standardizing human gene symbols and names, and OBO (Open Biomedical Ontologies; Smith, et al., 2007) for creating a suite of orthogonal interoperable reference ontologies in the biomedical domain. However, a centralized system for standardizing nomenclature and ontologies may not keep pace with the rapid accumulation of biological data, and any gap in standardization creates difficulties for data integration. A wiki-based system might be a promising way to harness the efforts of all communities in standardizing nomenclature and ontologies collaboratively and efficiently.
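The gene-name scenario above can be sketched as follows: two sources annotate the same genes under different names, and a shared alias table maps each name to one approved symbol before the annotations are merged. The alias table here is a tiny hand-written stand-in for a curated nomenclature resource, and the annotation strings and the name `geneX` are sample data made up for illustration.

```python
from collections import defaultdict

# Hand-written alias table (illustrative only): alias -> approved symbol.
ALIAS_TO_SYMBOL = {
    "TP53": "TP53", "P53": "TP53",
    "ERBB2": "ERBB2", "HER2": "ERBB2",
}

def normalize(name):
    """Return the approved symbol for a gene name, or None if it is unknown."""
    return ALIAS_TO_SYMBOL.get(name.strip().upper())

def merge_annotations(*sources):
    """Merge {gene name: annotation} mappings, keyed by approved symbol."""
    merged = defaultdict(list)
    unresolved = []
    for source in sources:
        for name, annotation in source.items():
            symbol = normalize(name)
            if symbol is None:
                unresolved.append(name)  # left for manual curation
            else:
                merged[symbol].append(annotation)
    return dict(merged), unresolved

source_a = {"p53": "tumor suppressor", "HER2": "receptor tyrosine kinase"}
source_b = {"TP53": "transcription factor",
            "ERBB2": "amplified in some tumors",
            "geneX": "unknown"}

merged, unresolved = merge_annotations(source_a, source_b)
print(merged)
print("unresolved:", unresolved)
```

Names that the table does not cover end up in `unresolved`, which is a small-scale version of the standardization gap discussed above.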

#### **3.3 WS-based pipelines**

The goal of data integration is to combine information from different resources in an automated fashion, without human intervention, so as to handle the increasing accumulation of biological data (Sarkar, et al., 2008). Towards this goal, the data to be integrated should be defined more broadly, to include not merely sequences and other raw data, but also methods, tools, algorithms, analyzed results, discovered knowledge (for knowledge integration, see Clark, 2007) and even connections among people (Zhang, et al., 2009). All kinds of data can be provided as a service. That is, raw data should be accessible via WS; methods, tools and algorithms used to analyze data should be offered as WS (that is, SaaS, Software as a Service); and analyzed results and discovered knowledge should also be delivered as WS (Zhang, et al., 2009). As a result, WS perform a variety of data manipulations, including data retrieval, integration, analysis, visualization and sharing.

A pipeline that combines multiple WS can achieve data integration (Zhang, et al., 2009). Such WS-based pipelines lower technological barriers to entry and provide users with a lightweight programming environment. WS-based pipelines feature computer-to-computer data exchange, simplify data integration and analysis, maximize the scope of sharing and reuse, and serve as a medium for linking users with similar research interests, wherever they are located, so that a scientific social community (SSC) can eventually form. An SSC reflects several key elements of Web 2.0 and enables data integration, analysis and sharing with greater convenience, speed and efficiency (Zhang, et al., 2009). Any user may easily create WS-based pipelines (adding value), publish them online and subscribe to pipelines created by other users. Consequently, pipelines may be widely shared, re-used and even integrated into other pipelines. Communication and collaboration among users in an SSC can thus be greatly increased, making knowledge discovery through collective intelligence possible. In addition, an SSC can also serve as a registry for collecting WS (Bhagat, et al., 2010; Pettifer, et al., 2010).
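A minimal sketch of the pipeline idea follows. Each step stands in for a call to a WS; here the steps are ordinary local functions so that the example runs without a network, and all names (`Pipeline`, `fetch_sequence`, `reverse_complement`, `gc_content`) are hypothetical. Because a pipeline is itself callable, it can be re-used as a step inside another pipeline, mirroring the sharing and nesting described above.

```python
from typing import Any, Callable, Sequence

class Pipeline:
    """Chain of services: the output of each step is the input of the next."""

    def __init__(self, name: str, steps: Sequence[Callable[[Any], Any]]):
        self.name = name
        self.steps = list(steps)

    def __call__(self, data: Any) -> Any:
        for step in self.steps:
            data = step(data)
        return data

# Local stand-ins for remote services (hypothetical data and functions).
_FAKE_DATABASE = {"seq1": "ATGCGC"}

def fetch_sequence(accession: str) -> str:
    """Data retrieval service: accession -> nucleotide sequence."""
    return _FAKE_DATABASE[accession]

def reverse_complement(sequence: str) -> str:
    """Analysis service: reverse-complement a DNA sequence."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(sequence))

def gc_content(sequence: str) -> float:
    """Analysis service: fraction of G and C bases."""
    return sum(base in "GC" for base in sequence) / len(sequence)

# One user builds a pipeline; another re-uses it inside a larger one.
retrieve_and_flip = Pipeline("retrieve-and-flip", [fetch_sequence, reverse_complement])
gc_of_flipped = Pipeline("gc-of-flipped", [retrieve_and_flip, gc_content])

print(retrieve_and_flip("seq1"))
print(gc_of_flipped("seq1"))
```

In a real setting each step would wrap a remote call, and the only contract between steps would be the data format they exchange, which is why the format standards of Section 3.2 matter for pipelines as well.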

#### **3.4 Semantic Web Services**

The ever-evolving next-generation Web (NGW), characterized as the Semantic Web, aims to provide information not only for humans, but also for computers, so that they can semantically process large-scale data and automatically discover knowledge. From this viewpoint, the Semantic Web is well suited to the exponential growth of biological data and holds promise for providing solutions for data integration and advancing translational research (Ruttenberg, et al., 2007). Semantic Web technologies have been applied to data integration, as mentioned above. Nevertheless, these applications are in essence semantic warehouses and still struggle to integrate dynamic data. One potential solution is to combine WS with Semantic Web technologies to provide Semantic WS (Matos, et al., 2010; Vandervalk, et al., 2009), namely, RDF-based WS for automated data processing and reasoning. As mentioned, WS are designed not only to perform queries, but also to conduct computations. Considering that NGS technologies can swiftly generate hundreds of gigabases of sequencing data, WS will become increasingly data-intensive and computation-intensive (e.g., alignment of multiple large-scale sequences). Therefore, to deal with such large-scale data management and analysis, Semantic WS need to adopt advances in high-performance computing (Schadt, et al., 2010), such as cloud/grid computing (Bateman and Wood, 2009; Stein, 2010) and Service-Oriented Computing (Papazoglou, et al., 2008). In addition, a Semantic WS framework (Wilkinson, et al., 2010) is also needed in order to set up Semantic WS workflows or pipelines.
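To illustrate what RDF-style output offers an integrator, the sketch below represents data from two services as subject-predicate-object triples and answers a question by matching triple patterns across both. It is a plain-Python toy, not an RDF library or a SPARQL engine, and the `ex:` identifiers, the predicates and the two service functions are hypothetical.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]

def gene_service() -> Set[Triple]:
    """Hypothetical service returning gene-to-protein statements."""
    return {("ex:geneA", "ex:encodes", "ex:proteinA"),
            ("ex:geneB", "ex:encodes", "ex:proteinB")}

def pathway_service() -> Set[Triple]:
    """Hypothetical service returning protein-to-pathway statements."""
    return {("ex:proteinA", "ex:participatesIn", "ex:pathway1"),
            ("ex:proteinB", "ex:participatesIn", "ex:pathway2")}

def match(graph: Iterable[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching a pattern; None acts as a wildcard."""
    for triple in graph:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple

def genes_in_pathway(graph: Set[Triple], pathway: str) -> Set[str]:
    """Join two patterns: ?gene encodes ?protein, ?protein participatesIn pathway."""
    proteins = {s for s, _, _ in match(graph, p="ex:participatesIn", o=pathway)}
    return {s for s, _, o in match(graph, p="ex:encodes") if o in proteins}

# Because both services emit triples with shared identifiers,
# integration reduces to a set union.
graph = gene_service() | pathway_service()
print(genes_in_pathway(graph, "ex:pathway1"))
```

The point of the sketch is that no source-specific parsing or schema mapping is needed once every service speaks the same triple model; scaling this to NGS-sized data is where the high-performance computing concerns raised above come in.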

#### **4. Conclusions**


As a critical topic in bioinformatics, data integration is of fundamental significance for biological studies. Considerable effort has been devoted to this topic, and approaches to data integration have moved from traditional ones, e.g., data warehousing and federated databasing, to modern ones based on advanced technologies, e.g., Web Services, the Semantic Web and wikis. The rapid development of sequencing technologies poses tremendous challenges for data integration. Integrating large-scale data requires not only the adoption of advances in informatics, but also communication and collaboration among people in related biological communities, in order to maximize data openness via WS, set up standards for biological data, create Semantic WS-based pipelines and form a scientific social community. Such a community harnesses collective intelligence and collaborative efforts for data integration, analysis and sharing, and has the potential to be an ideal community of the people, by the people, and for the people.

#### **5. References**


Amberger, J., *et al.* (2009) McKusick's Online Mendelian Inheritance in Man (OMIM), *Nucleic Acids Res*, 37, D793-796.

Andreeva, A., *et al.* (2008) Data growth and its impact on the SCOP database: new developments, *Nucleic Acids Res*, 36, D419-425.

Antezana, E., *et al.* (2009) Biological knowledge management: the emerging role of the Semantic Web technologies, *Brief Bioinform*, 10, 392-407.

Aranda, B., *et al.* (2010) The IntAct molecular interaction database in 2010, *Nucleic Acids Res*, 38, D525-531.

Arita, M. (2009) A pitfall of wiki solution for biological databases, *Brief Bioinform*, 10, 295-296.

Ashburner, M., *et al.* (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium, *Nat Genet*, 25, 25-29.

Bader, G.D., *et al.* (2003) BIND: the Biomolecular Interaction Network Database, *Nucleic Acids Res*, 31, 248-250.

Bairoch, A. (2000) The ENZYME database in 2000, *Nucleic Acids Res*, 28, 304-305.

Bairoch, A., *et al.* (2004) Swiss-Prot: juggling between evolution and stability, *Brief Bioinform*, 5, 39-55.

Barrett, T., *et al.* (2011) NCBI GEO: archive for functional genomics data sets--10 years on, *Nucleic Acids Res*, 39, D1005-1010.

Bateman, A. and Wood, M. (2009) Cloud computing, *Bioinformatics*, 25, 1475.

Belleau, F., *et al.* (2008) Bio2RDF: towards a mashup to build bioinformatics knowledge systems, *J Biomed Inform*, 41, 706-716.

Benson, D.A., *et al.* (2006) GenBank, *Nucleic Acids Res*, 34, D16-20.

Bhagat, J., *et al.* (2010) BioCatalogue: a universal catalogue of web services for the life sciences, *Nucleic Acids Res*, 38, W689-694.

Bidartondo, M.I. (2008) Preserving accuracy in GenBank, *Science*, 319, 1616.

Birkland, A. and Yona, G. (2006) BIOZON: a hub of heterogeneous biological data, *Nucleic Acids Res*, 34, D235-242.

Bizer, C. (2009) The Emerging Web of Linked Data, *IEEE Intell Syst*, 24, 87-92.

Blake, J.A., *et al.* (2011) The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics, *Nucleic Acids Res*, 39, D842-848.

Boguski, M.S., *et al.* (1993) dbEST--database for "expressed sequence tags", *Nat Genet*, 4, 332-333.

Bota, M. and Swanson, L.W. (2010) Collating and Curating Neuroanatomical Nomenclatures: Principles and Use of the Brain Architecture Knowledge Management System (BAMS), *Front Neuroinformatics*, 4, 3.

Brazas, M.D., *et al.* (2010) Providing web servers and training in Bioinformatics: 2010 update on the Bioinformatics Links Directory, *Nucleic Acids Res*, 38, W3-6.

Ceol, A., *et al.* (2010) MINT, the molecular interaction database: 2009 update, *Nucleic Acids Res*, 38, D532-539.

Cheung, K.H., *et al.* (2002) YMD: a microarray database for large-scale gene expression analysis, *Proc AMIA Symp*, 140-144.

Cheung, K.H., *et al.* (2005) YeastHub: a semantic web use case for integrating data in the life sciences domain, *Bioinformatics*, 21 Suppl 1, i85-96.

Cheung, K.H., *et al.* (2008) HCLS 2.0/3.0: health care and life sciences data mashup using Web 2.0/3.0, *J Biomed Inform*, 41, 694-705.

Chung, S.Y. and Wong, L. (1999) Kleisli: a new tool for data integration in biology, *Trends Biotechnol*, 17, 351-355.

Ciccarese, P., *et al.* (2008) The SWAN biomedical discourse ontology, *J Biomed Inform*, 41, 739-751.

Clark, T. (2007) Knowledge Integration in Biomedicine: Technology and Community, *Brief Bioinform*, 8, E1-E3.

Clark, T. and Kinoshita, J. (2007) Alzforum and SWAN: the present and future of scientific web communities, *Brief Bioinform*, 8, 163-171.

Crasto, C.J., *et al.* (2007) SenseLab: new developments in disseminating neuroscience information, *Brief Bioinform*, 8, 150-162.

Crasto, C.J. and Shepherd, G.M. (2007) Managing knowledge in neuroscience, *Methods Mol Biol*, 401, 3-21.

Croft, D., *et al.* (2011) Reactome: a database of reactions, pathways and biological processes, *Nucleic Acids Res*, 39, D691-697.

Davidsen, T., *et al.* (2010) The comprehensive microbial resource, *Nucleic Acids Res*, 38, D340-345.

Davidson, S.B., *et al.* (1995) Challenges in integrating biological data sources, *J Comput Biol*, 2, 557-572.

Davidson, S.B., *et al.* (2001) K2/Kleisli and GUS: Experiments in integrated access to genomic data sources, *IBM Syst J*, 40, 512-531.

Demir, E., *et al.* (2010) The BioPAX community standard for pathway data sharing, *Nat Biotechnol*, 28, 935-942.

Dibernardo, M., *et al.* (2008) Semi-automatic web service composition for the life sciences using the BioMoby semantic web framework, *J Biomed Inform*.

Dowell, R.D., *et al.* (2001) The distributed annotation system, *BMC Bioinformatics*, 2, 7.

Engel, S.R., *et al.* (2010) Saccharomyces Genome Database provides mutant phenotype data, *Nucleic Acids Res*, 38, D433-436.

Flicek, P., *et al.* (2011) Ensembl 2011, *Nucleic Acids Res*, 39, D800-806.

Gilbert, D.G. (2007) DroSpeGe: rapid access database for new Drosophila species genomes, *Nucleic Acids Res*, 35, D480-485.

Giles, J. (2005) Internet encyclopaedias go head to head, *Nature*, 438, 900-901.

Giles, J. (2007) Key biology databases go wiki, *Nature*, 445, 691.

Goble, C. and Stevens, R. (2008) State of the nation in data integration for bioinformatics, *J Biomed Inform*, 41, 687-693.

Good, B.M. and Wilkinson, M.D. (2006) The Life Sciences Semantic Web is full of creeps!, *Brief Bioinform*, 7, 275-286.

Greene, L.H., *et al.* (2007) The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution, *Nucleic Acids Res*, 35, D291-297.

Haas, L.M., *et al.* (2001) DiscoveryLink: A system for integrated access to life sciences data sources, *IBM Syst J*, 40, 489-511.

Haider, S., *et al.* (2009) BioMart Central Portal--unified access to biological data, *Nucleic Acids Res*, 37, W23-27.

Hansen, M., *et al.* (2003) Data integration using Web Services, *Lect Notes Comput Sc*, 2590, 165-182.

Harger, C., *et al.* (1998) The Genome Sequence DataBase (GSDB): improving data quality and data access, *Nucleic Acids Res*, 26, 21-26.

Hariharaputran, S., *et al.* (2007) VINEdb: a data warehouse for integration and interactive exploration of life science data, *Journal of Integrative Bioinformatics*, 4, 63.

Hekkelman, M.L. and Vriend, G. (2005) MRS: a fast and compact retrieval system for biological data, *Nucleic Acids Res*, 33, W766-769.

Hendler, J. (2003) Science and the semantic web, *Science*, 299, 520-521.

Hoehndorf, R., *et al.* (2009) BOWiki: an ontology-based wiki for annotation of data and integration of knowledge in biology, *BMC Bioinformatics*, 10 Suppl 5, S5.

Hoffmann, R. (2008) A wiki for the life sciences where authorship matters, *Nat Genet*, 40, 1047-1051.

Hull, D., *et al.* (2006) Taverna: a tool for building and running workflows of services, *Nucleic Acids Res*, 34, W729-732.

Marenco, L., *et al.* (2004) QIS: A framework for biomedical database federation, *J Am Med Inform Assoc*, 11, 523-534.

Matos, E.E., *et al.* (2010) CelOWS: an ontology based framework for the provision of semantic web services related to biological models, *J Biomed Inform*, 43, 125-136.

McLean, R., *et al.* (2007) The effect of Web 2.0 on the future of medical practice and education: Darwikinian evolution or folksonomic revolution?, *Medical Journal of Australia*, 187, 174-177.

Messina, D.N. and Sonnhammer, E.L. (2009) DASher: a stand-alone protein sequence client for DAS, the Distributed Annotation System, *Bioinformatics*, 25, 1333-1334.

Mons, B., *et al.* (2008) Calling on a million minds for community annotation in WikiProteins, *Genome Biol*, 9, R89.

Neerincx, P.B. and Leunissen, J.A. (2005) Evolution of web services in bioinformatics, *Brief Bioinform*, 6, 178-188.

Noy, N.F. (2004) Semantic integration: A survey of ontology-based approaches, *SIGMOD Record*, 33, 65-70.

Noy, N.F., *et al.* (2009) BioPortal: ontologies and integrated data resources at the click of a mouse, *Nucleic Acids Res*, 37, W170-173.

Oinn, T., *et al.* (2004) Taverna: a tool for the composition and enactment of bioinformatics workflows, *Bioinformatics*, 20, 3045-3054.

Olason, P.I. (2005) Integrating protein annotation resources through the Distributed Annotation System, *Nucleic Acids Res*, 33, W468-470.

Papazoglou, M.P., *et al.* (2008) Service-oriented computing: a research roadmap, *International Journal of Cooperative Information Systems*, 17, 223-255.

Parkinson, H., *et al.* (2011) ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments, *Nucleic Acids Res*, 39, D1002-1004.

Pettifer, S., *et al.* (2010) The EMBRACE web service collection, *Nucleic Acids Res*, 38, W683-688.

Potthast, M., *et al.* (2008) Automatic vandalism detection in Wikipedia, *Advances in Information Retrieval*, 4956, 663-668.

Pruitt, K.D. and Maglott, D.R. (2001) RefSeq and LocusLink: NCBI gene-centered resources, *Nucleic Acids Res*, 29, 137-140.

Pruitt, K.D., *et al.* (2009) NCBI Reference Sequences: current status, policy and new initiatives, *Nucleic Acids Res*, 37, D32-36.

Reese, M.G., *et al.* (2010) A standard variation file format for human genome sequences, *Genome Biol*, 11, R88.

Rose, P.W., *et al.* (2011) The RCSB Protein Data Bank: redesigned web site and web services, *Nucleic Acids Res*, 39, D392-401.

Rubin, D.L., *et al.* (2006) National Center for Biomedical Ontology: advancing biomedicine through structured organization of scientific knowledge, *OMICS*, 10, 185-198.

Rubin, D.L., *et al.* (2008) Biomedical ontologies: a functional perspective, *Brief Bioinform*, 9, 75-90.

Ruttenberg, A., *et al.* (2007) Advancing translational research with the Semantic Web, *BMC Bioinformatics*, 8 Suppl 3, S2.

Salwinski, L., *et al.* (2004) The Database of Interacting Proteins: 2004 update, *Nucleic Acids Res*, 32, D449-451.


Hunter, S.*, et al.* (2009) InterPro: the integrative protein signature database, *Nucleic Acids Res*,

Huss, J.W., 3rd*, et al.* (2010) The Gene Wiki: community intelligence applied to human gene

Huss, J.W., 3rd*, et al.* (2008) A gene wiki for community annotation of gene function, *PLoS* 

Jenkinson, A.M.*, et al.* (2008) Integrating biological data--the Distributed Annotation System,

Jones, P.*, et al.* (2008) PRIDE: new developments and new datasets, *Nucleic Acids Res*, 36,

Kanehisa, M.*, et al.* (2010) KEGG for representation and analysis of molecular networks

Karasavvas, K.A.*, et al.* (2004) Bioinformatics integration and agent technology, *J Biomed* 

Karp, P.D.*, et al.* (2005) Expansion of the BioCyc collection of pathway/genome databases to

Katayama, T.*, et al.* (2010) The DBCLS BioHackathon: standardization and interoperability

Kawas, E.*, et al.* (2006) BioMoby extensions to the Taverna workflow management and

Keseler, I.M.*, et al.* (2011) EcoCyc: a comprehensive database of Escherichia coli biology,

Keshava Prasad, T.S.*, et al.* (2009) Human Protein Reference Database--2009 update, *Nucleic* 

Kinoshita, J. and Clark, T. (2007) Alzforum, *Methods in molecular biology (Clifton, N.J*, 401, 365-

Lee, T.J.*, et al.* (2006) BioWarehouse: a bioinformatics database warehouse toolkit, *BMC* 

Lee, T.L. (2008) Big data: open-source format needed to aid wiki collaboration, *Nature*, 455,

Lein, E.S.*, et al.* (2007) Genome-wide atlas of gene expression in the adult mouse brain,

Li, H.*, et al.* (2009) The Sequence Alignment/Map format and SAMtools, *Bioinformatics*, 25,

Lord, P.*, et al.* (2004) Applying Semantic Web services to bioinformatics: Experiences gained,

Maglott, D.*, et al.* (2011) Entrez Gene: gene-centered information at NCBI, *Nucleic Acids Res*,

Maojo, V.*, et al.* (2011) Biomedical Ontologies: Toward Scientific Debate, *Methods Inf Med*, 50,

lessons learnt, *Semantic Web - Iswc 2004, Proceedings*, 3298, 350-364.

Mardis, E.R. (2010) The \$1,000 genome, the \$100,000 analysis?, *Genome Med*, 2, 84.

Letovsky, S.I.*, et al.* (1998) GDB: the Human Genome Database, *Nucleic Acids Res*, 26, 94-99. Li, A. (2006) Facing the challenges of data integreation in biosciences, *Engineering Letters*, 13,

for bioinformatics web services and workflows. The DBCLS BioHackathon

involving diseases and drugs, *Nucleic Acids Res*, 38, D355-360.

37, D211-215.

*biology*, 6, e175.

*Inform*, 37, 205-219.

D878-883.

annotation, *Nucleic Acids Res*, 38, D633-639.

160 genomes, *Nucleic Acids Res*, 33, 6083-6089.

enactment software, *BMC bioinformatics*, 7, 523.

Consortium\*, *J Biomed Semantics*, 1, 8.

*Nucleic Acids Res*, 39, D583-590.

*Acids Res*, 37, D767-772.

*bioinformatics*, 7, 170.

*Nature*, 445, 168-176.

EL\_13\_13\_13.

2078-2079.

39, D52-57.

[Epub ahead of print].

381.

461.

*BMC Bioinformatics*, 9 Suppl 8, S3.




### **Semantic Data Integration on Biomedical Data Using Semantic Web Technologies**

Roland Kienast and Christian Baumgartner *Institute of Electrical, Electronic and Bioengineering University for Health Sciences, Medical Informatics and Technology Austria*

#### **1. Introduction**

56 Bioinformatics – Trends and Methodologies

Sarkar, I.N.*, et al.* (2008) Automated simultaneous analysis phylogenetics (ASAP): an

Sayers, E.W.*, et al.* (2011) Database resources of the National Center for Biotechnology

Schadt, E.E.*, et al.* (2010) Computational solutions to large-scale data management and

Seal, R.L.*, et al.* (2011) genenames.org: the HGNC resources in 2011, *Nucleic Acids Res*, 39,

Shah, S.P.*, et al.* (2005) Atlas - a data warehouse for integrative bioinformatics, *BMC* 

Shi, X. (2007) Semantic Web Services: An Unfulfilled Promise, *IEEE IT Professional*, 9, 42-45. Sigrist, C.J.*, et al.* (2010) PROSITE, a protein domain database for functional characterization

Smith, B.*, et al.* (2007) The OBO Foundry: coordinated evolution of ontologies to support

Stehr, H.*, et al.* (2010) PDBWiki: added value through community annotation of the Protein

Stein, L.D. (2010) The case for cloud computing in genome informatics, *Genome Biol*, 11, 207. Stevens, R.*, et al.* (2000) TAMBIS: transparent access to multiple bioinformatics information

Stevens, R.D.*, et al.* (2003) myGrid: personalised bioinformatics on the information grid,

The UniProt Consortium (2011) Ongoing and future developments at the Universal Protein

Trissl, S.*, et al.* (2005) Columba: an integrated database of proteins, structures, and

Vandervalk, B.P.*, et al.* (2009) Moby and Moby 2: creatures of the deep (web), *Brief Bioinform*,

Wilkinson, M.D. and Links, M. (2002) BioMOBY: an open source biological web services

Wilkinson, M.D.*, et al.* (2010) SADI, SHARE, and the in silico scientific method, *BMC* 

Wilkinson, M.D.*, et al.* (2008) Interoperability with Moby 1.0--it's better than sharing your

Zdobnov, E.M.*, et al.* (2002) The EBI SRS server--recent developments, *Bioinformatics*, 18, 368-

Zhang, Z. and Townsend, J.P. (2010) The filamentous fungal gene expression database

Zhao, J.*, et al.* (2009) Linked data and provenance in biological data webs, *Brief Bioinform*, 10,

Yager, K. (2006) Wiki ware could harness the Internet for science, *Nature*, 440, 278.

Zhang, Z.*, et al.* (2009) Bringing Web 2.0 to bioinformatics, *Brief Bioinform*, 10, 1-10.

Salzberg, S.L. (2007) Genome re-annotation: a wiki solution?, *Genome biology*, 8, 102.

enabling tool for phlyogenomics, *BMC bioinformatics*, 9, 103.

Information, *Nucleic Acids Res*, 39, D38-51.

and annotation, *Nucleic Acids Res*, 38, D161-166.

Data Bank, *Database (Oxford)*, 2010, baq009.

sources, *Bioinformatics*, 16, 184-185.

Resource, *Nucleic Acids Res*, 39, D214-219.

Waldrop, M. (2008) Big data: Wikiomics, *Nature*, 455, 22-25.

(FFGED), *Fungal Genet Biol*, 47, 199-204.

*Bioinformatics*, 11 Suppl 12, S7.

proposal, *Briefings in bioinformatics*, 3, 331-341.

toothbrush!, *Briefings in bioinformatics*, 9, 220-231.

annotations, *BMC Bioinformatics*, 6, 81.

biomedical data integration, *Nat Biotechnol*, 25, 1251-1255.

Stein, L. (2002) Creating a bioinformatics nation, *Nature*, 417, 119-120. Stein, L.D. (2003) Integrating biological databases, *Nat Rev Genet*, 4, 337-345.

*Bioinformatics (Oxford, England)*, 19 Suppl 1, i302-304.

analysis, *Nat Rev Genet*, 11, 647-657.

D514-519.

10, 114-128.

373.

139-152.

*Bioinformatics*, 6, 34.

Contemporary life sciences research requires an understanding of systems across wide ranges of scale and distribution. Therefore, there is an urgent need to integrate biomedical knowledge generated by different communities and separate subfields (Shadbolt et al., 2006). Scientific publications and curated databases together hold a vast amount of this usable knowledge. Additionally, the number, size and complexity of life science databases continue to grow (Kei-Hoi et al., 2009). Scientists in the fields of genomics, proteomics, metabolomics, clinical medicine and drug discovery therefore need a concept to integrate their data (Shadbolt et al., 2006), which remains a prominent problem (Kei-Hoi et al., 2009). To arrive at such a uniform data integration concept, several challenges still have to be overcome: handling the variety and amount of available data, the heterogeneity and inconsistency of data from different sources, the autonomy and differing capabilities of the sources, and a lack of standards for such an integration concept. Many heterogeneity conflicts remain in data integration due to the lack of semantics (Gagnon, 2007). In order to efficiently exploit the knowledge from different resources, it will be important to connect the sources in a manner that machine processes can traverse and intelligently identify these links (Neumann et al., 2004). A promising approach to integrating heterogeneous data sources could be the use of *Semantic Web technologies*. They provide a framework to deal with the aforementioned problems and fulfil the requirements for machine processing.

This book chapter provides an overview of data integration on biomedical data using Semantic Web technologies including existing techniques (standards, specifications and methods), challenges, approaches and projects.

#### **2. Basics of data integration**

Data integration is the task of "combining the data residing at different sources, and providing the user with a unified view of the data" (Calì et al., 2001; 2003). But to accomplish the task of combining different heterogeneous sources there are some challenges to be overcome.

#### **2.1 Challenges in integrating information from heterogeneous data sources**

In the dictionary<sup>1</sup> heterogeneity is defined as *"the quality of being diverse and not comparable in kind"*. In computer science this inability to compare can be divided into four different classes (Ouksel & Sheth, 1999):

<sup>1</sup> Webster's Online Dictionary http://www.websters-online-dictionary.org

- **System heterogeneity** is a result of different hardware platforms and operating systems.
- **Syntactic heterogeneity** is a difference in data representation formats.
- **Structural heterogeneity** arises from different data models or structures in the various data sources.
- **Semantic heterogeneity** results from differences in the interpretation of the meaning of different resources.

This heterogeneity leads to some challenges in integrating information from multiple data sources. Some general problems are (Cheung et al., 2007):

- **Locating Resources:** To be able to integrate data it is important to find relevant and interoperable data sources. To find such sources it is beneficial to have a widely accepted standard for describing the content of data.
- **Different data formats:** Different resources often provide heterogeneous data formats. For example:
	- *structured data:* e.g. different databases
	- *semi-structured data:* e.g. HTML, XML data
	- *unstructured data:* e.g. text documents, images
- **Identify Synonyms and Homonyms:** Before large-scale databases were created, researchers independently named biological entities. As a consequence many synonyms exist. The ability to distinguish between synonyms and homonyms is very important for data integration.
- **Detect Ambiguity:** The same term can be used to represent different concepts. For example the term *insulin* can represent the concept *hormone* or *drug*.
- **Recognize Granularity:** Different biological data sources may provide knowledge at different levels of granularity. For example one source provides information about different genetic diseases and their symptoms. Another source might only contain detailed information about haemophilia<sup>2</sup>.
- **Scaling conflicts:** These conflicts occur when different reference systems are used to measure a value, e.g. different date formats or size measures.

<sup>2</sup> Haemophilia is a genetic disease which interferes with blood clotting.

#### **2.2 Different integration approaches**

There are different approaches to integrating data sources: *warehousing*, *mediation* or a combination of both.

**Warehouse integration** consists in cataloguing the data from multiple sources into a local database called the *warehouse*. All queries are executed on the data contained in the warehouse (Hernandez & Kambhampati, 2004; Kugler et al., 2008; Pfeifer et al., 2007). The task of importing data from a source into the warehouse is called the *ETL (Extract - Transform - Load)* process.

- *Advantages:* Warehousing eliminates various problems such as network bottlenecks, low response times and temporarily unavailable sources. It allows the data obtained from the sources to be filtered, validated, modified and annotated (Davidson et al., 1995).
- *Disadvantages:* It is necessary to build and maintain the warehouse and there is a danger of outdated data. Therefore the warehouse system must regularly check the underlying sources for new or updated data and modify the local copy of the data if required (Davidson et al., 1995).

**Mediator based integration** concentrates on query translation. A mediator is a system which translates queries from a single mediated schema to the local schema of the underlying data source (Hernandez & Kambhampati, 2004). The data flow between mediators and data sources is provided by software components called *wrappers*. Unlike warehousing, data is not centrally stored but is accessed directly from the distributed sources.

- *Advantages:* The data is always up to date and there is no need to maintain a storage system.
- *Disadvantages:* Mediator based integration is sensitive to network bottlenecks, low response times and temporarily unavailable sources.

Another possibility is using Semantic Web technologies. The goal of the Semantic Web approach to data integration is to add machine readable metadata to resources and to define and describe relations among them. This makes it easier to automatically process and integrate information available within the different resources (W3C, 2004a) (see figure 1).

Fig. 1. The goal of data integration using Semantic Web technologies. *Right:* The user must consult several resources individually through different user interfaces to derive a result. *Left:* Semantic Web technology allows the integration of various heterogeneous resources. The system can process the data and provide the results to the user.
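The difference between warehousing and mediation described in section 2.2 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two in-memory "sources", their field names, the wrapper functions and the `Warehouse` and `Mediator` classes are hypothetical and do not correspond to any particular integration system. The sketch only shows that a warehouse answers from a local copy, which can become outdated, while a mediator queries the sources through wrappers at request time.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]

# Two hypothetical sources with different local schemas (illustrative data only).
SOURCE_A: List[Record] = [{"gene": "F8", "disease": "haemophilia A"}]
SOURCE_B: List[Record] = [{"symbol": "F9", "disorder": "haemophilia B"}]


def wrapper_a() -> Iterable[Record]:
    """Wrapper: translate source A rows into the mediated schema (gene, disease)."""
    return [{"gene": r["gene"], "disease": r["disease"]} for r in SOURCE_A]


def wrapper_b() -> Iterable[Record]:
    """Wrapper: translate source B rows into the mediated schema (gene, disease)."""
    return [{"gene": r["symbol"], "disease": r["disorder"]} for r in SOURCE_B]


WRAPPERS: List[Callable[[], Iterable[Record]]] = [wrapper_a, wrapper_b]


class Warehouse:
    """Answers queries from a local copy that is filled by an ETL step."""

    def __init__(self) -> None:
        self.rows: List[Record] = []

    def etl(self) -> None:
        # Extract and transform via the wrappers, then load into the local copy.
        self.rows = [row for wrap in WRAPPERS for row in wrap()]

    def genes_for(self, keyword: str) -> List[str]:
        return sorted(r["gene"] for r in self.rows if keyword in r["disease"])


class Mediator:
    """Stores nothing; every query is forwarded to the sources via the wrappers."""

    def genes_for(self, keyword: str) -> List[str]:
        return sorted(
            r["gene"] for wrap in WRAPPERS for r in wrap() if keyword in r["disease"]
        )


warehouse = Warehouse()
warehouse.etl()
mediator = Mediator()

# A source is updated after the warehouse was loaded.
SOURCE_A.append({"gene": "F11", "disease": "haemophilia C"})

stale = warehouse.genes_for("haemophilia")      # local copy misses the new row
live = mediator.genes_for("haemophilia")        # sees the new row immediately
warehouse.etl()                                 # periodic refresh of the copy
refreshed = warehouse.genes_for("haemophilia")  # now agrees with the mediator
```

In this toy setting the trade-off is visible directly: the warehouse result is outdated until the ETL step is repeated, whereas the mediator depends on every source being reachable at query time.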

#### **3. Semantic Web in a nutshell**

Tim Berners-Lee, the director of the World Wide Web Consortium (W3C), coined the term Semantic Web (Berners-Lee et al., 2001). The term is mainly used to describe the model and technologies provided by the W3C, which is the main international standards organization for the World Wide Web. The aim of the Semantic Web is to add structured meta-information to existing documents and data in order to give them a well-defined semantic meaning. This enables machines to process semantic information but *"not human speech and writings"* (Berners-Lee et al., 2001). This semantic extension makes it easier for machines to automatically process and integrate information available on the Web (W3C, 2004a).

The basic idea behind the Semantic Web is to add machine readable metadata<sup>3</sup> to resources within the World Wide Web to define and describe relations among them. Semantic Web technologies are able to assimilate this gained information. Furthermore, they do not build a separate web, but function as an extension of the current web. The Semantic Web technology consists of a hierarchical use of various standards and technologies in which each layer uses the capabilities of the layers below. The architecture of the Semantic Web is illustrated in figure 2. A brief description of each layer is summarized below:

- **Character Set: UNICODE** defines a fundamental coding standard for data.
- **Identifiers: URI** is a standard for the identification of resources.
- **Syntax: XML** provides a fundamental syntax for structured documents.
- **Data interchange: RDF** is a data model for resources and relations between them. It uses the XML syntax.
- **Taxonomies: RDFS** is an extension of RDF and provides a vocabulary for describing RDF resources.
- **Ontologies: OWL** offers more opportunities to add semantic information to resources than RDFS.
- **Query: SPARQL** is a protocol and a query language for RDF.
- **Rules: RIF** defines the rules of semantic data.
- **Unifying Logic** allows conclusions to be drawn.
- **Proof** attempts to verify the conclusions.
- **Trust** provides trusted principles and authentication methods between different agents.

[Figure 2 layer labels: User interface and applications; Trust; Proof; Unifying Logic; Ontology: OWL; Rules: RIF; Query: SPARQL; Taxonomies: RDFS; Data interchange: RDF; Syntax: XML; Identifiers: URI; Character set: UNICODE; Cryptography]

Fig. 2. Semantic Web stack

<sup>3</sup> According to the Dictionary of Computing (http://dictionary.reference.com/browse/meta-data) metadata is *"definitional data that provides information about or documentation of other data managed within an application or environment."* In relation to the Semantic Web Tim Berners-Lee defines metadata as (Berners-Lee, 1997): *"machine understandable information about web resources or other thing"*. In short, metadata is data about data.

#### **4. Semantic Web approach to data integration**

The W3C defines the abilities of the Semantic Web as follows (W3C, 2011):

*"The Semantic Web is about two things. It is about common formats for integration and combination of data drawn from diverse sources, where on the original Web mainly concentrated on the interchange of documents. It is also about language for recording how the data relates to real world objects. That allows a person, or a machine, to start off in one database, and then move through an unending set of databases which are connected not by wires but by being about the same thing."*

The Semantic Web approach to data integration can deal with heterogeneity by providing structured meta-information to existing documents and data. A key feature for integrating information is the use of semantics, which gives meaning to a word or concept (Gardner, 2005). Semantics can solve the problem of homonyms and synonyms between different sources because it can establish the equivalence of two concepts which have different names and forms (synonyms) or the dissimilarity of two concepts which have the same name and form (homonyms). Semantics also describe relationships between concepts. This enables a fully descriptive representation of the available information, shows the interaction between concepts and allows inferences. Semantic Web technologies provide a tool to describe such semantics: *ontologies*. In order to make beneficial use of ontologies, it is important to link the data to its semantic knowledge. In other words, it is important to annotate instances with ontology terms. But these data often have different formats (relational databases, text files, web sites, etc.). Adding metadata can solve this problem. To benefit from this metadata, it should be standardized and machine readable. The kind of metadata provided by Semantic Web technology is based on the Extensible Markup Language (XML).
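How semantics separates synonyms from homonyms can be shown with a small Python sketch. It is only an illustration under stated assumptions: the source names, the local terms and the concept identifiers in `CONCEPT_OF` are hypothetical, and a real system would take such mappings from ontology annotations rather than from a hand-written dictionary. The idea is that terms are compared by the concept they are annotated with, not by their spelling.

```python
from typing import Dict, Tuple

Term = Tuple[str, str]  # (source name, local term)

# Hypothetical annotations: each local term points to a shared concept identifier.
CONCEPT_OF: Dict[Term, str] = {
    ("source_a", "insulin"): "concept:insulin_hormone",
    ("source_b", "INS"): "concept:insulin_hormone",   # different name, same concept
    ("source_b", "insulin"): "concept:insulin_drug",  # same name, different concept
}


def same_concept(a: Term, b: Term) -> bool:
    """Two terms are equivalent if they are annotated with the same concept."""
    return CONCEPT_OF[a] == CONCEPT_OF[b]


def are_synonyms(a: Term, b: Term) -> bool:
    """Different spelling, same concept."""
    return a[1] != b[1] and same_concept(a, b)


def are_homonyms(a: Term, b: Term) -> bool:
    """Same spelling, different concept."""
    return a[1] == b[1] and not same_concept(a, b)


synonym_pair = are_synonyms(("source_a", "insulin"), ("source_b", "INS"))
homonym_pair = are_homonyms(("source_a", "insulin"), ("source_b", "insulin"))
```

Comparing spellings alone would get both cases wrong: it would treat the two *insulin* entries as identical and would miss the link between *insulin* and *INS*.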

#### **4.1 Important technologies for data integration in greater detail**

This section describes the most important technologies which are needed for a semantic data integration based on Semantic Web technologies.

#### **4.1.1 URI (Uniform Resource Identifiers)**

A URI is defined in RFC 3986 (Berners-Lee et al., 2005): *"A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource."* On the Web, URIs typically refer to websites or other data. In general, however, URIs can be used to generate unique identifiers for different resources. For example, the namespaces of an XML (Extensible Markup Language) document are identified by URI references. In RDF (Resource Description Framework), URIs are likewise used to refer to resources (Hitzler et al., 2008).
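As a small illustration of the generic URI structure, the following Python sketch splits two URIs into their components using the standard library function `urllib.parse.urlsplit`. The first URI is the identifier of the RDF property `type` in the RDF namespace; the second is a URN that names a resource without locating it and is used here purely as an example.

```python
from urllib.parse import urlsplit

# A URI that identifies a term inside the RDF namespace.
rdf_type = urlsplit("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")

# A URN: an identifier that names a resource but is not a web address.
book = urlsplit("urn:isbn:0451450523")

rdf_parts = (rdf_type.scheme, rdf_type.netloc, rdf_type.path, rdf_type.fragment)
book_parts = (book.scheme, book.netloc, book.path)
```

In the first case the fragment (`type`) selects one term within the namespace document, which is how many RDF vocabularies form identifiers for their terms.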

#### **4.1.2 XML (eXtensible Markup Language)**

XML is a machine readable, standardized meta-language. It is an important basic technology for the Semantic Web (W3C, 2001) with which it is possible to create structured documents. These documents are text based and provide their data in a hierarchical and logically structured form which can be read by humans and by machines. XML is a markup-based language and uses tags for this purpose. In computer science, markup languages are used to extend parts of a document with additional information that describes them in more detail. This additional information is also called *metadata*.
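The following sketch shows what such a tagged, hierarchical document looks like and how a program reads it, using Python's standard `xml.etree.ElementTree` module. The document itself, including the element names and the namespace URI `http://example.org/biomed`, is invented for illustration and does not follow any published schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the namespace is identified by a URI reference.
DOC = """<disease xmlns="http://example.org/biomed" id="d1">
  <name>haemophilia</name>
  <symptom>prolonged bleeding</symptom>
</disease>"""

NS = {"b": "http://example.org/biomed"}

root = ET.fromstring(DOC)

# Tags and attributes are the markup (metadata); the text nodes are the data.
root_tag = root.tag                                  # namespace-qualified tag name
child_tags = [child.tag.split("}")[1] for child in root]
disease_id = root.get("id")
disease_name = root.find("b:name", NS).text
```

A parser can recover the structure reliably, but the tag name `symptom` is just a string to it. What the tag means still has to be agreed on outside the document.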

Problems with XML and data integration:

XML is standardized, machine readable and defines the syntactical structure of a document. But from the point of view of the Semantic Web, XML tags are not much better than natural language (Hitzler et al., 2008). These tags can be ambiguous, their relationship is not clearly defined and they provide no meaning for machines.
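The ambiguity of tags can be illustrated with a small sketch. The document below is hypothetical: it uses the same tag name for a biological cell and for a spreadsheet cell. A standard XML parser reads both elements without complaint, because it only checks the syntax and has no notion of what `cell` means.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag <cell> is used with two unrelated meanings.
document = """<records>
  <cell>muscle cell</cell>
  <cell>B2</cell>
</records>"""

root = ET.fromstring(document)
tags = [child.tag for child in root]
texts = [child.text for child in root]

# The parser reports two elements with an identical tag; nothing in the
# syntax tells a machine that they describe different kinds of things.
for tag, text in zip(tags, texts):
    print(f"<{tag}> -> {text}")
```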

#### **4.1.3 RDF (Resource Description Framework)**

Originally RDF was designed for adding metadata to web resources, but it has become a framework for adding semantic information to resources in general. RDF is machine readable. It therefore enables the encoding, exchange and reuse of structured metadata and allows structured and semi-structured data to be mixed, exposed and shared across different applications (W3C, 2010a), which can make use of the semantic information (Fensel, 2004).

RDF provides a simple data model for describing relationships between resources in terms of named properties and their values. While XML can only describe documents in a tree structure, RDF is a framework for representing information about resources in the form of a directed graph. An edge of this graph describes the relationship between two resources. RDF documents can be written in Notation 3 (N3) (W3C, 2005), N-Triples (W3C, 2004d) or Turtle (W3C, 2008c) syntax, or in an XML syntax called RDF/XML (W3C, 2004f). Because XML can only describe a tree structure whereas RDF represents a graph, it is necessary to *serialize* the graph into strings. For this purpose RDF uses so-called *"triples"* (3-tuples), each of which describes one relationship between resources. An RDF triple consists of only three elements (W3C, 2004g):

1. *The subject:* an RDF URI reference or a blank node.
2. *The predicate:* an RDF URI reference.
3. *The object:* an RDF URI reference, a blank node or a literal.


A triple is conventionally written in the order subject, predicate, object and can be illustrated by a node and directed arc diagram (see figure 3). A set of these triples forms a directed graph.

Fig. 3. Illustration of a triple

A problem in RDF is that URI references cannot provide a conclusive semantic interpretation of RDF coded information (Hitzler et al., 2008), because a URI can also be a homonym or synonym of another URI. This principle is also known as the *Non Unique Name Assumption*. One solution to this problem is to use thematic vocabularies such as the FOAF (Friend of a Friend) vocabulary, which can be used for linking people and information about them (Brickley & Miller, 2010).
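How a set of triples forms a directed graph, and how that graph is serialized line by line, can be sketched in plain Python without any RDF library. The namespace `http://example.com/` and the `partOf` statement are assumptions made for illustration. The rule used to tell URIs from literals is a deliberate simplification, and the output only imitates the N-Triples layout.

```python
from collections import defaultdict

EX = "http://example.com/"  # assumed example namespace

# Each triple is (subject, predicate, object).
triples = [
    (EX + "ATP", EX + "hasLongName", "Adenosine Tri-Phosphate"),  # literal object
    (EX + "ATP", EX + "producedIn", EX + "mitochondrion"),
    (EX + "mitochondrion", EX + "partOf", EX + "cell"),  # hypothetical statement
]


def build_graph(triples):
    """Store every triple as a labelled edge: subject -> (predicate, object)."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def to_ntriples(triples):
    """Write one triple per line in an N-Triples-like layout."""
    def term(value):
        # Simplification: values in the example namespace are URIs, the rest literals.
        return f"<{value}>" if value.startswith(EX) else f'"{value}"'
    return [f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples]


graph = build_graph(triples)
lines = to_ntriples(triples)
print("\n".join(lines))
```

The subject `ATP` has two outgoing edges, which a tree-shaped XML document could only express by nesting or repetition, whereas the flat list of triples describes the graph directly.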

#### **4.1.4 Ontologies to share semantic information**

Gruber (1993) defines an ontology as follows: *"An ontology is an explicit specification of a conceptualization."* This definition was slightly modified by Studer et al. (1998): *"An ontology is a formal, explicit specification of a shared conceptualization."*

A *conceptualization* refers to an abstract model of a phenomenon in the world which identifies the relevant concepts of that phenomenon. *Explicit* means that the types of concepts used and the constraints on their use are explicitly defined. *Formal* refers to the fact that an ontology should be machine readable. *Shared* means that an ontology should capture consensual knowledge: this knowledge is not private to an individual but accepted by a group (Fensel, 2004; Studer et al., 1998).

This abstract definition becomes clearer with a simple example: a small excerpt of an ontology of animals (see figure 4).

Fig. 4. A brief abstract of the ontology of animals.

The abstract model includes the terms *animal*, *fish*, *mammal* and *puma*. These terms come from the "phenomenon" of animals. Every term is explicit. The term puma is explicitly defined as an animal, so it cannot be confused with the clothing brand Puma. The term puma also has clear constraints: a puma is an organism which is a mammal and not a fish, and it belongs to the animals. The ontology is represented as a directed graph, which is formal and machine readable. The ontology is also shared, because the knowledge can be inferred by more than one individual and is accepted by a group of biologists.

The structure of an ontology is a directed acyclic graph. This makes it possible to support complex relationships in which terms have more than one parent. For example, the Gene Ontology<sup>4</sup> term *GO:0070229 : negative regulation of lymphocyte apoptosis* is a subclass of both *GO:2000107 : negative regulation of leukocyte apoptosis* and *GO:0070228 : regulation of lymphocyte apoptosis*. Ontologies are able to describe the semantics of information sources in order to make their content explicit. A basic building block of ontologies is the so-called *"triple"*. Broadly defined, a triple contains two terms and a relation between them<sup>5</sup>. With these elements an ontology can be represented as a directed graph: the terms are the nodes and the relations are the edges of the graph (Smith et al., 2005).
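A directed acyclic graph with multiple parents can be sketched as a simple mapping from each term to its direct parents. The sketch below contains only the edges mentioned in the text (the animal example and the three Gene Ontology terms). The function name `ancestors` is chosen for illustration.

```python
# Direct "is a" edges: term -> list of parent terms.
parents = {
    "puma": ["mammal"],
    "mammal": ["animal"],
    "fish": ["animal"],
    "animal": [],
    # A term with two parents, as in the Gene Ontology example.
    "GO:0070229": ["GO:2000107", "GO:0070228"],
    "GO:2000107": [],
    "GO:0070228": [],
}


def ancestors(term, parents):
    """Collect every term reachable by following parent edges upwards."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("puma", parents)))
print(sorted(ancestors("GO:0070229", parents)))
```

Because the graph is acyclic, the traversal always terminates. The `seen` set merely avoids visiting a shared ancestor twice when several paths lead to it.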

#### **4.1.5 RDFS (RDF Schema)**


Like XML, RDF only provides a syntax for exchanging data. RDF properties can be considered as attributes of resources and also represent relationships between them, but RDF provides no mechanism for defining a vocabulary to describe these attributes or relationships. RDFS, also called the *RDF Vocabulary Description Language*, extends RDF to describe such vocabularies (W3C, 2004c;e) and adds terminological knowledge (schema knowledge) to them. For that reason it can be seen as a semantic extension of RDF. RDF Schema vocabulary descriptions are written in RDF syntax (W3C, 2004e). An RDFS document makes statements about the semantic relationships between terms within an arbitrarily defined vocabulary. This ability to define terminological knowledge allows RDFS to create *"light-weight"* ontologies (Hitzler et al., 2008; Volz et al., 2003) that describe semantic dependences within a domain.

Figure 5 shows a simple RDFS document in graph representation. RDFS organizes RDF statements hierarchically into classes (terminological knowledge) and instances (assertional knowledge). Properties are used to describe relationships between classes. The terminological part includes the ontology while the assertional part presents conclusions about concrete qualities of the subject. This ontology describes, for example, that the class cell is a subclass of the class organ and that every cell consumes energy. Further, it is possible to derive implicit knowledge. If the muscle cell is an instance of the class cell and ATP (Adenosine Tri-Phosphate) is an instance of high-energy chemical bond, then it is possible to infer that a muscle cell is part of a human and that ATP is a kind of energy.

<sup>4</sup> see section 5.3.2

<sup>5</sup> see section 4.1.3 for a detailed description

Fig. 5. Simple RDFS-Ontology in graph representation
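The derivation of implicit knowledge described above can be imitated with two RDFS-style rules: `rdfs:subClassOf` is transitive, and an instance of a class is also an instance of every superclass. The sketch below is a naive fixpoint computation, not a full RDFS reasoner. The edges from organ to human and from high-energy chemical bond to energy are assumptions about what Figure 5 depicts, and the identifier spellings are chosen for illustration.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

facts = {
    # Terminological knowledge (classes); two edges assumed from Figure 5.
    ("ex:Cell", SUBCLASS, "ex:Organ"),
    ("ex:Organ", SUBCLASS, "ex:Human"),  # assumption
    ("ex:high-energy_chemical_bond", SUBCLASS, "ex:Energy"),  # assumption
    # Assertional knowledge (instances).
    ("ex:muscle_cell", TYPE, "ex:Cell"),
    ("ex:ATP", TYPE, "ex:high-energy_chemical_bond"),
    ("ex:muscle_cell", "ex:consumes_energy", "ex:ATP"),
}


def rdfs_closure(facts):
    """Apply subclass transitivity and type propagation until nothing new appears."""
    closure = set(facts)
    while True:
        derived = set()
        for s, p, o in closure:
            if p not in (SUBCLASS, TYPE):
                continue
            for s2, p2, o2 in closure:
                # (s p o) and (o subClassOf o2) together imply (s p o2).
                if p2 == SUBCLASS and s2 == o:
                    derived.add((s, p, o2))
        if derived <= closure:
            return closure
        closure |= derived


closure = rdfs_closure(facts)
for triple in sorted(closure - facts):
    print(triple)
```

The triples printed at the end are exactly those that were not stated but follow from the schema, such as the muscle cell belonging to the class at the top of its hierarchy and ATP being a kind of energy.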

#### **4.1.6 OWL (Web Ontology Language)**

OWL is designed to enable machine processing of information content. OWL can explicitly represent the meaning of terms in vocabularies and their relationships with each other in order to build an ontology. Since October 2009, OWL 2 has been a W3C recommendation (W3C, 2009a). Compared to RDFS, OWL offers more facilities for expressing meaning and semantics. Therefore OWL can be seen as an extension of RDFS (W3C, 2004b). An OWL ontology is an RDF graph, which in turn is a set of triples. It can be written in different syntactic forms, but the most common syntax for representing these triples is RDF/XML.

OWL provides three increasingly expressive sub-languages (Alesso & Smith, 2006):

1. **OWL Lite** to generate a classification hierarchy and simple constraints. -> *Easily implementable*
2. **OWL DL** (description logic) supports maximum expressiveness while retaining computational completeness and decidability. -> *Mechanizable logic*
3. **OWL Full** provides maximum expressiveness and the syntactic freedom of RDF but with no computational guarantees. -> *Complete logic*

Since OWL is an extension of RDFS and therefore also of RDF, any RDF document will generally be in OWL Full. OWL DL and OWL Lite also extend the RDF vocabulary, but they put restrictions on the use of this vocabulary (W3C, 2004a;b) for better machine processing. These restrictions guarantee computational completeness and decidability of reasoning systems such as FaCT++ (Tsarkov & Horrocks, 2006) and Pellet (Sirin et al., 2007), which are able to reason over OWL 2 ontologies (Grau et al., 2008). This is possible because OWL Lite and OWL DL are basically very expressive description logics (DL): OWL DL is based on the description logic SHOIN(*D*) (Hitzler et al., 2008) and OWL Lite on the slightly simpler SHIF(*D*).

Description Logics (DL) stem from semantic networks (Donini et al., 1996). They model *concepts* (equal to a class in OWL), *roles* (equal to a property in OWL) and *individuals* (equal to an object in OWL), and their *relationships*. Therefore they can be used to represent the knowledge of a specific domain in a formal and structured way. Here the connection to ontologies is clearly visible. As described in 4.1.4, an ontology consists of axioms, which are used to provide information about classes and properties of a specific domain. The knowledge which is provided by DL is divided into a *TBox* and an *ABox* (Donini et al., 1996). The TBox (terminological box) contains sentences describing concept hierarchies and the ABox (assertional box) contains sentences about the individuals and where they are in the hierarchy (Van Harmelen et al., 2008). For example, the statement *"Every protein is made of amino acids"* belongs to the TBox, while the statement *"Leucine is an amino acid"* belongs to the ABox.

The drawing of logical conclusions in OWL is based on the so-called *Open World Assumption (OWA)*. In contrast to the *Closed World Assumption (CWA)*, this assumption specifies that a statement is considered neither true nor false if it cannot be derived from a set of facts by inference rules. Under the OWA, an answer is not assumed to be false unless it can be proven to be false (Pollock, 2009). Listing 1 shows an example of both assumptions.

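Listing 1. Example for the open- and closed world assumption

Knowledge Base: The protein p53 is involved in apoptosis.

Query: Is the protein p53 involved in cell repair?

Answer: CWA: No. OWA: Maybe or unknown.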
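The difference between the two assumptions can be expressed in a few lines. The sketch below uses the p53 statement from the example as the only known fact. The function names and the optional set of explicitly negated statements are assumptions made for illustration. Under the CWA a missing statement counts as false, whereas under the OWA it yields `None`, standing for "unknown".

```python
# The only fact in the knowledge base, written as a triple.
knowledge_base = {("p53", "involved_in", "apoptosis")}


def ask_cwa(kb, statement):
    """Closed world: whatever is not known to be true is false."""
    return statement in kb


def ask_owa(kb, statement, known_false=frozenset()):
    """Open world: True or False only when provable, otherwise None (unknown)."""
    if statement in kb:
        return True
    if statement in known_false:
        return False
    return None


query = ("p53", "involved_in", "cell repair")
print("CWA:", ask_cwa(knowledge_base, query))
print("OWA:", ask_owa(knowledge_base, query))
```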


#### **4.1.7 SPARQL (SPARQL Protocol and RDF Query Language)**

SPARQL is a protocol and query language for RDF which has been an official W3C recommendation since January 2008 (W3C, 2008a). SPARQL queries often contain a set of triple patterns. A set of such patterns is called a basic graph pattern; each pattern looks like an RDF triple. The difference is that each subject, predicate or object can be a variable. A match is found by substituting RDF terms for the variables: if the result of the substitution is equivalent to a subgraph of the RDF data, a match is found. For example, to find the meaning of the acronym *ATP* and where it is produced, the SPARQL query would look like Listing 2.

Listing 2. Simple SPARQL query

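```
PREFIX ex: <http://example.com/>
# ?name and ?part are the two variables returned in the result
SELECT ?name ?part
WHERE
{ ex:ATP ex:hasLongName ?name .   # long name behind the acronym
  ex:ATP ex:producedIn ?part }    # where ATP is produced
```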
The SELECT clause defines the variables which appear in the result and the WHERE clause provides the basic graph pattern. In this case the graph pattern consists of two triple patterns with two single variables.
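The substitution of variables described above can be sketched as a small matcher written in plain Python. It is not a SPARQL engine: it only handles basic graph patterns over an in-memory list of triples. The two data triples mirror the statements about ATP used as the knowledge base in this section. Writing both patterns with `ex:ATP` as subject is an assumption about the intended query, and the function names are hypothetical.

```python
data = [
    ("ex:ATP", "ex:hasLongName", "Adenosine Tri-Phosphate"),
    ("ex:ATP", "ex:producedIn", "ex:mitochondrion"),
]

# Basic graph pattern: two triple patterns, terms starting with "?" are variables.
patterns = [
    ("ex:ATP", "ex:hasLongName", "?name"),
    ("ex:ATP", "ex:producedIn", "?part"),
]


def is_variable(term):
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Try to extend a binding so that the pattern equals the triple."""
    extended = dict(binding)
    for pattern_term, data_term in zip(pattern, triple):
        if is_variable(pattern_term):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(pattern_term, data_term) != data_term:
                return None
        elif pattern_term != data_term:
            return None
    return extended


def match_bgp(patterns, data):
    """Return every binding that satisfies all triple patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in data:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


solutions = match_bgp(patterns, data)
print(solutions)
```

Each solution is one row of the result: the variables named in the SELECT clause become the columns.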

As a simple knowledge base, the following RDF data (see Listing 3) in Turtle notation is used.

Listing 3. Simple RDF data

```
PREFIX ex: <http://example.com/>
ex:ATP ex:hasLongName "Adenosine Tri-Phosphate" .
ex:ATP ex:producedIn ex:mitochondrion .
```
Querying the RDF data of Listing 3 with the SPARQL query of Listing 2 yields the result shown in table 1.

| name | part |
|---|---|
| "Adenosine Tri-Phosphate" | http://example.com/mitochondrion |

Table 1. Result of SPARQL Query 2 on RDF Data 3

It is also possible to generate complex graph patterns out of a number of simple patterns or to define filters to restrict the result. SPARQL provides four query forms, which turn the pattern matching into either a result set (SELECT, ASK) or RDF graphs (CONSTRUCT, DESCRIBE). To serialize the result of a SELECT or an ASK query into an XML document, the *SPARQL Variable Binding Results XML Format* (W3C, 2008b) can be used.

#### **5. Using ontologies for data integration**

Biomedical ontologies play an important role in the process of data integration and support both approaches to data integration: warehousing and mediation (Bodenreider, 2008). Ontologies are a type of controlled vocabulary that attempts to capture the knowledge of a specific domain. This is the standardization required by *warehousing approaches*, where different sources are transformed into a common format and converted to a common vocabulary. In the *mediation-based approach*, on the other hand, ontologies can be used for defining a global schema and the mapping between the global schema and the local schemas of the sources to be integrated. An example of a system using this approach is *ONTOFUSION* (Perez-Rey et al., 2006). The terminological part of ontologies, which contains a list of names for the entities represented in these ontologies, is also an important resource for natural language processing (Altman et al., 2008).

Based on their granularity, ontologies can be divided into four classes (see figure 6):

• **Top-level ontologies** describe very general concepts which are independent of a particular problem or domain (Guarino, 1998) and are highly reusable across specific domains.

• **Top-domain ontologies** contain core concepts of a given domain, for example *Organism* or *Cell* for a biological domain. They work like an interface between top-level and domain ontologies (Stenzhorn et al., 2008).

• **Domain ontologies** include only domain specific concepts and therefore only describe a certain domain.

• **Local ontologies** describe the semantics of a single information resource.

Fig. 6. Classification of ontologies. The reusability decreases with increasing specification. The availability behaves exactly opposite.

The ability of ontologies to provide a map of concepts and their relationships enables semantic data integration. In this context, ontologies are used to describe the semantics of the data sources in order to make their content explicit (Boury-Brisset, 2003). The integration can take place on an extremely granular level to map data from different resources, no matter whether the resources contain structured or unstructured data (Gardner, 2005).

Ontology-based approaches to data integration usually provide a three-layer architecture in which a semantic layer, working as a mediator, sits between the presentation layer and the physical layer. This semantic mediator exploits mapping models and transforms queries into execution plans. Wrappers exploit the description of the data sources at the physical layer. This enables transparent access to diverse data sources by using a unified query language (Boury-Brisset, 2003) such as SPARQL. Ontologies are used in the mediator layer because they provide a common vocabulary for the integration of data, where each concept has a uniquely defined name, associated properties and clearly defined synonyms. Furthermore, an ontology is not a rigid structure: it can grow with time and can be connected to other ontologies. Wache et al. (2001) describe three approaches for ontology-based data integration:

• **Single ontology approach:** This approach uses a single global ontology to integrate different sources. All information sources are related to the global ontology, which can be a combination of different specialized ontologies. This approach requires data sources with a similar view on the domain and a similar granularity. A disadvantage of this approach is that the integration of new information sources can lead to big changes in the ontology used.

• **Multiple ontologies approach:** The semantics of a source is described by its own local ontology. There is no common vocabulary and therefore inter-ontology mapping is required. An advantage of this approach is that new data sources, and their local ontologies, can be easily integrated. But the lack of a common vocabulary can make the mapping between ontologies very difficult to define.

• **Hybrid approach:** This is a combination of the two preceding approaches. As with the multiple ontologies approach, resources are described by local ontologies. But to avoid the disadvantages and to make these ontologies comparable, they are built from a shared global vocabulary. This vocabulary contains basic terms of a domain and allows querying through a shared vocabulary. The vocabulary can also be an ontology. It is then possible to dispense with the mapping between the local ontologies and to define mappings only between the shared global ontology and the local ones. New sources can be easily added with no need to modify existing mappings.
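The hybrid approach can be sketched with plain dictionaries. Each source keeps its own local terms, and the only mappings that exist lead from a local term to a term of the shared vocabulary. All source names, local terms and record identifiers below are hypothetical placeholders. The sketch covers only the mapping idea and omits wrappers and query planning.

```python
# Mapping from each source's local terms to the shared vocabulary.
local_to_shared = {
    "source_a": {"polypeptide": "Protein", "locus": "Gene"},
    "source_b": {"prot": "Protein"},
}

# Records as (local term, placeholder identifier) pairs per source.
records = {
    "source_a": [("polypeptide", "record-a1"), ("locus", "record-a2")],
    "source_b": [("prot", "record-b1")],
}


def query(shared_term):
    """Answer a query phrased in the shared vocabulary across all sources."""
    hits = []
    for source, mapping in local_to_shared.items():
        for local_term, value in records[source]:
            if mapping.get(local_term) == shared_term:
                hits.append((source, value))
    return hits


def add_source(name, mapping, source_records):
    """Register a new source; existing mappings stay untouched."""
    local_to_shared[name] = mapping
    records[name] = source_records


print(query("Protein"))
```

Adding a third source only requires one new mapping to the shared vocabulary, which reflects the advantage named for the hybrid approach. In a multiple ontologies setting, a mapping to every other local ontology would be needed instead.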


An example of using ontologies for data integration in biomedicine is the Gene Ontology Annotation (GOA)<sup>6</sup> project run by the European Bioinformatics Institute (EBI). GOA is based on the single ontology approach and aims to provide *"high quality electronic and manual"* annotations to the UniProt knowledgebase<sup>7</sup> (UniProtKB) (Barrell et al., 2009). For this purpose, GOA uses the standardized vocabulary of the Gene Ontology (GO, see section 5.3.2) and the International Protein Index (IPI) (Camon et al., 2004). The IPI offers complete, non-redundant data sets representing the human, mouse and rat proteomes (Kersey et al., 2004).

<sup>6</sup> http://www.ebi.ac.uk/GOA

<sup>7</sup> http://www.ebi.ac.uk/uniprot

Another advantageous feature of ontologies is that terms are organized in a hierarchical manner (Stein, 2003). That means more specific terms are specializations of more general terms. This could help to find the most specific common term shared by two data sources. An example of such a benefit could look like the following:

One research group might create a database in which gene products are annotated to the *"negative regulation of T cell apoptosis"* class of the Gene Ontology. Another group might identify gene products which negatively regulate programmed cell death. If both groups use the terms of the GO, the two databases can be integrated by finding the most specific common term by traversing up the hierarchy (see figure 7). Without such an organized hierarchy of common concepts, the integration task comes down to tedious and error-prone manual work (Stein, 2003).

Fig. 7. Find the most specific common term by traversing up the hierarchy. (This figure shows an extract of the Gene Ontology http://www.geneontology.org)
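The upward traversal can be sketched in a few lines of Python. The `PARENTS` table below is a hand-written, simplified is_a excerpt created for this illustration; it is not the real GO graph, and the function names are hypothetical. The sketch picks the shared ancestor with the smallest total number of steps from both terms, which is one reasonable reading of "most specific common term".

```python
from collections import deque

# Hand-written, simplified is_a excerpt (child -> parents); not the real GO graph.
PARENTS = {
    "negative regulation of T cell apoptosis": ["negative regulation of lymphocyte apoptosis"],
    "negative regulation of lymphocyte apoptosis": ["negative regulation of apoptosis"],
    "negative regulation of apoptosis": ["negative regulation of programmed cell death"],
    "negative regulation of programmed cell death": ["regulation of programmed cell death"],
    "regulation of programmed cell death": ["biological process"],
    "biological process": [],
}


def ancestor_distances(term, parents):
    """Breadth-first walk up the hierarchy: ancestor -> number of is_a steps from `term`."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def most_specific_common_term(a, b, parents):
    """Return the shared ancestor closest to both terms, or None if there is none."""
    da, db = ancestor_distances(a, parents), ancestor_distances(b, parents)
    common = set(da) & set(db)
    if not common:
        return None
    # Smallest summed distance first; the term name breaks ties deterministically.
    return min(common, key=lambda t: (da[t] + db[t], t))


print(most_specific_common_term(
    "negative regulation of T cell apoptosis",
    "negative regulation of programmed cell death",
    PARENTS,
))
```

Because a term counts as its own ancestor at distance zero, the function also covers the case in the example, where one group's term is simply a more general ancestor of the other group's term.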

#### **5.1 Examples of existing top-level ontologies**

#### **5.1.1 Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE)**

DOLCE is the first module of the WonderWeb<sup>8</sup> foundational ontologies library. *"It aims at capturing the ontological categories underlying natural language and human commonsense."* (Masolo et al., 2003). The DOLCE foundational ontology and its extensions provide a domain-independent framework to build ontologies on the basis of highly reusable patterns.

#### **5.1.2 Basic Formal Ontology (BFO)**

The BFO is narrowly focused on the task of providing a genuine upper ontology which can be used in support of domain ontologies developed for scientific research, for example in biomedicine within the framework of the OBO Foundry (IFOMIS, Saarland University, 2010).

#### **5.2 Examples of existing top-domain ontologies**

#### **5.2.1 The Unified Medical Language System (UMLS)**

Clearly identified terminology is a key factor for data integration (Bodenreider, 2004). For this reason the UMLS was developed by the National Library of Medicine (NLM)<sup>9</sup>. It consists of three knowledge sources, which can be used separately or together (U.S. National Library of Medicine, 2010):

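• **Lexical resources:** SPECIALIST lexicon: Intends to be a general English lexicon which includes many biomedical terms.

• **Terminological resources:** Metathesaurus: Includes biomedical and health related source vocabularies, concepts and the relationships between them.

• **Ontological resources:** Semantic Network: Contains categorization of all concepts represented in the UMLS Metathesaurus and relationships between these categories.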

<sup>8</sup> http://wonderweb.semanticweb.org

<sup>9</sup> http://www.nlm.nih.gov


The *Semantic Network* (SN) can be seen as a collection of ontologies. To use these with Semantic Web technologies, the SN has to be converted to OWL DL. Several approaches map or convert the UMLS SN to RDF (Zeng & Bodenreider, 2007), to OWL (Kashyap & Borgida, 2003; Schulz et al., 2009) or convert only parts of it to OWL (Chabalier et al., 2007). However, this task raises formalization problems, such as the complex semantics and the rich attribute set of the UMLS SN.

#### **5.2.2 BioTop**

BioTop is a top-domain ontology for the Life Sciences with the goal to provide *"an ontologically sound layer for linking and integrating various specific domain ontologies from the life sciences domain."* (Beisswanger et al., 2008).

#### **5.3 Examples of existing domain ontologies**

#### **5.3.1 Open Biological and Biomedical Ontologies (OBO)**

The OBO Foundry is a collaborative experiment involving science-based ontology developers. The goal is to create orthogonal, interoperable reference ontologies in the biomedical domain (OBO Foundry, n.d.). These ontologies typically use the OBO flat file format. Like OWL, OBO is an ontology representation language (Richter, 2006). Ontologies based on the OBO flat file format can be bi-directionally converted to the OWL-DL format (Aranguren et al., 2007; Hoehndorf et al., 2010; Smith B. et al., 2007). The two most significant OBO ontologies are the Gene Ontology (GO), which contains the principal attributes of gene products, and the Sequence Ontology (SO), which describes the features of biological sequences.
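To give an impression of the OBO flat file format, the following Python sketch reads the `[Term]` stanzas of a small OBO fragment. It is a minimal illustration, not a full parser: it only handles the `id`, `name` and `is_a` tags, strips trailing `!` comments, and ignores every other stanza type and tag. The function name `parse_obo_terms` and the `SAMPLE` fragment are assumptions made for this example.

```python
# A small OBO-style fragment, written for this example.
SAMPLE = """\
format-version: 1.2

[Term]
id: GO:0008150
name: biological_process

[Term]
id: GO:0009987
name: cellular process
is_a: GO:0008150 ! biological_process

[Typedef]
id: part_of
name: part of
"""


def parse_obo_terms(text):
    """Collect name and is_a parents from the [Term] stanzas of an OBO document."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are kept in this sketch.
            current = {"name": None, "is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header lines, blank lines and non-Term stanzas
        tag, _, value = line.partition(":")
        value = value.split("!")[0].strip()  # drop a trailing "! comment"
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
    return terms


for term_id, info in parse_obo_terms(SAMPLE).items():
    print(term_id, info["name"], info["is_a"])
```

The resulting child-to-parent links are the kind of structure that hierarchy traversals over an ontology operate on.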

#### **5.3.2 Gene Ontology (GO)**

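The GO project<sup>10</sup> contains defined terms which represent gene product properties. The GO covers three aspects in separate ontologies (Gene Ontology, n.d.):

• **Cellular component**: the parts of a cell or its extracellular environment.

• **Molecular function:** the elemental activities of a gene product at the molecular level, such as binding or catalysis.

• **Biological process:** operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs and organisms.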


#### **5.3.3 Sequence Ontology (SO)**

The SO project<sup>11</sup> contains defined terms which describe the features and properties of biological sequences. SO is a sister project of the GO and also part of OBO (Eilbeck et al., 2005).

<sup>10</sup> http://www.geneontology.org

<sup>11</sup> http://www.sequenceontology.org

#### **6. Relational database integration using the Semantic Web Approach**

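A lot of biomedical data is available to the scientific community on the web. Much of this information is stored in a variety of different databases, which differ in the type of biological data they provide (Baker & Cheung, 2007). For example:

• *Sequence databases* like the EMBL Nucleotide Sequence Database (EBI, n.d.a) or NCBI's GenBank (NCBI, 2004).

• *Microarray gene expression databases* like the EMBL ArrayExpress Archive (EMBL-EBI, n.d.), NCBI's Gene Expression Omnibus (GEO) (NCBI, n.d.) or the Stanford Microarray Database (SMD) (Stanford University, n.d.).

• *Pathway databases* like KEGG (Kanehisa-Laboratories, n.d.) or the Human Protein Reference Database (HPRD) (Keshava Prasad et al., 2008).

• *Proteomic databases* like UniProt (EBI, n.d.b).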


Computational analyses of biological data often require multiple datasets. Currently, the integration of different datasets is usually done manually. This approach is very time-consuming, which creates a need for integrated datasets with rich, flexible and efficient interfaces (Smith A. et al., 2007).

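#### **6.1 Problems of heterogeneous database integration**

• **Technical heterogeneity** results from different access protocols, file formats, query languages and so on.

• **Data model heterogeneity** arises because different models are used to store the same data.

• **Semantic heterogeneity** occurs when different databases with distinct but related data are combined, for example when a gene database is combined with a protein database. A gene may have gene products, and therefore these two databases are related.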


Resolving such heterogeneity and enabling database integration is a key problem which the Semantic Web aims to address (Baker & Cheung, 2007). Therefore a mapping language between RDF and relational databases called RDB2RDF is under development.

#### **6.2 RDB2RDF**

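A workshop hosted by the W3C on *"RDF accesses to Relational Databases"* in October 2007 resulted in the creation of an RDB2RDF Incubator Group (W3C, 2010b), which operated from 2008 to 2009. The objective of this group was to set up a working group to develop a standardized mapping language between RDF and relational databases (W3C, 2009c). The resulting RDB2RDF Working Group started in 2009 with the following mission: *"The mission of the RDB2RDF Working Group, part of the Semantic Web Activity, is to standardize a language for mapping relational data and relational database schemas into RDF and OWL, tentatively called the RDB2RDF Mapping Language, R2RML."* (W3C, 2009b). The results of this working group are scheduled for release on September 30th, 2011. The RDB2RDF mapping language could be used in two ways (see figure 8):

1. Extracting the data from the relational database and storing the content in RDF. In this case the data is physically converted to RDF in an ETL (Extract-Transform-Load) process and then stored in an RDF triple store. An advantage of this approach is its easy implementation. A disadvantage is that there is always a separate copy of the relational data.

2. Generating a virtual mapping between the Semantic Web technologies and the relational database. With this virtual mapping, queries are posed via SPARQL and translated into SQL queries on the underlying relational data.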


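Fig. 8. Two approaches which use the RDB2RDF mapping language. (Labels in the figure: Database A, Database B, ETL process, RDF copy of relational data, mapping RDF <-> relational data (RDB2RDF), SQL query, query (e.g. SPARQL), Semantic Web technology.)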
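The difference between a physical RDF copy and a virtual mapping can be made concrete with a small Python sketch based on the standard `sqlite3` module. It is not R2RML and it does not parse SPARQL: the `gene` table, the `http://example.org/` namespace, the `COLUMN_FOR` mapping and both function names are hypothetical, and a single triple pattern with a fixed predicate and object stands in for a real query.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace
# Hypothetical mapping: predicate URI -> column of the "gene" table.
COLUMN_FOR = {BASE + "symbol": "symbol", BASE + "organism": "organism"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (id INTEGER PRIMARY KEY, symbol TEXT, organism TEXT)")
conn.executemany(
    "INSERT INTO gene (id, symbol, organism) VALUES (?, ?, ?)",
    [(1, "TP53", "Homo sapiens"), (2, "Trp53", "Mus musculus")],
)


def etl_to_triples(connection):
    """ETL approach: physically copy every row into (subject, predicate, object) triples."""
    triples = []
    for row_id, symbol, organism in connection.execute(
        "SELECT id, symbol, organism FROM gene ORDER BY id"
    ):
        subject = f"{BASE}gene/{row_id}"
        triples.append((subject, BASE + "symbol", symbol))
        triples.append((subject, BASE + "organism", organism))
    return triples


def virtual_subjects(connection, predicate, value):
    """Virtual mapping: answer the pattern (?s, predicate, value) by rewriting it to SQL."""
    column = COLUMN_FOR[predicate]  # column names come from the fixed mapping only
    sql = f"SELECT id FROM gene WHERE {column} = ? ORDER BY id"
    return [f"{BASE}gene/{row_id}" for (row_id,) in connection.execute(sql, (value,))]


store = etl_to_triples(conn)  # separate RDF-style copy of the relational data
print(len(store))
print(virtual_subjects(conn, BASE + "organism", "Mus musculus"))
```

If a row is added to the table after `etl_to_triples` has run, `store` no longer reflects the database until the ETL step is repeated, whereas `virtual_subjects` always queries the live relational data. This is the trade-off between the two approaches.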

#### **7. Data integration and knowledge acquisition from biomedical literature**

The quantity of biomedical literature is growing steadily, at a rate of several thousand papers per week (Ananiadou et al., 2006). A large percentage of information is encoded in literature (Krallinger et al., 2008), but for a scientist it is next to impossible to read all relevant literature on a specific topic. It is therefore important to extract semantic information from the literature to enable machine processing. This section provides an overview of how Semantic Web technologies support this task. Ontologies in particular help to handle this influx of information and enable the data integration of biomedical literature (Spasic et al., 2005). Basic techniques to extract information from natural language are text mining (TM) and natural language processing (NLP).

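The stages of TM are:

1. *Information retrieval (IR):* Retrieval of relevant documents.
2. *Information extraction (IE):* Extraction of relevant information from the documents.
3. *Data mining (DM):* Discovery of associations between the pieces of information extracted by IE.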


#### **7.1 Information retrieval (IR)**

The process of IR can be improved by adding a semantic layer. This layer formulates semantic queries, which offer a higher expressive power than keyword matching (Spasic et al., 2005). Adding semantic information to enhance the process of finding relevant information is, more generally, a main part of Semantic Web technology. Examples of such query systems are:

• **GoPubMed** (www.gopubmed.org): This system submits keywords to PubMed<sup>12</sup>. The resulting abstracts are matched against the Gene Ontology and Medical Subject Headings (MeSH) (Doms & Schroeder, 2005) in order to classify them. To find a match, a term extraction algorithm based on local sequence alignment is used (Delfs et al., 2004). In other words, GoPubMed organizes the results of a PubMed search using the GO.

<sup>12</sup> PubMed (http://www.pubmed.gov) is a literature database provided by the National Library of Medicine and the National Institutes of Health.

• **Textpresso** (http://www.textpresso.org): A tool for neuroscience which has its own literature filled database. It uses a custom ontology to query nine different categories (Müller et al., 2008).
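The idea of organizing search results by ontology terms can be sketched as follows. This is a deliberately simplified stand-in: where GoPubMed is described as using a term extraction algorithm based on local sequence alignment, the sketch uses plain case-insensitive phrase matching. The vocabulary, the abstracts and the function name `classify` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical mini vocabulary: ontology term -> label variants to look for.
VOCABULARY = {
    "apoptosis": ["apoptosis", "programmed cell death"],
    "DNA repair": ["dna repair"],
}

# Hypothetical search results (document id -> abstract text).
ABSTRACTS = {
    "doc1": "p53 triggers programmed cell death after DNA damage.",
    "doc2": "BRCA1 is involved in DNA repair and apoptosis.",
    "doc3": "A survey of microarray normalisation methods.",
}


def classify(abstracts, vocabulary):
    """Group document ids under every ontology term whose label occurs in the text."""
    index = defaultdict(list)
    for doc_id, text in abstracts.items():
        for term, labels in vocabulary.items():
            # Whole-phrase, case-insensitive match; a stand-in for real term extraction.
            if any(
                re.search(r"\b" + re.escape(label) + r"\b", text, re.IGNORECASE)
                for label in labels
            ):
                index[term].append(doc_id)
    return dict(index)


print(classify(ABSTRACTS, VOCABULARY))
```

A document that mentions only a label variant, such as "programmed cell death", is still filed under the corresponding term, which is what keyword matching on the term name alone would miss.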

#### **7.2 Information Extraction (IE), Data Mining (DM)**

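There are two ways to enhance the process of IE, that is, to use TM and NLP in support of "literature data integration" based on Semantic Web technologies:

1. Ontology-assisted extraction of meta-information from literature.
2. Semi-automatic or automatic engineering of ontologies for a specific domain based on information extracted from literature.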


Generally, text mining is used to aid experts in extracting knowledge from a large volume of text by automatically filtering relevant information. A known problem is finding terms which represent specific classes of biomedical entities (e.g. protein names). This process is called *Named Entity Recognition* (NER). The integration of knowledge, supported by ontologies, can improve NER. The goal is to extract terms and map them to concepts of a domain-specific ontology. A challenge in this process is the myriad variations of terms used to describe things in natural language. Approximately one third of term occurrences are variants (Jacquemin, 2001) and therefore only synonyms of known terms. Another problem is the specific terminology in biomedical texts. Terminological knowledge is of vital importance to TM for characterizing knowledge in the domain. This knowledge is stored in ontologies and can enhance the process of IE by (Spasic et al., 2005):

• Using an ontology as a training set for NER by reducing it to a list of classified terms. This can be done in two ways:
	- **–** *Passive ontology use (ontology-based IE):* The goal of this approach is to map recognized terms to ontology concepts by look-up.
	- **–** *Active ontology use (ontology-driven IE):* involves ontologies directly in the process of term recognition.

• Using ontologies to improve machine learning approaches for TM tasks, such as term classification, term clustering and term relation extraction.
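A look-up against an ontology that has been reduced to a list of classified terms can be sketched as below. The lexicon entries, the concept labels and the function name `recognise` are hypothetical, and real systems handle term variation far more thoroughly than the single spelling variant shown here.

```python
import re

# Hypothetical ontology reduced to classified terms:
# lower-case surface form -> (concept, class).
LEXICON = {
    "p53": ("TP53", "protein"),
    "tumor protein p53": ("TP53", "protein"),
    "tumour protein p53": ("TP53", "protein"),  # spelling variant of the same concept
    "apoptosis": ("apoptotic process", "biological process"),
}


def recognise(text, lexicon):
    """Return (start, surface form, concept, class) for each non-overlapping lexicon hit."""
    # Longer forms first, so that "tumour protein p53" wins over the embedded "p53".
    forms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(form) for form in forms) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        concept, cls = lexicon[match.group(1).lower()]
        hits.append((match.start(), match.group(1), concept, cls))
    return hits


for hit in recognise("Tumour protein p53 induces apoptosis; p53 is often mutated.", LEXICON):
    print(hit)
```

Both the long form and the short form are mapped to the same concept, which illustrates how synonym lists in an ontology absorb part of the term variation found in natural language.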

#### **7.3 Semi-automatic or automatic ontology engineering**

An advanced task is the semi-automatic or automatic engineering of ontologies for a specific domain on the basis of information extracted from literature. Currently the development of ontologies *"is largely a manual process, based on personal experience and intuition"* (Alexopoulou et al., 2008). Two primary parts of this process are:

1. Extracting terms which represent a concept in the specific domain.
2. Finding relationships between different concepts.


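For automatic terminology development it is important to extract terms from text. This automatic identification of possible term candidates is called *automatic term recognition (ATR)*. At the moment ATR is not able to fully automate the process of ontology design, but it can speed up this process by providing lists of useful domain-specific terms extracted from domain-specific literature. It can therefore support the semi-automatic creation of ontologies (Alexopoulou et al., 2008). Examples of frameworks which support ATR and further identify the semantic relations between the terms are:

• *Text2Onto:* This is a framework for ontology learning from textual resources. It is based on algorithms calculating the relative term frequency (Cimiano & Volker, 2005).

• *OntoLearn:* OntoLearn is based on a linguistic processor and a syntactic parser. It is able to extract syntactically plausible terminological noun phrases (Navigli & Velardi, 2004; Velardi et al., 2005).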
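As a rough illustration of frequency-based term recognition, the following Python sketch ranks single-word candidates by their relative frequency in a small corpus. It is not the implementation of any of the frameworks named in this section: the stop-word list, the tokenization rule, the corpus and the function name are assumptions, and multi-word terms are ignored for brevity.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, far from complete).
STOP_WORDS = {"the", "of", "a", "an", "in", "is", "and", "to", "by", "are"}


def relative_term_frequencies(documents):
    """Relative frequency of each candidate word: its count divided by all candidate counts."""
    counts = Counter()
    for text in documents:
        tokens = re.findall(r"[a-z][a-z0-9-]*", text.lower())
        counts.update(token for token in tokens if token not in STOP_WORDS)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


# Hypothetical domain corpus.
CORPUS = [
    "The kinase phosphorylates the receptor.",
    "Receptor binding activates the kinase.",
    "The receptor is degraded.",
]

rtf = relative_term_frequencies(CORPUS)
for term, score in sorted(rtf.items(), key=lambda kv: (-kv[1], kv[0]))[:3]:
    print(f"{term}: {score:.2f}")
```

The highest-ranked words form a candidate list that a domain expert can review, which matches the semi-automatic workflow described above rather than a fully automatic one.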


#### **8. Challenges in data integration using Semantic Web technologies**

#### **8.1 Uniform naming**

A challenge faced by data integration is the individual naming of objects. For example, a KEGG<sup>13</sup> entry refers to a collection of proteins involved in a pathway, whereas a UniProt entry refers to a class of proteins, a class of variant proteins or some viral protein. To integrate these two resources, a mapping is required. One approach is to designate an authoritative names commission to manage the definitive list of such names (Stein, 2003). An example is the HUGO Gene Nomenclature Committee<sup>14</sup> for gene names and symbols (short-form abbreviations). However, because the field of biomedical research is so dynamic, this approach rarely works in practice (Stein, 2003).

Another way could be the creation of globally unique biological identifiers. URIs can be used for this purpose, since they allow resources to be identified uniquely. This is central to the use of Semantic Web technologies. Therefore a process is needed which routinely assigns URIs to objects (Shadbolt et al., 2006) to create common, shared identities and names (Goble & Stevens, 2008).
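A routine that assigns URIs to objects can be as simple as the following Python sketch. The namespace `http://example.org/id/`, the URI layout and the function name `mint_uri` are hypothetical choices for illustration; the sketch only shows that the same source and local identifier always yield the same URI.

```python
from urllib.parse import quote

BASE_URI = "http://example.org/id/"  # hypothetical namespace of the naming authority


def mint_uri(source, local_id):
    """Build a stable URI from a source name and the identifier used inside that source."""
    # quote() percent-encodes characters that are not safe in a URI path segment.
    return f"{BASE_URI}{quote(source.lower(), safe='')}/{quote(local_id, safe='')}"


print(mint_uri("UniProt", "P04637"))
print(mint_uri("KEGG", "hsa:7157"))
```

Assigning URIs in this way does not by itself state that two entries from different sources describe related objects; that still requires an explicit mapping between the minted URIs.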

#### **8.2 Extraction of the semantic information out of existing knowledge**

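For efficient use of Semantic Web technologies, it would be useful to extract the semantic information from existing sources automatically or semi-automatically. A big challenge is therefore to develop methods which support such a task. This would aid two main tasks in data integration using Semantic Web technologies:

1. **Annotate sources to existing ontologies:** This is a process which extracts information from the data source to automatically or semi-automatically annotate this source to an existing ontology.
2. **Creation process of ontologies:** This is a task which extracts information from different data sources belonging to a specific domain. The goal is to automatically or semi-automatically create an ontology based on the extracted domain information.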


A large percentage of information encoded in literature (Krallinger et al., 2008) is in the form of natural language. Some approaches for such "semantic information extraction" from literature can be found in section 7.

#### **8.3 Ontology development, maintenance and quality**

Ontologies must be developed, managed and endorsed by committed communities of practice (Shadbolt et al., 2006). Furthermore, an ontology is a "living structure": its concepts can be added, changed, replaced or removed as new knowledge emerges. Ontologies are therefore never fixed and must be maintained continuously. Another problem is the quality assurance (QA) of ontologies. According to Gruber (1995), the design and quality criteria for ontologies should be:

1. *Clarity:* The intended meaning should be clearly defined and the definitions should be objective.
2. *Extendibility:* The effort needed to extend an ontology without invalidating it.
3. *Minimal encoding bias:* No particular symbol-level encoding should be used to specify terms.
4. *Minimal ontological commitment:* An ontology should use as few terms and relationships as possible to describe the domain being modeled.
5. *Coherence:* The content of the ontology should be coherent. In other words, inferences should never contradict a definition.

The quality of an ontology can be checked either collaboratively by users or centrally by experts. To test the coherence of an ontology, *ontology reasoners* such as Pellet<sup>15</sup> can be used. Ontology reasoning is a process of automated logical inference over the knowledge held in ontologies. It is used to check the consistency of knowledge models and to infer new knowledge in accordance with the laws of logic.

#### **8.4 Mapping, merging, alignment and integration of ontologies**

Many individual ontologies have been created, and the semantic mapping between different ontologies has therefore become a core issue for the Semantic Web and for data integration using its technology. To handle the increasing number of ontologies it is necessary to develop semi-automatic or automatic approaches (Ehrig & Sure, 2004).

The problem with mapping ontologies is their heterogeneity, which can be divided into *metadata heterogeneity* and *instance heterogeneity* (Tang et al., 2006). Metadata heterogeneity concerns the intended meaning of the information held in different ontologies and deals with *structural conflicts* and *name conflicts*. Structural conflicts arise from ontologies which cover the same domain but have different taxonomies (Ehrig & Sure, 2004), whereas name conflicts concern homonyms and synonyms between concepts of different ontologies. Instance heterogeneity refers to variation in notation, e.g. different date formats.

Merging, alignment and integration are ontology reuse processes for creating a new ontology. The task of each process is as follows (Choi et al., 2006; Ding et al., 2002):

• **Merging** is the task of generating a single ontology by merging two or more different ontologies of the same domain.

• **Alignment** is a process of creating links between two ontologies when the sources are consistent but kept separate. This addresses the problem of mapping between ontologies.

• **Integration** generates a single ontology by combining two or more different ontologies on different subjects.

Data which covers different domains often cannot be described by only one ontology. Therefore it is necessary to map different ontologies. There are different strategies for mapping various ontologies:

• *Ontology mapping between a global ontology and local ontologies* (Beneventano et al., 2003): Defines mappings from concepts in local ontologies to a global ontology.

• *Mapping between local ontologies:* These strategies define mappings between local ontologies.

<sup>13</sup> KEGG: Kyoto Encyclopedia of Genes and Genomes (http://www.genome.jp/kegg/)

<sup>14</sup> http://www.hugo-international.org/comm\_genenomenclaturecommittee.php

<sup>15</sup> Pellet is an OWL 2 reasoner for Java (http://clarkparsia.com/pellet).

#### **8.5 Query RDF data**

SPARQL overcomes the old problem of different, non-standard query languages: RDF data can now be queried using a standard query language (Quilitz & Leser, 2008). It is important, however, that content providers offer SPARQL endpoints to make their data available. Such endpoints provide a machine-friendly interface to the knowledge base and enable queries in the SPARQL language. One challenge is to query more than one endpoint at the same time with a single query. The approaches to this can be divided into two groups (Haase et al., 2010; Kei-Hoi et al., 2009):

• **Warehousing:** This approach stores all RDF data from the different resources in one central database. This database is typically a *triple store*, which is designed to efficiently store and handle RDF data.

• **Federated query:** A query engine decomposes a single query into sub-queries, each of which can be answered by an individual endpoint. After that, all results are combined again into one and presented to the user.
Two examples of Java frameworks are *Sesame*<sup>16</sup>, which supports the warehouse approach, and the *ARQ*<sup>17</sup> extension of the *Jena Ontology API*<sup>18</sup>, which provides the federated query approach.
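The difference between the warehousing and the federated approach can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the "endpoints" are plain in-memory lists of triples, the `ex:` names are hypothetical, and the helper functions stand in for a real query engine. It does not use SPARQL itself or the Sesame or Jena APIs.

```python
from typing import Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]  # None is a wildcard


def match(triples: Iterable[Triple], pattern: Pattern) -> List[Triple]:
    """Return all triples that agree with the pattern on every non-wildcard position."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]


# Two hypothetical endpoints, each holding its own RDF-like triples.
endpoint_a = [("ex:geneX", "ex:encodes", "ex:proteinX")]
endpoint_b = [("ex:proteinX", "ex:participatesIn", "ex:pathwayY"),
              ("ex:proteinZ", "ex:participatesIn", "ex:pathwayY")]


def pathways_for_gene(encodes_source, pathway_source, gene: str) -> List[str]:
    """Answer 'in which pathways does the product of this gene participate?'.

    The question is split into two sub-queries whose results are joined
    on the shared protein.
    """
    proteins = {o for _, _, o in match(encodes_source, (gene, "ex:encodes", None))}
    hits = match(pathway_source, (None, "ex:participatesIn", None))
    return sorted(o for s, _, o in hits if s in proteins)


# Warehousing: copy everything into one central store and query only that store.
central_store = endpoint_a + endpoint_b
warehouse_result = pathways_for_gene(central_store, central_store, "ex:geneX")

# Federation: send each sub-query to the endpoint that can answer it, then join.
federated_result = pathways_for_gene(endpoint_a, endpoint_b, "ex:geneX")
```

Both strategies return the same answer here. The trade-off lies elsewhere: the warehouse must be kept synchronized with its sources, while the federated approach depends on every endpoint being reachable at query time.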

#### **8.6 Visualization**

The semantic integration of different resources increases the amount of semantically linked data. Semantic Web technologies use RDF to define links between data. The challenge is therefore to create an interface for visualizing and navigating a massive RDF graph without information overload. The visualization should help the user to easily explore the structure and quickly find relevant information in it (Le Grand & Soto, 2002).

#### **8.7 Availability**

There are two issues: the availability of ontologies and the availability of content. A key to integrating data using Semantic Web technologies is the availability of ontologies. Many ontologies are freely available, but concerns arise if an ontology is commercial or only partially released. For example, a license is necessary to access UMLS<sup>19</sup>. On the other hand, it is important to be able to access content which is annotated with ontologies. Problems arise if this content is not available because of technical problems, deleted static web sites, legal restrictions, etc.

#### **8.8 Different ontology formats**

The Semantic Web defines ontologies in the OWL format, but other ontologies exist in different formats (for example the UMLS Rich Release Format (RRF) or the OBO format). Therefore, mappings must be defined to convert these different formats to OWL.
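What such a format mapping involves can be sketched in Python for a very small subset of the OBO format. This is a simplified illustration, not a complete converter: it only reads the `id`, `name` and `is_a` tags of `[Term]` stanzas, the sample terms with the `EX:` prefix are hypothetical, and the output is a list of plain triples with prefixed names rather than a serialized OWL document.

```python
# Hypothetical sample input in OBO flat file syntax.
OBO_TEXT = """\
[Term]
id: EX:0000001
name: example process

[Term]
id: EX:0000002
name: example subprocess
is_a: EX:0000001 ! example process
"""


def parse_obo_terms(text):
    """Parse [Term] stanzas into dicts mapping each tag to a list of values."""
    terms, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = {}
            terms.append(current)
        elif line.startswith("["):
            current = None  # another stanza type; ignored in this sketch
        elif line and current is not None and ":" in line:
            tag, value = line.split(":", 1)
            value = value.split(" ! ", 1)[0].strip()  # drop the trailing comment
            current.setdefault(tag.strip(), []).append(value)
    return terms


def terms_to_triples(terms):
    """Map each term to OWL/RDFS-style triples: class, label and subclass links."""
    triples = []
    for term in terms:
        subject = term["id"][0]
        triples.append((subject, "rdf:type", "owl:Class"))
        for name in term.get("name", []):
            triples.append((subject, "rdfs:label", name))
        for parent in term.get("is_a", []):
            triples.append((subject, "rdfs:subClassOf", parent))
    return triples


triples = terms_to_triples(parse_obo_terms(OBO_TEXT))
```

Here `is_a` is mapped to `rdfs:subClassOf` and `name` to `rdfs:label`. A production mapping would also have to handle synonyms, definitions, cross-references and typed relationships, which this sketch leaves out.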

#### **8.9 Multilingualism**

Multilingualism is also a challenge when using Semantic Web technologies (Börner, 2006). It plays a role in ontology development, in the annotation of data and in representing multilingual information in user interfaces (Benjamins et al., 2002). The following scenario illustrates a problem caused by multilingualism:

*User A* annotates a document in French to *Term A* of an ontology designed in English. *User B* searches for *Term A* in English and finds a document related to what he is interested in, but it is written in French.

<sup>16</sup> http://www.openrdf.org/

<sup>17</sup> http://jena.sourceforge.net/ARQ/

<sup>18</sup> http://jena.sourceforge.net

<sup>19</sup> This license is freely available for research purposes

#### **9. Discussion**

The idea behind the Semantic Web is to transform the Web into a global knowledge base (Kei-Hoi et al., 2009). The key to making this possible is data integration. Semantic Web technologies offer a more or less standardized hierarchical framework for data integration and enable a decentralized semantic integration of heterogeneous data sources. For this integration, it is not necessary to change the structure of the data in order to assemble knowledge from structured and unstructured sources. Instead, the technology extends the source by adding machine-readable semantic metadata using the Resource Description Framework (RDF). This metadata contains sets of relations between data and concepts, which enables people to clearly and commonly define the concepts and logic within any document (Neumann et al., 2004). Furthermore, Semantic Web technologies support automatic traversal of the connected resources, so that the integrated sources can be queried, or new knowledge even inferred, using the standard query language SPARQL. The prerequisite for meaningful semantic data integration is the presence of ontologies. They enable a unique identification of entities in heterogeneous information systems and provide semantic data integration at different levels of granularity. Semantic Web technologies provide standard languages for creating ontologies, including RDF Schema (RDFS) and the Web Ontology Language (OWL). The quality of the data integration is tightly correlated with the quality of the ontologies used. In recent years, however, many high-quality open-access biomedical ontologies have been created, such as the Gene Ontology and the Open Biological and Biomedical Ontologies.

In summary, Semantic Web technologies are a promising tool for data integration, but some challenges remain to be overcome, such as uniform naming, extraction of semantic information from existing knowledge, ontology development and maintenance, and querying RDF data (see section 8).

#### **10. Additionally**

A publicly available example software, termed *OBOBrowsA*, can be downloaded from http://www.umit.at/page.cfm?vpath=departments/technik/iebe/tools/obobrowsa&switchLocale=en\_US. It is able to load and display OBO files<sup>20</sup> in tree or graph representation. The software further allows the user to interactively browse through the ontology, search for ontology classes and annotate textual data. The manual and application examples are included in the help function.

<sup>20</sup> Link to download OBO formatted ontologies: http://www.obofoundry.org

#### **11. References**

Alesso, H. & Smith, C. (2006). *Thinking on the Web: Berners-Lee, Gödel, and Turing*, Wiley-Interscience.

Alexopoulou, D., Wächter, T., Pickersgill, L., Eyre, C. & Schroeder, M. (2008). Terminologies for text-mining; an experiment in the lipoprotein metabolism domain, *BMC Bioinf* 9(Suppl 4): S2.

Altman, R., Bergman, C., Blake, J., Blaschke, C., Cohen, A., Gannon, F., Grivell, L., Hahn, U., Hersh, W. & Hirschman, L. (2008). Text mining for biology-the way forward: opinions from leading scientists, *Genome Biol* 9(Suppl 2): S7.

Ananiadou, S., Kell, D. & Tsujii, J. (2006). Text mining and its potential applications in systems biology, *Trends Biotechnol* 24(12): 571–579.

Aranguren, M., Bechhofer, S., Lord, P., Sattler, U. & Stevens, R. (2007). Understanding and using the meaning of statements in a bio-ontology: recasting the gene ontology in owl, *BMC bioinformatics* 8(1): 57.

Baker, C. & Cheung, K. (2007). *Semantic Web: Revolutionizing knowledge discovery in the life sciences*, Springer Verlag.

Barrell, D., Dimmer, E., Huntley, R., Binns, D., O'Donovan, C. & Apweiler, R. (2009). The goa database in 2009–an integrated gene ontology annotation resource, *Nucleic Acids Res.* 37(Database issue): D396–D403.

Beisswanger, E., Schulz, S., Stenzhorn, H. & Hahn, U. (2008). Biotop: An upper domain ontology for the life sciences. a description of its current structure, contents and interfaces to obo ontologies, *Applied Ontology* 3(4): 205–212.

Beneventano, D., Bergamaschi, S., Guerra, F. & Vincini, M. (2003). Synthesizing an integrated ontology, *IEEE Internet Comput* 7(5): 42–51.

Benjamins, V., Contreras, J., Corcho, Ó. & Gómez-Pérez, A. (2002). Six challenges for the semantic web, *KR2002 Semantic Web Workshop*.

Berners-Lee, T. (1997). Metadata architecture, URL: http://www.w3.org/DesignIssues/Metadata. 18.03.2011.

Berners-Lee, T., Fielding, R. & Masinter, L. (2005). Uniform resource identifier (uri): Generic syntax, URL: http://tools.ietf.org/rfc/rfc3986.txt. 18.03.2011.

Berners-Lee, T., Hendler, J. & Lassila, O. (2001). The semantic web, *Sci. Am.* 284(5): 28–37.

Bodenreider, O. (2004). The unified medical language system (umls): integrating biomedical terminology, *Nucleic Acids Res.* 32(Database Issue): D267.

Bodenreider, O. (2008). Ontologies and data integration in biomedicine: Success stories and challenging issues, *in* A. Bairoch, S. Cohen-Boulakia & C. Froidevaux (eds), *Data Integration in the Life Sciences*, Vol. 5109 of *Lecture Notes in Computer Science*, Springer Berlin / Heidelberg, pp. 1–4.

Börner, K. (2006). Semantic association networks: Using semantic web technology to improve scholarly knowledge and expertise management, *in* V. Geroimenko & C. Chen (eds), *Visualizing the Semantic Web*, Springer London, pp. 183–198.

Boury-Brisset, A. (2003). Ontology-based approach for information fusion, *Proceedings of the Sixth International Conference on Information Fusion*, pp. 522–529.

Brickley, D. & Miller, L. (2010). Foaf vocabulary specification 0.98, URL: http://xmlns.com/foaf/spec/. 18.03.2011.

Calì, A., Calvanese, D., De Giacomo, G. & Lenzerini, M. (2001). Accessing data integration systems through conceptual schemas, *Conceptual Modeling - ER 2001*, Vol. 2224 of *Lecture Notes in Computer Science*, Springer Berlin / Heidelberg, pp. 270–284.

Calì, A., Calvanese, D., De Giacomo, G. & Lenzerini, M. (2003). On the expressive power of data integration systems, *in* S. Spaccapietra, S. March & Y. Kambayashi (eds), *Conceptual Modeling - ER 2002*, Vol. 2503 of *Lecture Notes in Computer Science*, Springer Berlin / Heidelberg, pp. 338–350.

Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R. & Apweiler, R. (2004). The gene ontology annotation (goa) database: sharing knowledge in uniprot with gene ontology, *Nucleic Acids Res.* 32(Database Issue): D262–D266.

Chabalier, J., Dameron, O. & Burgun, A. (2007). Integrating and querying disease and pathway ontologies: building an owl model and using rdfs queries, *ISMB conference*, Citeseer.

Cheung, K., Smith, A., Yip, K., Baker, C. & Gerstein, M. (2007). Semantic web approach to database integration in the life sciences, *in* C. J. O. Baker & K.-H. Cheung (eds), *Semantic Web*, Springer US, pp. 11–30.

Choi, N., Song, I. & Han, H. (2006). A survey on ontology mapping, *SIGMOD Rec.* 35(3): 34–41.

Cimiano, P. & Volker, J. (2005). A framework for ontology learning and data-driven change discovery, *Proceedings of the 10th International Conference on Applications of Natural Language to Information Systems (NLDB)*, Springer, pp. 227–238.

Davidson, S., Overton, C. & Buneman, P. (1995). Challenges in integrating biological data sources, *J. Comput. Biol.* 2(4): 557–572.

Delfs, R., Doms, A., Kozlenkov, A. & Schroeder, M. (2004). Gopubmed: ontology-based literature search applied to geneontology and pubmed, *Proceedings of German Bioinformatics Conference. LNBI*, pp. 169–178.

Ding, Y., Fensel, D., Klein, M. & Omelayenko, B. (2002). The semantic web: yet another hip?, *Data Knowl Eng* 41(2-3): 205–227.

Doms, A. & Schroeder, M. (2005). Gopubmed: exploring pubmed with the gene ontology, *Nucleic Acids Res.* 33(Web Server Issue): W783–W786.

Donini, F., Lenzerini, M., Nardi, D. & Schaerf, A. (1996). *Reasoning in description logics*, Center for the Study of Language and Information, Stanford, CA, USA.

EBI (n.d.a). Embl nucleotide sequence database, URL: http://www.ebi.ac.uk/embl. 18.03.2011.

EBI (n.d.b). Uniprot, URL: http://www.ebi.ac.uk/uniprot. 18.03.2011.

Ehrig, M. & Sure, Y. (2004). Ontology mapping - an integrated approach, *in* C. Bussler, J. Davies, D. Fensel & R. Studer (eds), *The Semantic Web: Research and Applications*, Vol. 3053 of *Lecture Notes in Computer Science*, Springer Berlin / Heidelberg, pp. 76–91.

Eilbeck, K., Lewis, S., Mungall, C., Yandell, M., Stein, L., Durbin, R. & Ashburner, M. (2005). The sequence ontology: a tool for the unification of genome annotations, *Genome Biol* 6(5): R44.

EMBL-EBI (n.d.). Array express archive, URL: http://www.ebi.ac.uk/microarray-as/ae. 18.03.2011.

Fensel, D. (2004). *Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce*, Springer.

Gagnon, M. (2007). Ontology-based integration of data sources, *10th international Conference on Information Fusion, Quebec, Canada*.

Gardner, S. (2005). Ontologies and semantic data integration, *Drug Discov Today* 10(14): 1001–1007.

Gene Ontology (n.d.). An introduction to the gene ontology, URL: http://www.geneontology.org/GO.doc.shtml. 18.03.2011.

Goble, C. & Stevens, R. (2008). State of the nation in data integration for bioinformatics, *Journal of biomedical informatics* 41(5): 687–693.

Grau, B., Horrocks, I., Motik, B., Parsia, B., Patel-Schneider, P. & Sattler, U. (2008). Owl 2: The next step for owl, *Web Semantics: Science, Services and Agents on the World Wide Web* 6(4): 309–322.

Gruber, T. (1993). A translation approach to portable ontology specifications, *Knowl Acquis* 5: 199–220.

Gruber, T. (1995). Toward principles for the design of ontologies used for knowledge sharing, *Int J Hum-Comput St* 43(5): 907–928.

Guarino, N. (1998). Formal ontology in information systems, *Formal ontology in information systems: proceedings of the first international conference (FOIS'98), June 6-8, Trento, Italy*, IOS Press.

Haase, P., Mathäß, T. & Ziller, M. (2010). An evaluation of approaches to federated query processing over linked data, *Proceedings of the 6th International Conference on Semantic Systems*, ACM, pp. 1–9.

Hernandez, T. & Kambhampati, S. (2004). Integration of biological sources: current systems and challenges ahead, *SIGMOD Rec.* 33(3): 51–60.

Hitzler, P., Krötzsch, M., Rudolph, S. & Sure, Y. (2008). *Semantic Web: Grundlagen*, Springer.

Hoehndorf, R., Oellrich, A., Dumontier, M., Kelso, J., Rebholz-Schuhmann, D. & Herre, H. (2010). Relations as patterns: bridging the gap between obo and owl, *BMC Bioinf* 11(1): 441.

IFOMIS, Saarland University (2010). Basic formal ontology (bfo), URL: http://www.ifomis.org/bfo. 18.03.2011.

Jacquemin, C. (2001). *Spotting and discovering terms through natural language processing*, MIT press Cambridge, MA.

Kanehisa-Laboratories (n.d.). Kegg: Kyoto encyclopedia of genes and genomes, URL: http://www.genome.jp/kegg. 18.03.2011.

Kashyap, V. & Borgida, A. (2003). Representing the umls semantic network using owl, *The SemanticWeb - ISWC 2003*, Vol. 2870 of *Lecture Notes in Computer Science*, Springer Berlin / Heidelberg, pp. 1–16.

Kei-Hoi, C., Robert, F., Scott, M., Matthias, S., Jun, Z. & Adrian, P. (2009). A journey to semantic web query federation in the life sciences, *BMC Bioinf* 10(Suppl 10): S10.

Kersey, P., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E. & Apweiler, R. (2004). The international protein index: an integrated database for proteomics experiments, *Proteomics* 4(7): 1985–1988.

Keshava Prasad, T., Goel, R., Kandasamy, K., Keerthikumar, S., Kumar, S., Mathivanan, S., Telikicherla, D., Raju, R., Shafreen, B., Venugopal, A. et al. (2008). Human protein reference database–2009 update, *Nucleic Acids Res.* 37(Database issue): D767–D772.

Krallinger, M., Valencia, A. & Hirschman, L. (2008). Linking genes to literature: text mining, information extraction, and retrieval applications for biology, *Genome Biol* 9(Suppl 2): S8.

Kugler, K., Tejada, M., Baumgartner, C., Tilg, B., Graber, A. & Pfeifer, B. (2008). Bridging data management and knowledge discovery in the life sciences, *Open Bioinformatics Journal* 2: 28–36.

Le Grand, B. & Soto, M. (2002). Visualisation of the semantic web: Topic maps visualisation, *Sixth International Conference on Information Visualisation, 2002. Proceedings*, pp. 344–349.

Masolo, C., Borgo, S., Gangemi, A., Guarino, N. & Oltramari, A. (2003). Wonderweb deliverable d18, URL: http://www.loa-cnr.it/Papers/DOLCE2.1-FOL.pdf. 18.03.2011.

Müller, H., Rangarajan, A., Teal, T. & Sternberg, P. (2008). Textpresso for neuroscience: searching the full text of thousands of neuroscience research papers, *Neuroinformatics* 6(3): 195–204.

Navigli, R. & Velardi, P. (2004). Learning domain ontologies from document warehouses and dedicated web sites, *Comput Linguist* 30(2): 151–179.

NCBI (2004). Genbank overview, URL: http://www.ncbi.nlm.nih.gov/genbank/GenbankOverview.html. 18.03.2011.

NCBI (n.d.). Gene expression omnibus, URL: http://www.ncbi.nlm.nih.gov/geo. 18.03.2011.

Neumann, E., Miller, E. & Wilbanks, J. (2004). What the semantic web could do for the life sciences, *Drug Discov Today* 2(6): 228–236.

OBO Foundry (n.d.). The open biological and biomedical ontologies, URL: http://www.obofoundry.org. 18.03.2011.

Ouksel, A. & Sheth, A. (1999). Semantic interoperability in global information systems, *SIGMOD Rec.* 28(1): 5–12.

Perez-Rey, D., Maojo, V., Garcia-Remesal, M., Alonso-Calvo, R., Billhardt, H., Martin-Sanchez, F. & Sousa, A. (2006). Ontofusion: Ontology-based integration of genomic and clinical databases, *Comput Biol Med* 36(7-8): 712–730.

Pfeifer, B., Aschaber, J., Baumgartner, C., Dreiseitl, S., Modre-Osprian, R., Schreier, G. & Tilg, B. (2007). A life science data warehouse system to enable systems biology in prostate cancer, *4th International Workshop, p 9ff DILS 2007, Pennsylvania, USA*.

Pollock, J. (2009). *Semantic Web for Dummies*, For Dummies.

Quilitz, B. & Leser, U. (2008). Querying distributed rdf data sources with sparql, *ESWC'08: Proceedings of the 5th European semantic web conference on The semantic web*, Springer-Verlag, Berlin, Heidelberg, pp. 524–538.

Richter, J. (2006). The obo flat file format specification, version 1.2, URL: http://www.geneontology.org/GO.format.obo-1\_2.shtml. 18.03.2011.

Schulz, S., Beisswanger, E., Van Den Hoek, L., Bodenreider, O. & Van Mulligen, E. (2009). Alignment of the umls semantic network with biotop: methodology and assessment, *Bioinformatics* 25(12): i69–i76.

Shadbolt, N., Hall, W. & Berners-Lee, T. (2006). The semantic web revisited, *IEEE Intell Syst App* 21(3): 96–101.

Sirin, E., Parsia, B., Grau, B., Kalyanpur, A. & Katz, Y. (2007). Pellet: A practical owl-dl reasoner, *Web Semantics: science, services and agents on the World Wide Web* 5(2): 51–53.

Smith, A., Cheung, K., Yip, K., Schultz, M. & Gerstein, M. (2007). Linkhub: a semantic web system that facilitates cross-database queries and information retrieval in proteomics, *BMC Bioinf* 8(Suppl 3): S5.

Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L., Eilbeck, K., Ireland, A., Mungall, C. & others (2007). The obo foundry coordinated evolution of ontologies to support biomedical data integration, *Nat Biotechnol* 25(11): 1251–1255.

Smith, B., Ceusters, W., Klagges, B., Köhler, J., Kumar, A., Lomax, J., Mungall, C., Neuhaus, F., Rector, A. & Rosse, C. (2005). Relations in biomedical ontologies, *Genome Biol* 5: R46.

Spasic, I., Ananiadou, S., McNaught, J. & Kumar, A. (2005). Text mining and ontologies in biomedicine: making sense of raw text, *Brief Bioinform* 6(3): 239–251.

Stanford University (n.d.). Stanford microarray database, URL: http://smd.stanford.edu. 18.03.2011.

Stein, L. (2003). Integrating biological databases, *Nat Rev Genet* 4(5): 337–345.

Stenzhorn, H., Schulz, S., Beißwanger, E., Hahn, U., Van Den Hoek, L. & Van Mulligen, E. (2008). Biotop and chemtop–top-domain ontologies for biology and chemistry, *7th International Semantic Web Conference (ISWC)*, Vol. 401, Citeseer.

Studer, R., Benjamins, V. & Fensel, D. (1998). Knowledge engineering: Principles and methods, *Data Knowl Eng* 25: 161–197.

Tang, J., Li, J., Liang, B., Huang, X., Li, Y. & Wang, K. (2006). Using bayesian decision for ontology mapping, *Web Semantics: Science, Services and Agents on the World Wide Web* 4(4): 243–262.

Tsarkov, D. & Horrocks, I. (2006). Fact++ description logic reasoner: System description, *in* U. Furbach & N. Shankar (eds), *Automated Reasoning*, Vol. 4130 of *Lecture Notes in Computer Science*, Springer Berlin / Heidelberg, pp. 292–297.

U.S. National Library of Medicine (2010). About the umls, URL: http://www.nlm.nih.gov/research/umls/about\_umls.html. 18.03.2011.

Van Harmelen, F., Lifschitz, V. & Porter, B. (2008). *Handbook of knowledge representation*, Elsevier Science Ltd.

Velardi, P., Navigli, R., Cucchiarelli, A., Neri, F., Buitelaar, P., Cimiano, P. & Magnini, B. (2005). *Evaluation of OntoLearn, a methodology for automatic learning of domain ontologies*, IOS Press.

Volz, R., Oberle, D. & Studer, R. (2003). Implementing views for light-weight web ontologies, *Database Engineering and Applications Symposium, International* 0: 160–169.

W3C (2001). Xml in 10 points, URL: http://www.w3.org/XML/1999/XML-in-10-points.html.en. 18.03.2011.

W3C (2004a). Owl web ontology language overview, URL: http://www.w3.org/TR/2004/REC-owl-features-20040210. 18.03.2011.

W3C (2004b). Owl web ontology language reference, URL: http://www.w3.org/TR/owl-ref. 18.03.2011.

W3C (2004c). Rdf primer, URL: http://www.w3.org/TR/2004/REC-rdf-primer-20040210. 18.03.2011.

W3C (2004d). Rdf test cases, URL: http://www.w3.org/TR/rdf-testcases. 18.03.2011.

W3C (2004e). Rdf vocabulary description language 1.0: Rdf schema, URL: http://www.w3.org/TR/rdf-schema. 18.03.2011.

W3C (2004f). Rdf/xml syntax specification (revised), URL: http://www.w3.org/TR/rdf-syntax-grammar. 18.03.2011.

W3C (2004g). Resource description framework (rdf): Concepts and abstract syntax, URL: http://www.w3.org/TR/2004/REC-rdf-concepts-20040210. 18.03.2011.

W3C (2005). Primer: Getting into rdf & semantic web using n3, URL: http://www.w3.org/2000/10/swap/Primer. 18.03.2011.

W3C (2008a). Sparql protocol for rdf, URL: http://www.w3.org/TR/rdf-sparql-protocol. 18.03.2011.

W3C (2008b). Sparql query results xml format, URL: http://www.w3.org/TR/rdf-sparql-XMLres. 18.03.2011.

W3C (2008c). Turtle - terse rdf triple language, URL: http://www.w3.org/TeamSubmission/turtle. 18.03.2011.

W3C (2009a). Owl 2 web ontology language document overview, URL: http://www.w3.org/TR/owl2-overview. 18.03.2011.

W3C (2009b). Rdb2rdf working group charter, URL: http://www.w3.org/2009/08/rdb2rdf-charter.html. 18.03.2011.

W3C (2009c). W3c rdb2rdf incubator group report, URL: http://www.w3.org/2005/Incubator/rdb2rdf/XGR-rdb2rdf-20090126. 18.03.2011.

W3C (2010a). Resource description framework (rdf), URL: http://www.w3.org/RDF. 18.03.2011.

W3C (2010b). W3c rdb2rdf incubator group, URL: http://www.w3.org/2005/Incubator/rdb2rdf. 18.03.2011.



## **Part 3**

**Data Mining and Applications** 



### **Vector Space Information Retrieval Techniques for Bioinformatics Data Mining**

Eric Sakk and Iyanuoluwa E. Odebode *Department of Computer Science, Morgan State University, Baltimore, MD USA*

#### **1. Introduction**

Information retrieval (IR) can be defined as the set of processes involved in querying a collection of objects in order to extract relevant data and information Dominich (2010); Grossman & Frieder (2004). Within this paradigm, various models ranging from deterministic to probabilistic have been applied. The goal of this chapter is to invoke a mathematical structure on bioinformatics database objects that facilitates the use of vector space techniques typically encountered in text mining and information retrieval systems Berry & Browne (2005); Langville & Meyer (2006).

Several choices and approaches exist for encoding bioinformatics data such that database objects are transformed and embedded in a linear vector space Baldi & Brunak (1998). Hence, part of the key to developing such an approach lies in invoking an algebraic structure that accurately reflects relevant features within a given database. Some attention must therefore be devoted to the numerical encoding of bioinformatics objects such that relevant biological and chemical characteristics are preserved. Furthermore, the structure must also prove useful for operations typical of data mining such as clustering, knowledge discovery and pattern classification. Under these circumstances, the vector space approach affords us the latitude to explore techniques analogous to those applied in text information retrieval Elden (2004); Feldman & Sanger (2007); Grossman & Frieder (2004).

While the methods presented in this chapter are quite general and readily applicable to various categories of bioinformatics data such as text, sequence, or structural objects, we focus this work on amino acid sequence data. Specifically, we apply the BLOCKS protein sequence database Henikoff et al. (2000); Pietrokovski et al. (1996) as the template for testing the applied techniques. It is demonstrated that the vector space approach is consistent with pattern search and classification methodologies commonly applied within the bioinformatics literature Baldi & Brunak (1998); Durbin et al. (2004); Wang et al. (2005). In addition, various subspace decomposition approaches are presented and applied to the pattern search and pattern classification problems.

To summarize, the main contribution of this work is directed towards bioinformatics data mining. We demonstrate that information measures derived from the vector space approach are consistent with and, in many cases, reduce to those typically applied in the bioinformatics literature. In addition, we apply the BLOCKS database in order to demonstrate database search and information retrieval techniques such as

• Pattern Classification

• Compositional Inferences from the Vector Space Models

• Clustering

• Knowledge Discovery

The chapter is outlined in Figure 1 as follows. Section 2 provides basic background regarding information retrieval and bioinformatics techniques applied in this work. Given this foundation, Section 3 presents various approaches to encoding bioinformatics sequence data. Section 4 then introduces the subspace decomposition methodology for the vector space approach. Finally, Section 5 develops the approach in the context of various applications listed in Figure 1.

Fig. 1. Flowchart for the chapter

#### **2. Overview and notation**

Part of the goal of this chapter is to phrase the bioinformatics database mining problem in terms of vector space IR (information retrieval) techniques; hence, this section is devoted toward reviewing terms and concepts relevant to this work. In addition, definitions, mathematical notation and conventions for elements such as vectors and matrices are introduced.

#### **2.1 Vector space approach to information retrieval**

Information retrieval can be thought of as a collection of techniques designed to search through a set of objects (e.g. contained within a database, on the internet, etc) in order to extract information that is relevant to the query. Such techniques are applicable, for example, to the design of search engines, as well as performing data mining, text mining, and text categorization Berry & Browne (2005); Elden (2004); Feldman & Sanger (2007); Hand et al. (2001); Langville & Meyer (2006); Weiss et al. (2005). One specific category of this field that has proven useful for the design of search engines and constructing vector space models for text retrieval is known as Latent Semantic Indexing (LSI) Berry et al. (1999; 1995); Deerwester et al. (1990); Salton & Buckley (1990). Using the LSI approach, textual data is transformed (or '*encoded*') into numeric vectors. Matrix analysis techniques Golub & Van Loan (1989) are then applied in order to quantify semantic relationships within the textual data.

|        | Document 1 | Document 2 | Document 3 | Document 4 |
|--------|------------|------------|------------|------------|
| Term 1 | 1          | 0          | 1          | 0          |
| Term 2 | 0          | 1          | 1          | 0          |
| Term 3 | 1          | 1          | 0          | 1          |
| Term 4 | 1          | 1          | 0          | 1          |
| Term 5 | 1          | 0          | 1          | 0          |

Table 1. Example of a 5 × 4 term-document matrix.

Consider categorizing a set of *m* documents based upon the presence or absence of a list of *n* selected terms. Under these circumstances, an *n* × *m* term-document matrix can be constructed where each entry in the matrix might reflect the weighted frequency of occurrence of each term of interest. Table 1 provides an example; in this case, a matrix column vector defines the frequency of occurrence of each term in a given document. Such a construction immediately facilitates the application of matrix analysis for the sake of quantifying the degree of similarity between a query vector and the document vectors contained within the term-document matrix.

Given an *n* × *m* term-document matrix *A*, consider an *n* × 1 vector *q* constructed from a query document whose components reflect the presence or absence of entries in the same list of *n* terms used to construct the matrix *A*. The question then naturally arises of how one might quantify the similarity between the query vector *q* and the columns of the term-document matrix *A*. Defining such a similarity measure immediately leads to a scoring scheme that can be used to order results from most relevant to least relevant (i.e. induce a '*relevance score*').

Given the vector space approach, a natural measure of similarity arises from the inner product. Assuming a 2-norm, if both *q* and the columns of *A* have been normalized to unit magnitude, then the inner product between *q* and the *j*th column vector *a<sub>j</sub>* of *A* becomes

$$q^T a_j = \|q\| \, \|a_j\| \cos \theta_j = \cos \theta_j \tag{1}$$

(where the '*T*' superscript denotes the transpose). Since all components of *q* and *A* are non-negative, all inner products will evaluate to a value such that 0 ≤ cos *θ<sub>j</sub>* ≤ 1. A query that is similar to a document yields a value near one, indicating a small angle between the query and the column vector; a dissimilar query yields a value near zero, indicating near-orthogonality. This specific measure is called the '*cosine similarity*' and is abbreviated as

$$\cos \theta = q^T A \tag{2}$$

where cos *θ* represents a row vector whose components quantify the *relevance* between the query and each column vector of *A*.
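As a minimal sketch of Equations (1) and (2), the following Python fragment scores a query against the term-document matrix of Table 1. Only the matrix entries come from Table 1; the query vector `q` and the helper names `unit` and `cosine_scores` are illustrative assumptions rather than part of the chapter.

```python
import math

# Term-document matrix of Table 1 (rows: terms 1-5, columns: documents 1-4).
A = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]


def unit(v):
    """Scale a vector to unit 2-norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine_scores(A, q):
    """Return cos(theta_j) between q and every column a_j of A."""
    q_hat = unit(q)
    columns = [unit(col) for col in zip(*A)]  # normalized document vectors
    return [sum(qi * ai for qi, ai in zip(q_hat, col)) for col in columns]


# Hypothetical query containing only terms 3 and 4.
q = [0, 0, 1, 1, 0]
scores = cosine_scores(A, q)

# Document indices (0-based) ordered from most to least relevant.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

Under these assumptions, Document 4 ranks first because its column is parallel to the query, while Document 3 shares no terms with the query and scores zero.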

Given the vector space approach, LSI (latent semantic indexing) goes a step further in order to infer semantic dependencies that are not immediately obvious from the raw data contained in the term-document matrix. In terms of linear algebra, the LSI methodology translates into characterizing the column space of *A* based upon some preferred matrix decomposition. A tool commonly applied in this arena is the Singular Value Decomposition (SVD) Golub & Van Loan (1989) where the term-document matrix is factored as follows:

$$A = U \Sigma V^T \tag{3}$$


where *U* is an *n* × *n* orthogonal matrix (i.e. *U*<sup>−1</sup> = *U*<sup>*T*</sup>) and *V* is an *m* × *m* orthogonal matrix (i.e. *V*<sup>−1</sup> = *V*<sup>*T*</sup>). Furthermore, Σ is an *n* × *m* diagonal matrix of singular values such that

$$
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0 \tag{4}
$$

where *r* = rank(*A*) and *σ<sub>i</sub>* ≡ Σ<sub>*ii*</sub>. It turns out that the first *r* columns of *U* define an orthonormal basis for the column space of the matrix *A*. This basis defines the underlying character of the document vectors and can be used to infer linear dependencies between them. Furthermore, it is possible to expand the matrix *A* in terms of the SVD:

$$A = \sum_{j=1}^{r} \sigma_j u_j v_j^T \tag{5}$$

where *u<sub>j</sub>* and *v<sub>j</sub>* represent the *j*th columns of *U* and *V*. This expansion weights each product *u<sub>j</sub>v<sub>j</sub><sup>T</sup>* by the associated singular value *σ<sub>j</sub>*. Hence, if there is a substantial decreasing trend in the singular values such that *σ<sub>j</sub>*/*σ*<sub>1</sub> ≪ 1 for all *j* > *L*, one is then led to truncate the above series in order to focus on the first *L* terms that are responsible for a non-negligible contribution to the expansion. This truncation is called the *low rank approximation* to *A*:

$$A \approx \sum_{j=1}^{L} \sigma_j u_j v_j^T \tag{6}$$

The low rank approximation describes, among other aspects, the degree to which each basis vector in *U* contributes to the matrix *A*. Furthermore, the subspace defined by the first *L* columns of *U* is useful for inferring linear dependencies in the original document space.
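The expansion in Equation (5) and its truncation in Equation (6) can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses NumPy's `numpy.linalg.svd`, reuses the Table 1 matrix as data, and introduces the helper name `low_rank` and a relative tolerance of `1e-10` for estimating the rank. It also relies on the standard fact that the Frobenius-norm error of the rank-*L* truncation equals the root sum of squares of the discarded singular values.

```python
import numpy as np

# Term-document matrix of Table 1, used here purely as sample data.
A = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

# U is n x n, Vt is V^T (m x m), s holds the singular values in decreasing order.
U, s, Vt = np.linalg.svd(A)

# Numerical rank: number of singular values above a relative tolerance (assumed).
r = int(np.sum(s > 1e-10 * s[0]))


def low_rank(U, s, Vt, L):
    """Sum of the first L terms sigma_j * u_j * v_j^T of the SVD expansion."""
    return (U[:, :L] * s[:L]) @ Vt[:L, :]


for L in range(1, r + 1):
    # Frobenius norm of the residual left by the rank-L truncation.
    error = np.linalg.norm(A - low_rank(U, s, Vt, L))
    print(f"L={L}  sigma_L/sigma_1={s[L - 1] / s[0]:.3f}  error={error:.3f}")
```

Inspecting the printed ratios *σ<sub>L</sub>*/*σ*<sub>1</sub> is one simple way to choose the truncation level *L*; other selection criteria are equally possible.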

#### **2.2 Bioinformatics**

Given this abbreviated overview of vector space approaches to information retrieval, we now put it in the context of bioinformatics research. In particular, the SVD has been applied in many contexts as it can be thought of as a deterministic version of principal component analysis Wall et al. (2003). One specific area of honorable mention is pioneering work dealing with the analysis of microarray data Alter et al. (2000a;b); Kuruvilla et al. (2004).

With regard to information retrieval and LSI in bioinformatics Done (2009); Khatri et al. (2005); Klie et al. (2008), research in this area devoted to phylogenetics and multiple sequence alignment Couto et al. (2007); Stuart & Berry (2004); Stuart, Moffett & Baker (2002) has been reported. Much of this work can be traced back to initial foundations where the encoding of protein sequences has been performed using the frequency of occurrence of amino acid *k*-grams Stuart, Moffett & Leader (2002). Using the *k*-gram approach, column vectors in the data matrix (i.e. what was previously referred to as the 'term-document matrix') are encoded amino acid sequences and their components are the frequency of occurrence of each possible *k*-gram within each sequence. For example, if amino acids are taken *k* = 3 at a time, then there exist *n* = 20<sup>*k*</sup> = 8000 possible 3-grams. Assuming there are *m* amino acid sequences, the associated data matrix will be *n* × *m* = 8000 × *m*. For each amino acid sequence, a sliding, overlapping window of length *k* is used to count the frequency of occurrence of each *k*-gram, and the resulting counts are entered into the corresponding column of the data matrix *A*.
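The sliding-window count described above can be sketched in a few lines of Python. The alphabet ordering (one-letter codes in alphabetical order), the function name `kgram_vector`, the example sequence, and the choice to skip windows containing non-standard symbols are all assumptions made for illustration.

```python
from itertools import product

# The 20 standard amino acids as one-letter codes (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kgram_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Count every k-gram of `sequence` with a sliding, overlapping window.

    Returns a list of length len(alphabet)**k, i.e. one column of the
    n x m data matrix.
    """
    # Map each possible k-gram to a fixed row index.
    index = {"".join(gram): i
             for i, gram in enumerate(product(alphabet, repeat=k))}
    x = [0] * len(index)
    for start in range(len(sequence) - k + 1):
        gram = sequence[start:start + k]
        if gram in index:  # windows with non-standard symbols are skipped
            x[index[gram]] += 1
    return x


# Hypothetical 10-residue sequence: 8 overlapping 3-grams, "MKV" occurring twice.
x = kgram_vector("MKVLAAGMKV")
```

For *k* = 3 the vector has 8000 components, almost all of them zero for a sequence of realistic length, which is consistent with the sparsity noted for vector encodings of this kind.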

The goal of this chapter is to build upon the IR and bioinformatics foundation in order to introduce novel perspectives on operations and computations commonly encountered in bioinformatics such as the consensus sequence, position specific scoring matrices (PSSM), database searches, pattern classification, clustering and multiple alignments. In doing so, it is our intent that the reader's view of these tools will be expanded toward novel applications beyond those presented here.

#### **3. Sequence encoding**


Many choices exist for the encoding and weighting of entries within the term-document matrix; in addition, there exists a wide range of possibilities for matrix decompositions as well as the construction of similarity and scoring measures Elden (2004); Feldman & Sanger (2007); Hand et al. (2001); Weiss et al. (2005). The goal of this chapter is not to expand on the set of choices for the sake of text retrieval and generic data mining; instead, we must focus on techniques and approaches that are relevant to bioinformatics. Specifically, our attention in this section is devoted toward developing and presenting novel encoding schemes that preserve relevant biological and chemical properties of genomic data.

An assortment of methods has been proposed and studied for converting a protein from its amino acid sequence space into a numerical vector Bacardit et al. (2009); Baldi & Brunak (1998); Bordo & Argos (1991); Stuart, Moffett & Leader (2002). Scalar techniques generally assign a real number that relates an amino acid to some physically measurable property (e.g. volume, charge, hydrophobicity) Andorf et al. (2002); Eisenberg et al. (1984); Kyte & Doolittle (1982); Wimley & White (1996). On the other hand, orthogonal or 'standard' vector encoding techniques Baldi & Brunak (1998) embed each amino acid into a *k* dimensional vector space where *k* is the number of symbols. For example, if *k* = 20 (as it would be for the complete amino acid alphabet), the *j*th amino acid, where 1 ≤ *j* ≤ 20, is represented by a 20 dimensional vector that is assigned a one at the *j*th position and zero in every other position. In general, standard encoding transforms a sequence of length *L* into an *n* = *Lk* dimensional vector. As an example, consider the DNA alphabet 𝒜 = {*A*, *G*, *C*, *T*}. In this case *k* = 4 and standard encoding transforms the alphabet symbols as

$$A = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \quad G = \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix}, \quad C = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \end{pmatrix}, \quad T = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}. \tag{7}$$

Therefore, for an example sequence *s* = *AT* with *L* = 2, this encoding method yields the following vector of dimension *n* = *Lk* = 8:

$$x^T = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$

Observe that, for typical values of *L*, assuming a data set of *m* sequences, standard encoding leads to an *n* × *m* data matrix that is sparse.
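A short sketch of standard encoding, following the symbol order of Equation (7), may make the construction and its sparsity concrete. The function names `standard_encode` and `data_matrix` and the three example sequences are hypothetical; the sketch also assumes that all sequences share the same length *L*.

```python
DNA_ALPHABET = "AGCT"  # symbol order used in Equation (7)


def standard_encode(sequence, alphabet=DNA_ALPHABET):
    """Map a length-L sequence to an (L * k)-dimensional 0/1 vector."""
    k = len(alphabet)
    x = []
    for symbol in sequence:
        block = [0] * k
        block[alphabet.index(symbol)] = 1  # raises ValueError on unknown symbols
        x.extend(block)
    return x


def data_matrix(sequences, alphabet=DNA_ALPHABET):
    """Return the n x m matrix (as a list of rows) whose columns are encoded sequences."""
    columns = [standard_encode(s, alphabet) for s in sequences]
    if len({len(col) for col in columns}) != 1:
        raise ValueError("all sequences must share the same length L")
    return [list(row) for row in zip(*columns)]


x = standard_encode("AT")            # the example sequence s = AT
X = data_matrix(["AT", "GC", "TT"])  # hypothetical data set with m = 3

# Fraction of non-zero entries: exactly one entry per block of k is set.
density = sum(map(sum, X)) / (len(X) * len(X[0]))
```

Because each block of *k* entries contains a single one, the fraction of non-zero entries is 1/*k* regardless of *L* and *m*: 0.25 for DNA and 0.05 for the 20-letter amino acid alphabet.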

In bioinformatics, given the limitations on biological measurement, the number of experimental observations tends to be limited and values of *m* are often small with respect to *n*. Under these conditions, it is often the case that vector encoding methodologies lead to sparse data matrices (as is the case for text retrieval applications) in high dimensional vector spaces. Observe, for example, that the *k*-gram method reviewed in Section 2.2 fits this description.

We can expand upon the standard encoding approach by categorizing the standard amino acid alphabet into families that take into account physical and chemical characteristics derived from the literature Andorf et al. (2002); Baldi & Brunak (1998). In addition, entries within the data matrix can be weighted based upon their hydrophobicity Eisenberg et al. (1984); Kyte & Doolittle (1982). Table 2 introduces alphabet symbols used to group amino acids according to hydrophobicity, charge and volume. Tables 3-5 show examples of various encoding schemes that we apply for this analysis.
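To illustrate a grouped encoding, the sketch below maps each residue to one of the four groups listed in Table 4 (RU, HN, HP, HU) and then applies the standard encoding over the group symbols, so that *k* = 4 instead of 20. Treating each group as one axis of a one-hot block is an assumption made for illustration, and the names `TABLE4_GROUPS`, `GROUP_ORDER` and `group_encode` are hypothetical.

```python
# Group membership copied from Table 4 (charged hydrophobic/hydrophilic encoding).
TABLE4_GROUPS = {
    "RU": "AILMFPWV",
    "HN": "DE",
    "HP": "RHK",
    "HU": "NCQGSTY",
}
GROUP_ORDER = ["RU", "HN", "HP", "HU"]  # assumed axis order, following the table rows
RESIDUE_TO_GROUP = {
    residue: group for group, members in TABLE4_GROUPS.items() for residue in members
}


def group_encode(sequence):
    """One-hot encode each residue by its Table 4 group (k = 4 symbols)."""
    k = len(GROUP_ORDER)
    vector = []
    for residue in sequence:
        block = [0] * k
        block[GROUP_ORDER.index(RESIDUE_TO_GROUP[residue])] = 1
        vector.extend(block)
    return vector


print(group_encode("AKD"))  # [1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
```

Compared with the 20-symbol alphabet, the grouped alphabet shortens each encoded sequence by a factor of five, at the cost of no longer distinguishing residues within a group.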


| Property | Symbols |
|---|---|
| Hydrophobicity | R = hydrophobic, H = hydrophilic |
| Charge | P = positive, N = negative, U = uncharged |
| Volume | S = small, M = medium, ML = medium-large, L = large |

Table 2. Encoding symbols applied in Tables 3-5

| Symbol | Index | Amino acids |
|---|---|---|
| R | 1 | A, I, L, M, F, P, W, V, D, E |
| H | 3 | R, H, K, N, C, Q, G, S, T, Y |

Table 3. Hydrophobic/Hydrophilic Encoding

| Symbol | Index | Amino acids |
|---|---|---|
| RU | 1 | A, I, L, M, F, P, W, V |
| HN | 2 | D, E |
| HP | 3 | R, H, K |
| HU | 4 | N, C, Q, G, S, T, Y |

Table 4. Charged Hydrophobic/Hydrophilic Encoding

| Symbol | Index | Amino acids |
|---|---|---|
| RUS | 1 | A |
| RUM | 2 | F |
| RUML | 3 | I, L, M, V |
| RUL | 4 | F, W |
| HPML | 5 | R, H, K |
| HNM | 6 | D |
| HNML | 7 | E |
| HUS | 8 | G, S |
| HUM | 9 | N, C, V |
| HUML | 10 | Q |
| HUL | 11 | Y |

Table 5. Volume/Charged Hydrophobic/Hydrophilic Encoding

#### **4. Subspace decompositions for pattern classification**

LSI techniques necessarily require the application of matrix decompositions such as the SVD to infer column vector dependencies in the data matrix. Decompositions of this kind can lead to the construction of subspaces that can mathematically categorize subsets of sequences into families. Furthermore, since these families define specific classes of data, they can be used as training data in order to perform database searches and pattern classification. The application of linear subspaces for the sake of pattern classification Oja (1983) consists of applying orthogonal projection operators based upon the training classes (an orthogonal projection operator *P* obeys *P* = *P<sup>T</sup>* and *P*<sup>2</sup> = *P*).

#### **4.1 Orthogonal projections**

To begin, let us assume there are training sequences of known classification that can be categorized into *M* distinct classes and that the *i*th class contains *m<sub>i</sub>* encoded vectors of dimension *n*. For each class, an *n* × *m<sub>i</sub>* matrix *A<sub>i</sub>* can be constructed (assuming the training vectors are column vectors). To characterize the linear subspace generated by each class, we can apply the singular value decomposition (SVD) Golub & Van Loan (1989). In addition to providing us with an orthonormal basis for each class, we can also glean some information about the influence of the singular values and singular vectors from the rank approximants. Class data matrices are therefore decomposed as

$$A\_i = U\_i \Sigma\_i V\_i^T \tag{8}$$

where *U<sub>i</sub>* is an *n* × *n* orthogonal matrix, Σ<sub>*i*</sub> is an *n* × *m<sub>i</sub>* matrix whose diagonal contains the singular values and *V<sub>i</sub>* is an *m<sub>i</sub>* × *m<sub>i</sub>* orthogonal matrix. Assume the rank of each data matrix *A<sub>i</sub>* is *r<sub>i</sub>* and let *Q<sub>i</sub>* denote the *n* × *r<sub>i</sub>* matrix formed from the first *r<sub>i</sub>* columns of *U<sub>i</sub>*. Given the properties of the SVD, the columns of *Q<sub>i</sub>* define an orthonormal basis for the column space of *A<sub>i</sub>*. Hence, an orthogonal projection operator for the *i*th class is established by computing

$$P\_i = Q\_i Q\_i^T.\tag{9}$$

(Given that the SVD induces *U<sub>i</sub><sup>T</sup>* = *U<sub>i</sub>*<sup>−1</sup>, it is straightforward to check that *P<sub>i</sub>*<sup>2</sup> = *P<sub>i</sub>* and *P<sub>i</sub><sup>T</sup>* = *P<sub>i</sub>*.)

Consider an *n* × 1 query vector *x* whose classification is unknown. The class membership of *x* can be ascertained by identifying the class yielding the maximum projection norm:

$$\mathcal{C}(\mathbf{x}) \equiv \underset{i=1,\cdots,M}{\arg\max} \ ||P\_i \mathbf{x}||. \tag{10}$$

One computational convenience of constructing the orthonormal bases *Q<sub>i</sub>* is that it is not necessary to compute the projections when making this decision. Given any *Q* with orthonormal columns and orthogonal projection *P* = *QQ<sup>T</sup>* such that *P*<sup>2</sup> = *P* and *P* = *P<sup>T</sup>*, observe that

$$\begin{aligned} ||P\mathbf{x}||^2 &= \mathbf{x}^T P^T P \mathbf{x} = \mathbf{x}^T P^2 \mathbf{x} \\ &= \mathbf{x}^T P \mathbf{x} = \mathbf{x}^T Q Q^T \mathbf{x} \\ &= ||Q^T \mathbf{x}||^2 = ||\mathbf{x}^T Q||^2. \end{aligned} \tag{11}$$

Under these circumstances, to decide class membership, Equation (10) reduces to

$$\mathcal{C}(\mathbf{x}) \equiv \operatorname\*{arg\,max}\_{i=1,\cdots,M} ||\mathbf{x}^T \mathbf{Q}\_i||. \tag{12}$$

Furthermore, the values ||*x<sup>T</sup>Q<sub>i</sub>*|| immediately yield relevance scores and confidence measures for each class.
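The classification rule of Equations (8)-(12) can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the toy class matrices, the rank tolerance `tol` and the function names `class_basis` and `classify` are all hypothetical, and class labels are zero-based.

```python
import numpy as np


def class_basis(A, tol=1e-10):
    """Return Q: the first r left singular vectors of A, r being the numerical rank."""
    U, s, _ = np.linalg.svd(A, full_matrices=True)  # Equation (8)
    r = int(np.sum(s > tol * s[0])) if s.size and s[0] > 0 else 0
    return U[:, :r]


def classify(x, bases):
    """Equation (12): choose the class maximizing ||x^T Q_i||; also return all scores."""
    scores = [float(np.linalg.norm(x @ Q)) for Q in bases]
    return int(np.argmax(scores)), scores


# Toy data: two classes in R^4; columns are encoded training vectors.
A1 = np.array([[1.0, 1.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
A2 = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
bases = [class_basis(A1), class_basis(A2)]

x = np.array([0.6, 0.8, 0.0, 0.0])  # unit-norm query lying in the first class subspace
label, scores = classify(x, bases)
print(label, scores)
```

Only the *n* × *r<sub>i</sub>* bases are stored and multiplied; the *n* × *n* projection matrices of Equation (9) are never formed, which is the computational convenience noted above.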


#### **4.2 Characterization of the orthogonal complement**

It is important to note that the union of all the class subspaces *need not* be equal to the *n* dimensional vector space from which all data vectors are derived. To perform a complete orthogonal decomposition of the *n* dimensional vector space in terms of the data, we first define the matrix

$$A \equiv [A\_1 \ \cdots \ \ A\_M]. \tag{13}$$

The goal then is to characterize the null space N(*A<sup>T</sup>*), the subspace which is orthogonal to the column space of *A*. Assuming the rank of *A* is *r<sub>A</sub>*, computing the SVD

$$A = U\_A \Sigma\_A V\_A^T \tag{14}$$

and forming the matrix *Q<sub>A</sub>* from the first *r<sub>A</sub>* columns of *U<sub>A</sub>* yields an orthonormal basis for the subspace generated by *all* class vectors. Hence, a projection operator for this subspace is constructed as

$$P\_A = Q\_A Q\_A^T. \tag{15}$$

In addition, a projection for the orthogonal complement N(*Q<sub>A</sub><sup>T</sup>*) of *A* is then easily formed via

$$P\_{A^\perp} = I\_n - P\_A \tag{16}$$

where *I<sub>n</sub>* is the *n* × *n* identity matrix. A complete orthogonal decomposition Lay (2005) of a vector *x* ∈ ℝ<sup>*n*</sup> can then be determined from

$$\mathbf{x} = P\_A \mathbf{x} + P\_{A^\perp} \mathbf{x}.\tag{17}$$
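A small numerical sketch of Equations (13)-(17) is given below. The class matrices and the rank tolerance are hypothetical values chosen only to make the decomposition easy to inspect.

```python
import numpy as np

# Hypothetical class matrices in R^4; columns are encoded training vectors.
A1 = np.array([[1.0], [1.0], [0.0], [0.0]])
A2 = np.array([[0.0], [1.0], [1.0], [0.0]])

A = np.hstack([A1, A2])                  # Equation (13)
U, s, _ = np.linalg.svd(A)               # Equation (14)
r = int(np.sum(s > 1e-10 * s[0]))        # numerical rank r_A
Q_A = U[:, :r]
P_A = Q_A @ Q_A.T                        # Equation (15)
P_perp = np.eye(A.shape[0]) - P_A        # Equation (16)

x = np.array([1.0, 2.0, 3.0, 4.0])
x_in, x_out = P_A @ x, P_perp @ x        # the two terms of Equation (17)

print(np.allclose(x_in + x_out, x))      # the two parts rebuild x
print(np.allclose(A.T @ x_out, 0.0))     # x_out lies in the null space of A^T
```

The second check confirms that the complement component is orthogonal to every training vector, which is the defining property of N(*A<sup>T</sup>*).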

#### **4.3 Information retrieval**

Before attempting to decide the class membership of a vector *x* ∈ ℝ<sup>*n*</sup> based upon Equation (12), it is sensible to characterize the portion of the vector that contributes to the class subspace defined by *Q<sub>A</sub>*. Given Equation (17), this is most easily done by comparing ||*P<sub>A</sub>x*|| with ||*P<sub>A<sup>⊥</sup></sub>x*|| as

$$\tan(\phi) = \frac{||P\_{A^\perp}x||}{||P\_A x||} \tag{18}$$

where *φ* is the angle between *x* and the subspace defined by *QA*. Ideally, if the class subspaces have been completely characterized, tan(*φ*) should be small. Conversely, larger values of tan(*φ*) would indicate that *x* is a member of a class subspace that has not yet been defined. Under these circumstances, the orthogonal complement would have to be further characterized and partitioned in order to define more classes beyond the known *M* existing classes.

It is also possible to phrase the tangent measure as a scalar version of the more familiar cosine similarity defined above in Equation (2). If ||*x*|| = 1, the cosine similarity measure takes on a convenient form

$$\cos(\phi) = ||\mathbf{x}^T \mathbf{Q}\_A||. \tag{19}$$

To see why, consider the inner product

$$\mathbf{x}^T(P\_A\mathbf{x}) = \mathbf{x} \cdot P\_A\mathbf{x} = ||\mathbf{x}|| \, ||P\_A\mathbf{x}|| \cos(\phi). \tag{20}$$

If ||*x*|| = 1, then

$$\cos(\phi) = \frac{\mathbf{x}^T P\_A \mathbf{x}}{||P\_A \mathbf{x}||}. \tag{21}$$

However, since *PA* is an orthogonal projection

$$||P\_A \mathbf{x}|| = \sqrt{\mathbf{x}^T P\_A^T P\_A \mathbf{x}} = \sqrt{\mathbf{x}^T P\_A \mathbf{x}} \tag{22}$$

and Equation (21) can therefore be rewritten as

$$\cos(\phi) = \sqrt{\mathbf{x}^T P\_A \mathbf{x}}.\tag{23}$$

On the other hand, by applying Equation (11) to Equation (22), it follows that

$$||\mathbf{x}^T \mathbf{Q}\_A|| = \sqrt{\mathbf{x}^T \mathbf{P}\_A \mathbf{x}} \tag{24}$$

as well; hence, the equality of Equations (23) and (24) establishes Equation (19). Equation (19) should also be clear from the geometric fact that

$$\cos(\phi) = \frac{||P\_A x||}{||x||}. \tag{25}$$

Assuming ||*x*|| = 1, Equation (19) then easily follows by applying Equation (11) to Equation (25). Equations (23) and (24) are presented in order to offer additional insight by relating the inner product to the projection operator.
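The relationship between the tangent measure of Equation (18) and the cosine measure of Equation (19) can be checked numerically. In the sketch below, the helper name `reliability` and the toy basis are assumptions; the query is normalized inside the function so that ||*x*|| = 1 holds.

```python
import numpy as np


def reliability(x, Q_A):
    """Return (cos(phi), tan(phi)) for a query x and an orthonormal basis Q_A."""
    x = x / np.linalg.norm(x)                 # enforce ||x|| = 1
    proj = Q_A @ (Q_A.T @ x)                  # P_A x
    resid = x - proj                          # complement component of x
    cos_phi = float(np.linalg.norm(x @ Q_A))  # Equation (19)
    norm_in = float(np.linalg.norm(proj))
    tan_phi = float(np.linalg.norm(resid)) / norm_in if norm_in > 0 else float("inf")  # Equation (18)
    return cos_phi, tan_phi


Q_A = np.eye(3)[:, :2]                        # toy basis: the x-y plane in R^3
cos_phi, tan_phi = reliability(np.array([1.0, 1.0, 1.0]), Q_A)
print(cos_phi, tan_phi)
print(np.isclose(1.0 + tan_phi**2, 1.0 / cos_phi**2))  # both describe the same angle
```

For this query, cos(*φ*) = √(2/3) and tan(*φ*) = 1/√2, so the two measures agree on the angle between *x* and the subspace.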

Of central focus in the next section will be to apply the above projection framework to information retrieval in bioinformatics. Since the classification problem will be of significance, we note that, given the identity in Equation (19), Equation (12) can be rephrased in terms of the cosine similarity measure

$$\mathcal{C}(\mathbf{x}) \equiv \operatorname\*{arg\,max}\_{i=1,\cdots,M} \cos(\phi\_i) \tag{26}$$

where


$$\cos(\phi\_i) \equiv ||\mathbf{x}^T Q\_i||. \tag{27}$$

In addition, this measure of class membership becomes more reliable if the contribution of *x* to the orthogonal complement of the data set is small. For instance, when *φ* is small, cos(*φ*) in Equation (19) approaches unity. Therefore, cos(*φ*) can be applied as a measure of data set reliability while cos(*φ<sub>i</sub>*) can be used to produce relevance scores for *i* = 1, ··· , *M*. These conclusions are summarized in Table 6.

| Similarity measure | Purpose | Reference |
|---|---|---|
| cos(*φ*) | Data set reliability | Equation (19) |
| cos(*φ<sub>i</sub>*) | Relevance score | Equations (26) - (27) |

Table 6. Reliability and relevance measures to be applied in Section 5

#### **5. Applications**

In bioinformatics, families with similar biological function are often formed from sets of protein or nucleic acid sequences. For example, databases such as Pfam Finn et al. (2010), PROSITE Sigrist et al. (2010) and BLOCKS Pietrokovski et al. (1996) categorize sequence domains of similar function into distinct classes. Given the encodings discussed in Section 3, we seek to demonstrate how Equations (19) and (26) can be applied in order to perform sequence modeling, pattern classification and database search computations typically encountered in bioinformatics Baxevanis & Ouellette (2005); Durbin et al. (2004); Mount (2004).


#### **5.1 Consensus sequence**

A set of *m* sequences of length *L* having some related function (e.g. DNA promoter sites for a common sigma factor) is often represented in the form of an *m* × *L* matrix where each column refers to a common position in each sequence. A consensus sequence *sC* of length *L* is constructed by extracting the symbol having the highest frequency in each column. This approach to sequence model construction, while quite rudimentary, is often useful for visualizing obvious qualitative relationships amongst sequence elements.

Using the vector space approach, it is possible to recover the consensus sequence. Assuming each sequence symbol is encoded into a *k* dimensional vector, each sequence will be encoded into a vector of length *n* = *Lk* (see Section 3). Hence the original *m* × *L* matrix of sequences will be transformed into an *n* × *m* data matrix of the form described in Section 2.1. In this case, each column vector in the data matrix represents an encoded amino acid sequence.

To recover the consensus, it is useful to introduce notation for describing an empirically derived average vector *μ<sub>A</sub>* from an *n* × *m* data matrix *A* as follows:

$$
\mu\_A \equiv (\frac{1}{m})A\mathbf{e} \tag{28}
$$

where **e** is an *m* × 1 vector of ones. Then, *μ<sub>A</sub>* is an *n* × 1 column vector made up of *L* contiguous 'subvectors' of dimension *k* where the value of *k* depends upon the encoding method applied. Let *ν<sub>i</sub>* for *i* = 1, ··· , *L* represent each subvector in *μ<sub>A</sub>*; then, the *i*th symbol in the consensus sequence *s<sub>C</sub>*(*i*) can be inferred by associating the component of *ν<sub>i</sub>* yielding the highest average with the originally encoded symbol. To be precise, let the alphabet of *k* sequence symbols (e.g. DNA, amino acids, structural, text, etc.) be defined as

$$\mathcal{A} \equiv \{a\_1, a\_2, \dots, a\_k\} \tag{29}$$

and let the *j*th component of *ν<sub>i</sub>* be written as *ν<sub>ij</sub>* for *j* = 1, ··· , *k*. The subscript index of the component with the maximum average in *ν<sub>i</sub>* can therefore be extracted as

$$J = \underset{j=1,\cdots,k}{\text{arg}\,\text{max}} \,\nu\_{ij} \tag{30}$$

and the associated alphabet symbol is entered into the *i*th position of the consensus sequence as

$$s\_C(i) = a\_J \tag{31}$$

where *a<sub>J</sub>* ∈ A. The algorithm for recovering the consensus sequence can be summarized as follows:

1. Given the *n* × *m* encoded data matrix *A*, compute *μ<sub>A</sub>*.
2. For each *ν<sub>i</sub>* where *i* = 1, ··· , *L*, apply Equation (30).
3. Given the alphabet A, apply Equation (31) in order to construct the consensus sequence *s<sub>C</sub>*.
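The consensus recovery of Equations (28)-(31) can be sketched in a few lines of NumPy. The example sequences and the helper names `one_hot` and `consensus` are hypothetical; the alphabet order A, G, C, T follows Equation (7), and ties are broken in favor of the symbol that appears first in the alphabet, which is an arbitrary choice.

```python
import numpy as np

ALPHABET = "AGCT"  # same symbol order as Equation (7)


def one_hot(seq, alphabet=ALPHABET):
    """Standard encoding of a sequence into a kL-dimensional 0/1 vector."""
    k = len(alphabet)
    x = np.zeros(len(seq) * k)
    for pos, sym in enumerate(seq):
        x[pos * k + alphabet.index(sym)] = 1.0
    return x


def consensus(sequences, alphabet=ALPHABET):
    k = len(alphabet)
    A = np.column_stack([one_hot(s, alphabet) for s in sequences])  # n x m data matrix
    m = A.shape[1]
    mu = A @ np.ones(m) / m                  # Equation (28)
    nu = mu.reshape(-1, k)                   # row i holds the subvector nu_i
    J = nu.argmax(axis=1)                    # Equation (30), zero-based indices
    return "".join(alphabet[j] for j in J)   # Equation (31)


print(consensus(["TATAAT", "TATGAT", "TACAAT", "GATAAT"]))  # TATAAT
```

Because each subvector of *μ<sub>A</sub>* holds the empirical symbol frequencies at one position, taking the arg max per row is equivalent to picking the most frequent symbol in each column of the original *m* × *L* sequence matrix.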


#### **5.2 Position specific scoring matrix**

The consensus sequence, while qualitatively useful, is an incomplete sequence model in that it does not consider cases where two or more symbols in a given position are close to equiprobable. Under these circumstances, one is forced to arbitrarily choose one symbol for the consensus at the expense of losing information about the other symbols. In contrast, the position specific scoring matrix (PSSM) is a sequence model that considers the frequency of occurrence of all symbols in each position. Furthermore, the PSSM can be used to score and rank sequences of unknown function in order to quantify their similarity to the sequence model.

Given an *m* × *L* matrix of *m* related sequences of length *L* and an alphabet of *k* symbols, a *k* × *L* 'profile' matrix of empirical probabilities is first constructed by computing the symbol frequency for each position. The profile matrix can be thought of as the preimage of the PSSM. While it can provide important statistical details regarding the sequence model, it does not have the capability to score sequences in an additive fashion position by position. To do this requires converting the profile into a *k* × *L* PSSM of additive information scores. Given a sequence *s* of length *L*, the PSSM can then be used to compute a score for *s* in order to determine its relationship to the sequence model.

Recovering the PSSM from the vector space approach is straightforward. Given an *n* × *m* data matrix of encoded sequences, the *i*th subvector *ν<sub>i</sub>* in the average vector *μ<sub>A</sub>* computed from Equation (28) is equivalent to the *i*th column in the *k* × *L* profile matrix. Simply reshaping the *kL* × 1 vector *μ<sub>A</sub>* into a *k* × *L* matrix recovers the profile. However, since the goal is to score sequences of unknown function, we are more interested in showing how *μ<sub>A</sub>* can be applied to recover a PSSM score. Assume that the components of *μ<sub>A</sub>* have been transformed by applying the same information measure *I<sub>PSSM</sub>* used to convert the profile to the PSSM. Assuming an encoding alphabet with *k* symbols, a query sequence *s* of length *L* can be encoded to form a *kL* × 1 vector *x*. The PSSM score *S<sub>PSSM</sub>* of *x* can then be recovered via the inner product:

$$S\_{PSSM} = \mathbf{x}^T \mathcal{I}\_{PSSM}(\mu\_A) \tag{32}$$

where *I<sub>PSSM</sub>*(*μ<sub>A</sub>*) represents the conversion of a probability vector into a vector of additive information scores.

The similarity of Equation (32) with Equation (2) is worth noting. Assume several families of sequences of equal length *L* are encoded into separate data matrices *A<sub>i</sub>* where *i* = 1, ··· , *M* and *M* is the number of families. It should be clear that the relevance score for the query vector *x* can be produced using the cosine similarity according to

$$S\_{PSSM} = \mathbf{x}^T \mathcal{I}\_{PSSM} \tag{33}$$

where


$$\mathcal{I}\_{PSSM} = \left[ \mathcal{I}\_{PSSM}(\mu\_{A\_1}) \ \ \mathcal{I}\_{PSSM}(\mu\_{A\_2}) \ \cdots \ \mathcal{I}\_{PSSM}(\mu\_{A\_M}) \right] \tag{34}$$

is the *n* × *M* information matrix that describes the sequence families.
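The scoring of Equations (32)-(34) can be sketched as follows. The text does not fix a particular information measure, so this example assumes a common choice: a log2 odds ratio against a uniform background, with a pseudocount of one to avoid taking the logarithm of zero. The example families and the names `info_vector` and `one_hot` are likewise hypothetical.

```python
import numpy as np

ALPHABET = "AGCT"  # same symbol order as Equation (7)


def one_hot(seq, alphabet=ALPHABET):
    """Standard encoding of a sequence into a kL-dimensional 0/1 vector."""
    k = len(alphabet)
    x = np.zeros(len(seq) * k)
    for pos, sym in enumerate(seq):
        x[pos * k + alphabet.index(sym)] = 1.0
    return x


def info_vector(family, alphabet=ALPHABET, pseudocount=1.0):
    """Assumed information measure: log2(p / background) with a uniform background."""
    k = len(alphabet)
    A = np.column_stack([one_hot(s, alphabet) for s in family])  # n x m data matrix
    m = A.shape[1]
    counts = A @ np.ones(m)                          # equals m * mu_A, Equation (28)
    probs = (counts + pseudocount) / (m + k * pseudocount)
    return np.log2(probs * k)                        # background probability is 1/k


family_1 = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
family_2 = ["CCGCGG", "CCGCGC", "GCGCGG", "CCGGGG"]

w1 = info_vector(family_1)
x = one_hot("TATAAT")
s_pssm = x @ w1                                      # Equation (32)

I = np.column_stack([w1, info_vector(family_2)])     # Equation (34), an n x M matrix
scores = x @ I                                       # Equation (33), one score per family
pssm_1 = w1.reshape(-1, len(ALPHABET)).T             # k x L view of the first family's PSSM
print(s_pssm, scores)
```

Since *x* contains exactly one nonzero entry per position, the inner product adds one PSSM entry per position, which is the additive scoring described above. The query scores higher against the first family than against the second.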

It is of important theoretical interest that the vector space approach recovers both the PSSM and its information capacity to score sequences. However, it is more useful to observe that invoking an algebraic structure on a set of sequences induces a spectrum of novel possibilities. For instance, the SVD can be applied to the data matrix and a scoring scheme can be derived from the computed orthogonal basis. In addition, as mentioned at the end of Section 3, it is possible to weight both the data matrix and the encoded sequence according to more biologically significant measures such as hydrophobicity. Finally, and probably most importantly, the vector space formulation allows for powerful optimization techniques Golub & Van Loan (1989); Luenberger (1969) to be applied in order to maximize the scoring capacity of the sequence model.

Fig. 2. Histogram of the number of BLOCKS families as a function of sequence length. (Horizontal axis: sequence length, *L*; vertical axis: number of BLOCKS families.)

Fig. 3. Histogram of the number of BLOCKS families as a function of the number of sequences contained in each family. (Horizontal axis: number of sequences; vertical axis: number of BLOCKS families.)

#### **5.3 Clustering**

Our goal in this section is to investigate how clustering encoded sets of vectors will partition an existing set of data. While there are several approaches to performing data clustering Theodoridis & Koutroumbas (2003), we choose to invoke techniques that characterize the mean behavior of a data cluster. Specifically, we analyze one supervised method (Section 5.3.2) and one unsupervised method (Section 5.3.3). As we shall see, these approaches will enable us to construct 'fuzzy' regular expressions capable of algebraically describing the behavior of a given data set. It will become clear that this approach offers additional insight into sequence clustering techniques typically encountered in the literature Henikoff & Henikoff (1991); Smith et al. (1990). As the BLOCKS database Henikoff et al. (2000); Pietrokovski et al. (1996) has been constructed from sequence clusters using ungapped multiple alignment, we choose to apply this database as the template against which the vector space model is compared.

#### **5.3.1 The BLOCKS database**

The BLOCKS database consists of approximately 3000 protein families (or 'blocks'). Each family has a varying number of sequences that have been derived from ungapped alignments. Therefore, while sequence lengths between two different families may differ, sequences contained within each family, by the definition of a 'block', must all have the same length. Furthermore, the number of sequences in each family can vary and there can be a considerable degree of redundancy within some families; hence, it is sensible to analyze how the data is distributed with respect to each BLOCKS family.

The histogram in Figure 2 illustrates the number of BLOCKS families as a function of sequence length. For example, there are 90 families containing sequences of length *L* = 40. From this figure, we can conclude that it is generally possible to find at least 40 families at nominal sequence lengths. It is also important to characterize how the number of sequences contained within each family is distributed throughout the database. The histogram in Figure 3 illustrates the number of BLOCKS families as a function of the number of sequences contained within each family. From this figure, we observe that many families contain somewhere between 9 and 20 representative sequences. Finally, for the sake of clarity, we restrict our attention to sequences having the same length. The extension of these results to variable-length sequences is the subject of current research based upon existing methodologies cited in the literature Couto et al. (2007); T. Rodrigues (2004). The histogram in Figure 4 illustrates the number of BLOCKS families as a function of the number of sequences contained within each family; however, observe that this representative sample has been restricted to those families containing sequences of equal length (in this case *L* = 30). The behavior in this graph is typical in that most families contain on the order of 10-12 sequences of equal length. For the purposes of illustration and without loss of generality, we choose to demonstrate the techniques in the upcoming sections using families containing sequences of equal length.

#### **5.3.2 Centroid approach**

In this section, we cluster sequences whose BLOCKS classification is known a priori in order to algebraically characterize each family. To do this, each family in the analysis is encoded separately and Equation (28) is applied to each family data matrix in order to derive a family centroid. Since the families are already partitioned, this approach is a supervised clustering technique that will enable us to derive symbol contributions from the centroid vectors.
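As a minimal sketch of the centroid computation, assume that Equation (28) is the column mean of the family data matrix (this matches the "average vector" described in Section 5.1). The function name `family_centroid` and the toy matrix are illustrative only and are not taken from the chapter.

```python
import numpy as np

def family_centroid(A):
    """Return the mean of the columns of an n x m family data matrix A.

    Assumption: Equation (28) is the plain column average (1/m) * sum_j a_j.
    """
    A = np.asarray(A, dtype=float)
    return A.mean(axis=1)

# Toy family: three encoded sequences (columns) in a 4-dimensional space.
A_toy = np.array([[1.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 0.0, 0.0]])
mu = family_centroid(A_toy)
print(mu)
```

Applying the same function to every family matrix gives one centroid per family; stacking them as columns yields the centroid matrix used for classification later in this section.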


Fig. 4. Histogram of the number of BLOCKS families as a function of the number of sequences contained in each family (restricted to families with sequences of length *L* = 30). (Horizontal axis: number of sequences; vertical axis: number of BLOCKS families.)

Fig. 5. Histogram of between centroid distance. (Horizontal axis: between centroid distance, *L* = 30, 73 families; vertical axis: frequency.)

For this numerical experiment, we apply Table 5 as the encoding scheme and choose the BLOCKS family sequence length to be *L* = 30. Under these conditions, sequences will be encoded into column vectors of dimension *n* = (30)(11) = 330. In addition, all encoded data vectors are normalized to have unit magnitude.
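The encoding step can be sketched as follows. Table 5 is not reproduced here, so this example uses the standard 20-letter amino acid alphabet (*k* = 20, giving *n* = 600 for *L* = 30) and assumes that each position contributes a one-hot subvector of length *k*; a reduced alphabet such as the 11-symbol scheme would only change the `alphabet` argument. The function name and the test sequence are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard alphabet, k = 20

def encode_sequence(seq, alphabet=AMINO_ACIDS):
    """One-hot encode a non-empty sequence into a unit-norm vector of length L*k."""
    k = len(alphabet)
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    v = np.zeros(len(seq) * k)
    for pos, symbol in enumerate(seq):
        # Subvector number `pos` holds a single 1 at the symbol's index.
        v[pos * k + index[symbol]] = 1.0
    # Normalize to unit magnitude, as done for all encoded data vectors.
    return v / np.linalg.norm(v)

# Arbitrary illustrative sequence of length L = 30 (not a BLOCKS entry).
v = encode_sequence("ACDEFGHIKLMNPQRSTVWYACDEFGHIKL")
print(v.shape, np.linalg.norm(v))
```

With this encoding every non-zero component of a normalized vector equals 1/√*L*, so the inner product of two encoded sequences is the fraction of positions at which they agree.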

There are 73 families in the BLOCKS database that have block length *L* = 30. Furthermore, there are a total of 910 sequences distributed amongst the 73 families. As mentioned above, there is a small degree of sequence redundancy within some BLOCKS families. After removing redundant sequences, a total of *J* = 755 sequences of length *L* = 30 are distributed amongst *I* = 73 families. Given the encoding method, the dimensions of the non-redundant data matrix *A* will be 330 × 755.
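The redundancy removal described above can be sketched as follows, under the assumption that "redundant" means exact duplicate sequences within a family (the chapter does not state the precise criterion). The family names and sequences are made up for illustration.

```python
def remove_redundant(families):
    """Drop exact duplicate sequences within each family, preserving order.

    `families` maps a family name to a list of equal-length sequences.
    """
    return {name: list(dict.fromkeys(seqs)) for name, seqs in families.items()}

# Toy input: two hypothetical families with one duplicated sequence.
toy_families = {
    "family_a": ["ACDE", "ACDE", "ACDF"],
    "family_b": ["WYWY", "WYWA"],
}
nonredundant = remove_redundant(toy_families)
J = sum(len(seqs) for seqs in nonredundant.values())  # number of data vectors
I = len(nonredundant)                                  # number of families
print(I, J)
```

After this step the data matrix has one column per remaining sequence, which is how 910 sequences reduce to *J* = 755 columns in the experiment above.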

Figure 5 shows the results of computing the distance between all centroids. From this histogram, we observe that database families are fairly well-separated since the minimum distance between any two centroids is greater than 0.6.
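A sketch of the between-centroid distance computation follows. It assumes Euclidean distance, which the text does not state explicitly, and uses a small toy centroid matrix rather than the 73 BLOCKS centroids.

```python
import numpy as np

def pairwise_centroid_distances(M):
    """Return the I*(I-1)/2 Euclidean distances between the columns of M (n x I)."""
    M = np.asarray(M, dtype=float)
    diffs = M[:, :, None] - M[:, None, :]     # n x I x I array of differences
    D = np.sqrt((diffs ** 2).sum(axis=0))     # I x I distance matrix
    upper = np.triu_indices(M.shape[1], k=1)  # each unordered pair once
    return D[upper]

# Three toy unit-norm centroids in two dimensions.
M_toy = np.array([[1.0, 0.0, 0.6],
                  [0.0, 1.0, 0.8]])
dists = pairwise_centroid_distances(M_toy)
counts, edges = np.histogram(dists, bins=3)   # data behind a plot like Fig. 5
print(dists.min())
```

The minimum of `dists` is the quantity quoted above as being greater than 0.6 for the *L* = 30 families.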

In order to analyze the performance of the encoding method, we apply the inner product. Specifically, each data vector *vj* is classified by choosing the family associated with the centroid yielding the largest inner product:

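$$\mathcal{C}(v_j) \equiv \underset{i=1,\cdots,I}{\arg\max} \, \left( v_j^T \mathcal{M} \right)_i \tag{35}$$

where *j* = 1, ··· , *J*, the subscript *i* selects the inner product of *vj* with the *i*-th family centroid, and the *n* × *I* matrix of family centroids is

$$\mathcal{M} = \begin{bmatrix} \mu_{A_1} & \mu_{A_2} & \cdots & \mu_{A_I} \end{bmatrix} \tag{36}$$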
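Equations (35) and (36) amount to one matrix product followed by a row-wise arg max. The sketch below uses 0-based family indices (the equations use 1-based indices) and toy data; the function name `classify` is illustrative.

```python
import numpy as np

def classify(V, M):
    """Assign each column of V (n x J) to the centroid column of M (n x I)
    with the largest inner product. Returns J zero-based family indices."""
    scores = V.T @ M              # J x I matrix of inner products
    return np.argmax(scores, axis=1)

# Two toy centroids and two toy data vectors in two dimensions.
M_toy = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
V_toy = np.array([[0.9, 0.2],
                  [0.1, 0.8]])
labels = classify(V_toy, M_toy)
print(labels)
```

Comparing `labels` with the known family of each data vector gives the misclassification count reported for each encoding.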

For standard encoding (i.e. *k* = 20, *n* = 600), all 755 data vectors were classified correctly using Equation (35). On the other hand, when applying the encoding method in Table 5, there was one misclassification. Figure 6 illustrates that data vector number 431 (a member of family 30, 'HlyD family secretion proteins') was misclassified into family 54 ('Osteopontin proteins'). So, while the vector dimension is reduced from 600 to 330 (because *k* is reduced from 20 to 11), a minor cost in classification accuracy is incurred in exchange for a substantial reduction in dimensionality.

We note one final application of the centroid approach for deriving 'fuzzy' regular expressions extracted from the vector components of the centroid vectors. Consider the sum-normalized *i*-th family centroid

$$\mathcal{N}_{A_i} \equiv \left(\frac{1}{\sum_{j=1}^{n} (\mu_{A_i})_j}\right) \mu_{A_i}. \tag{37}$$

For each subvector associated with each sequence position in $\mathcal{N}_{A_i}$, it is then possible to write an expression describing the percentage contribution of each symbol to analytically characterize the *i*-th sequence family.
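One way to turn Equation (37) into a readable 'fuzzy' expression is sketched below. The bracketed output format and the per-position renormalization (so that the percentages at each position sum to 100) are assumptions made for illustration; the chapter does not prescribe a notation. The example uses a hypothetical two-letter alphabet to keep the output short.

```python
import numpy as np

def fuzzy_expression(mu, alphabet, threshold=0.0):
    """Describe each sequence position by the percentage contribution of each symbol.

    mu: centroid of length L*k; every position is assumed to have a non-zero sum.
    """
    k = len(alphabet)
    N = mu / mu.sum()                        # sum normalization, Equation (37)
    parts = []
    for sub in N.reshape(-1, k):             # one subvector per position
        share = 100.0 * sub / sub.sum()      # percentages within this position
        terms = [f"{a}:{p:.0f}%" for a, p in zip(alphabet, share) if p > threshold]
        parts.append("[" + " ".join(terms) + "]")
    return "".join(parts)

# Toy centroid for L = 2 positions over the alphabet "AC".
mu_toy = np.array([0.75, 0.25, 0.0, 1.0])
expr = fuzzy_expression(mu_toy, "AC")
print(expr)
```

Raising `threshold` suppresses rarely observed symbols, which gives a shorter expression at the cost of detail.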

#### **5.3.3 K-means approach**


In contrast to the supervised approach, we now wish to take all sequences of length *L* in the database and investigate how they are clustered when the unsupervised *K*-means algorithm is applied. When this algorithm is applied to small numbers of families (e.g. < 10), our results indicate that it accurately determines the sequence families for the encoding method presented. However, as the number of data vectors grows, the high dimensionality of the encoding method tends to obscure distances and, hence, can obscure the clusters. We briefly address this issue in the conclusions section of this chapter.
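For reference, a minimal version of Lloyd's *K*-means iteration on column vectors is sketched below. It is a generic textbook implementation with random initialization, not necessarily the variant used for the experiments reported here, and the toy data are illustrative.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Cluster the columns of X (n x m) into K groups.

    Returns (labels, centroids) with labels of shape (m,) and centroids n x K.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    # Initialize centroids with K distinct, randomly chosen data vectors.
    centroids = X[:, rng.choice(m, size=K, replace=False)].copy()
    labels = np.full(m, -1)
    for _ in range(n_iter):
        # Squared distances via ||x||^2 - 2 x.c + ||c||^2, shape m x K.
        d2 = ((X ** 2).sum(axis=0)[:, None]
              - 2.0 * X.T @ centroids
              + (centroids ** 2).sum(axis=0)[None, :])
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                            # assignments stable: converged
        labels = new_labels
        for c in range(K):
            members = X[:, labels == c]
            if members.shape[1] > 0:         # keep old centroid if cluster is empty
                centroids[:, c] = members.mean(axis=1)
    return labels, centroids

# Two well-separated toy groups of two vectors each.
X_toy = np.array([[0.0, 0.1, 5.0, 5.1],
                  [0.0, 0.1, 5.0, 5.1]])
labels, centroids = kmeans(X_toy, K=2)
print(labels)
```

Because the result depends on the initial centroids, it is common to run several seeds and keep the partition with the smallest within-cluster sum of squares.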

Table 7. Characterization of BLOCKS orthogonal complement for various sequence lengths and encodings

| Sequence Length | Encoding Method | *n* | *m* | D[*Q<sub>A</sub>*] | D[N(*Q<sub>A</sub><sup>T</sup>*)] |
|---|---|---|---|---|---|
| *L* = 15 | Standard | 300 | 949 | 286 | 14 |
| *L* = 15 | Table 5 | 165 | 949 | 165 | 0 |
| *L* = 15 | Table 3 | 30 | 936 | 30 | 0 |
| *L* = 30 | Standard | 600 | 785 | 576 | 13 |
| *L* = 30 | Table 5 | 330 | 785 | 330 | 0 |
| *L* = 30 | Table 3 | 60 | 774 | 60 | 0 |

#### **5.4 Database search and pattern classification**

We now come to what is arguably one of the most important applications in this chapter. In this section, we will apply the reliability and relevance measures summarized in Table 6 to perform BLOCKS database searches and pattern classification Bishop (2006); Hand et al. (2001).

#### **5.4.1 Characterization of BLOCKS orthogonal complement**

When constructing a database, it is critical to understand and analytically characterize the spectrum of objects *not* contained within the database. This task is easily achieved by considering the orthogonal complement. As a first step, we consider families with sequence lengths *L* = 15 (70 families) and *L* = 30 (73 families). Furthermore, we compare encodings from Table 3 and Table 5 with standard encoding. Specifically, for each encoding method, an *n* × *m* non-redundant data matrix *A* consisting of all data vectors from all families with sequence length *L* is constructed. The SVD is then applied to construct an orthogonal basis *Q<sub>A</sub>* for the column space of *A*. The rank *r* of *A* (*r* = D[*Q<sub>A</sub>*]) is then compared with the dimension of the null space of *Q<sub>A</sub><sup>T</sup>* (D[N(*Q<sub>A</sub><sup>T</sup>*)]), which is the orthogonal complement of the column space. Using this approach, it is then possible to assess the quantity *n* − D[*Q<sub>A</sub>*] to determine the size of the subspace left uncharacterized by the database. Table 7 summarizes the results. From this table, it is clear that, after redundant encoded vectors are removed, the BLOCKS database thoroughly spans the pattern space. Furthermore, the histogram in Figure 5 indicates that, while the sequence subspace is well represented, there is also a good degree of separation between the family classes.
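The rank computation described above can be sketched with NumPy's SVD. The tolerance used to decide which singular values count as non-zero is an assumption (it mirrors the common default of scaling machine epsilon by the matrix size and the largest singular value); the chapter does not state the threshold it used.

```python
import numpy as np

def subspace_dimensions(A, tol=None):
    """Return (Q_A, r, n - r) for an n x m data matrix A.

    Q_A is an orthonormal basis of the column space, r its dimension and
    n - r the dimension of the orthogonal complement.
    """
    A = np.asarray(A, dtype=float)
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    if tol is None:
        # Assumed threshold: eps * max(n, m) * largest singular value.
        tol = max(A.shape) * np.finfo(float).eps * (s[0] if s.size else 0.0)
    r = int((s > tol).sum())
    return U[:, :r], r, A.shape[0] - r

# Toy 4 x 3 matrix whose third column is the sum of the first two.
A_toy = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0]])
Q_A, r, complement_dim = subspace_dimensions(A_toy)
print(r, complement_dim)
```

A complement dimension of zero, as reported for the reduced encodings in Table 7, means that every vector in the encoded space lies in the span of the database.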

#### **5.4.2 Pattern classification**

Another important database characterization is to examine how the projection method classifies data vectors after the class subspace bases have been constructed using the SVD.


Fig. 6. Family classification of each data vector. (Horizontal axis: data vector index, *L* = 30; vertical axis: classification by family index.)

In a manner similar to Figure 6, we classify all encoded data vectors in order to determine their family membership by applying Equation (26). Figures 7 and 8 show results for the *L* = 15 and *L* = 30 cases, respectively. For the *L* = 15 case, as the vector space dimension decreases, more classification errors arise since a reduced encoding will result in more non-unique vectors. The *L* = 30 case leads to longer vectors; hence, it is more robust to reduced encodings.
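Equation (26) is defined earlier in the chapter and is not repeated here. The sketch below therefore assumes the usual subspace-projection form, in which the score for family *i* is cos(*φ<sub>i</sub>*) = ‖*Q<sub>i</sub><sup>T</sup>v*‖ / ‖*v*‖ with *Q<sub>i</sub>* an orthonormal basis of that family's data matrix obtained from the SVD. Function names, the tolerance and the toy data are illustrative.

```python
import numpy as np

def orthonormal_basis(A, tol=1e-10):
    """Orthonormal basis of the column space of A from the SVD."""
    U, s, _ = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    return U[:, s > tol]

def projection_cosine(Q, v):
    """Cosine of the angle between v and the subspace spanned by the columns of Q."""
    return np.linalg.norm(Q.T @ v) / np.linalg.norm(v)

def classify_by_projection(bases, v):
    """Return (zero-based index of the best family, array of cosines per family)."""
    cosines = np.array([projection_cosine(Q, v) for Q in bases])
    return int(np.argmax(cosines)), cosines

# Two toy families spanning complementary planes of a 4-dimensional space.
A0 = np.array([[1.0, 1.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
A1 = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
bases = [orthonormal_basis(A0), orthonormal_basis(A1)]
family, cosines = classify_by_projection(bases, np.array([1.0, 1.0, 0.1, 0.0]))
print(family, cosines)
```

Using the basis of the whole database instead of a single family in `projection_cosine` gives a value of 1 whenever the query lies entirely inside the subspace spanned by the database.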

Fig. 7. Family classification of each data vector. (Panels: standard encoding, volume/charge/hydro encoding and hydro encoding, all for *L* = 15; horizontal axis: vector index; vertical axis: family index.)

Fig. 8. Family classification of each data vector. (Panels: standard encoding, volume/charge/hydro encoding and hydro encoding, all for *L* = 30; horizontal axis: vector index; vertical axis: family index.)

Fig. 9. Family classification as a function of the number of positions randomized. (Panels: standard encoding and volume/charge/hydro encoding; horizontal axis: number of positions; vertical axis: family index.)

Fig. 10. Relevance differential as a function of the number of positions randomized. (Panels: standard encoding and volume/charge/hydro encoding; horizontal axis: number of positions; vertical axis: relevance difference.)

#### **5.4.3 BLOCKS database search**

In this section, we demonstrate how to perform database searches using the relevance and reliability equations summarized in Table 6. Database search examples have been reported using the BLOCKS database Henikoff & Henikoff (1994). In this work, we analyze the effect of randomly mutating sequences within the BLOCKS database in order to study family recognition as a function of sequence mutation. For the purposes of illustration, we consider a test sequence from the Enolase protein family (BL00164D) in order to examine relevancy and database reliability. For this test sequence with *L* = 15, amino acids are randomly changed where the number of positions mutated is gradually increased from 0 to 12. Furthermore, encodings from Table 3 are compared with standard encoding.

For this series of tests, the reliability always gives a value of cos(*φ*) = 1, implying that the randomization test did not result in a vector outside the subspace defined by the database. This corroborates conclusions drawn in Section 5.4.1. Figure 9 shows that the classification remains stable for both encodings until about 5-6 positions out of 15 have been mutated (the family index for the original test sequence is 10). In addition, the relevance can be summarized by computing the *difference* between the maximum value of cos(*φi*) and the second largest value. For the sake of illustration, if the BLOCKS family with index 10 does not yield the maximum projection, then the relevance difference is assigned a negative value. Figure 10 shows the results of this computation. In this test, we observe a consistent decrease in the relevance difference, indicating that secondary occurrences are gaining influence against the family class of the test sequence.
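The mutation experiment and the relevance difference can be sketched as follows. The test sequence is an arbitrary placeholder rather than the actual Enolase block entry, and the sign convention (negating the gap between the two largest cosines when the true family does not win) is one reading of the description above.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(seq, n_positions, rng):
    """Replace n_positions distinct, randomly chosen positions with a different amino acid."""
    seq = list(seq)
    for pos in rng.choice(len(seq), size=n_positions, replace=False):
        choices = [a for a in AMINO_ACIDS if a != seq[pos]]
        seq[pos] = choices[rng.integers(len(choices))]
    return "".join(seq)

def relevance_difference(cosines, true_family):
    """Gap between the largest and second largest cosine; negative when the
    largest cosine does not belong to true_family (zero-based index)."""
    order = np.argsort(cosines)[::-1]
    diff = cosines[order[0]] - cosines[order[1]]
    return diff if order[0] == true_family else -diff

rng = np.random.default_rng(0)
original = "ACDEFGHIKLMNPQR"          # placeholder sequence with L = 15
mutated = mutate(original, 5, rng)
n_changed = sum(a != b for a, b in zip(original, mutated))
gap = relevance_difference(np.array([0.2, 0.9, 0.6]), true_family=1)
print(mutated, n_changed, gap)
```

Repeating `mutate` for 0 to 12 positions and recording the classification and `relevance_difference` at each step produces curves of the kind summarized in Figures 9 and 10.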


Berry, M. W., Dumais, S. T. & O'Brien, G. W. (1995). Using linear algebra for intelligent information retrieval, *SIAM Rev.* 37: 573–595.

Beyer, K., Goldstein, J., Ramakrishnan, R. & Shaft, U. (1999). When is "nearest neighbor" meaningful?, *In Int. Conf. on Database Theory*, pp. 217–235.

Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*, Springer.

Bordo, D. & Argos, P. (1991). Suggestions for safe residue substitutions in site-directed mutagenesis, *Journal of Molecular Biology* 217: 721–729.

Bowie, J. U., Luthy, R. & Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure, *Science* 253: 164–170.

Couto, B. R. G. M., Ladeira, A. P. & Santos, M. A. (2007). Application of latent semantic indexing to evaluate the similarity of sets of sequences without multiple alignments character-by-character, *Genetics and Molecular Research* 6: 983–999.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. & Harshman, R. (1990). Indexing by latent semantic analysis, *Journal of the American Society for Information Science* 41: 391–407.

Dominich, S. (2010). *The Modern Algebra of Information Retrieval*, Springer.

Done, B. (2009). Gene function discovery using latent semantic indexing, *Wayne State University (Ph.D. Thesis)*.

Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. (2004). *Biological Sequence Analysis*, Cambridge University.

Eisenberg, D., Schwarz, E., Komaromy, M. & Wall, R. (1984). Analysis of membrane and surface protein sequences with the hydrophobic moment plot, *Journal of Molecular Biology* 179: 125–142.

Elden, L. (2004). *Matrix Methods in Data Mining and Pattern Recognition*, SIAM.

Feldman, R. & Sanger, J. (2007). *The Text Mining Handbook*, Cambridge.

Finn, R. D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J. E., Gavin, O., Gunesekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, E., Eddy, S. & Bateman, A. (2010). The Pfam protein families database, *Nucl. Acids Res.* 38: D211–222.

Golub, G. H. & Van Loan, C. F. (1989). *Matrix Computations*, Johns Hopkins University Press, Baltimore, MD.

Grossman, D. A. & Frieder, O. (2004). *Information Retrieval: Algorithms and Heuristics*, Springer.

Hand, D., Mannila, H. & Smyth, P. (2001). *Principles of Data Mining*, MIT Press.

Henikoff, J. G., Greene, E. A., Pietrokovski, S. & Henikoff, S. (2000). Increased coverage of protein families with the blocks database servers, *Nucl. Acids Res.* 28: 228–230.

Henikoff, S. & Henikoff, J. G. (1991). Automated assembly of protein blocks for database searching, *Nucleic Acids Research* 19: 6565–6572.

Henikoff, S. & Henikoff, J. G. (1994). Protein family classification based on searching a database of blocks, *Genomics* 19: 97–107.

Hinneburg, E., Aggarwal, C., Keim, D. A. & Hinneburg, A. (2000). What is the nearest neighbor in high dimensional spaces?, *In Proceedings of the 26th VLDB Conference*, pp. 506–515.

Houle, M., Kriegel, H., Kröger, P., Schubert, E. & Zimek, A. (2010). Can shared-neighbor distances defeat the curse of dimensionality?, *in* M. Gertz & B. Ludascher (eds), *Scientific and Statistical Database Management*, Vol. 6187 of *Lecture Notes in Computer Science*, Springer, pp. 482–500.

Khatri, P., Done, B., Rao, A., Done, A. & Draghici, S. (2005). A semantic analysis of the annotations of the human genome, *Bioinformatics* 21: 3416–3421.

#### **6. Conclusions**

This chapter has elaborated upon the application of information retrieval techniques to various computational approaches in bioinformatics such as sequence modeling, clustering, pattern classification and database searching. While extensions to multiple sequence alignment have been alluded to in the literature Couto et al. (2007); Stuart, Moffett & Baker (2002), there is a need to include and model gaps in the approaches proposed in this body of work. Extensions to the vector space methods outlined in this chapter might involve including a new symbol to represent a gap. Regardless of the symbol set employed, it is clear that the approach described can lead to sparse elements embedded in high dimensional vector spaces. While data sets of this kind can be potentially problematic Beyer et al. (1999); Hinneburg et al. (2000); Houle et al. (2010); Steinbach et al. (2003), subspace dimension reduction techniques are derivable from LSI approaches such as the SVD.

The IR techniques introduced above are readily applicable in any setting where bioinformatics data (sequence, structural, symbolic, etc) can be encoded. This work has focused primarily on amino acid sequence data; however, given existing structural encoding techniques Bowie et al. (1991); Zhang et al. (2010), future work might be directed toward vector space approaches to structural data. The methods outlined in this chapter allow for novel biologically meaningful weighting schemes, algebraic regular expressions, matrix factorizations for subspace reduction as well as numerical optimization techniques applicable to high dimensional vector spaces.

#### **7. Acknowledgements**

This work was made possible by funding from grant DHS 2008-ST-062-000011, "Increasing the Pipeline of STEM Majors among Minority Serving Institutions". The authors would like to thank David J. Schneider of the USDA-ARS for many helpful discussions.

#### **8. References**



Alter, O., Brown, P. O. & Botstein, D. (2000a). Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms, *PNAS* 100: 3351–3356.

Alter, O., Brown, P. O. & Botstein, D. (2000b). Singular value decomposition for genome-wide expression data processing and modeling, *PNAS* 97: 10101–10106.

Andorf, C. M., Dobbs, D. L. & Honavar, V. G. (2002). Discovering protein function classification rules from reduced alphabet representations of protein sequences, *Proceedings of the Fourth Conference on Computational Biology and Genome Informatics*, Durham, NC, pp. 1200–1206.

Bacardit, J., Stout, M., Hirst, J. D., Valencia, A., Smith, R. E. & Krasnogor, N. (2009). Automated alphabet reduction for protein datasets, *BMC Bioinformatics* 10(6).

Baldi, P. & Brunak, S. (1998). *Bioinformatics: The Machine Learning Approach*, MIT Press, Cambridge, MA.

Baxevanis, A. D. & Ouellette, B. F. (2005). *Bioinformatics: A practical guide to the analysis of genes and proteins*, Wiley.

Berry, M. W. & Browne, M. (2005). *Understanding Search Engines: Mathematical Modeling and Text Retrieval*, SIAM.

Berry, M. W., Drmac, Z. & Jessup, E. R. (1999). Matrices, vector spaces, and information retrieval, *SIAM Rev.* 41: 335–362.



Klie, S., Martens, L., Vizcaino, J. A., Cote, R., Jones, P., Apweiler, R., Hinneburg, A. & Hermjakob, H. (2008). Analyzing large-scale proteomics projects with latent semantic indexing, *Journal of Proteome Research* 7: 182–191.

Kuruvilla, F. G., Park, P. J. & Schreiber, S. L. (2004). Vector algebra in the analysis of genome-wide expression data, *Genome Biology* 3(3).

Kyte, J. & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein, *Journal of Molecular Biology* 157: 105–132.

Langville, A. N. & Meyer, C. D. (2006). *Google's PageRank and Beyond: The Science of Search Engine Rankings*, Princeton University Press.

Lay, D. C. (2005). *Linear Algebra and Its Applications*, Wiley.

Luenberger, D. G. (1969). *Optimization by Vector Space Methods*, Wiley.

Mount, D. W. (2004). *Bioinformatics: Sequence and Genomic Analysis*, Cold Spring Harbor Laboratory Press.

Oja, E. (1983). *Subspace Methods of Pattern Recognition*, Wiley, New York, NY.

Pietrokovski, S., Henikoff, J. G. & Henikoff, S. (1996). The blocks database - a system for protein classification, *Nucl. Acids Res.* 24: 197–200.

Salton, G. & Buckley, C. (1990). Improving retrieval performance by relevance feedback, *J. Amer. Soc. Info. Sci.* 41: 288–297.

Sigrist, C. J. A., Cerutti, L., de Castro, E., Langendijk-Genevaux, P. S., Bulliard, V., Bairoch, A. & Hulo, N. (2010). PROSITE: a protein domain database for functional characterization and annotation, *Nucl. Acids Res.* 38: D161–166.

Smith, H., Annau, T. & Chandrasegaran, S. (1990). Finding sequence motifs in groups of functionally related proteins, *PNAS* 87: 826–830.

Steinbach, M., Ertöz, L. & Kumar, V. (2003). The challenges of clustering high-dimensional data, *In New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition*, Springer-Verlag.

Stuart, G. W. & Berry, M. W. (2004). An SVD-based comparison of nine whole eukaryotic genomes supports a coelomate rather than ecdysozoan lineage, *BMC Bioinformatics* 5(204).

Stuart, G. W., Moffett, K. & Baker, S. (2002). Integrated gene and species phylogenies from unaligned whole genome protein sequences, *Bioinformatics* 18: 100–108.

Stuart, G. W., Moffett, K. & Leader, J. J. (2002). A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes, *Mol. Biol. Evol.* 19: 554–562.

T. Rodrigues, L. Pacífico, S. T. (2004). Clustering and artificial neural networks: Classification of variable lengths of Helminth antigens in set of domains, *Genetics and Molecular Biology* 27: 673–678.

Theodoridis, S. & Koutroumbas, K. (2003). *Pattern Recognition*, Elsevier.

Wall, M. E., Rechtsteiner, A. & Rocha, L. M. (2003). *Singular value decomposition and principal component analysis*, Kluwer, pp. 91–109.

Wang, J. T. L., Zaki, M. J., Toivonen, H. T. T. & Sasha, D. (eds) (2005). *Data Mining in Bioinformatics*, Springer.

Weiss, S. M., Indurkhya, N., Zhang, T. & Damerau, F. J. (2005). *Text Mining: Predictive Methods for Analyzing Unstructured Information*, Springer.

Wimley, W. C. & White, S. H. (1996). Experimentally determined hydrophobicity scale for proteins at membrane interfaces, *Nature Structural Biology* 3: 842–848.

Zhang, Z. H., Lee, H. K. & Mihalek, I. (2010). Reduced representation of protein structure: implications on efficiency and scope of detection of structural similarity, *BMC Bioinformatics* 11(155).

### **Massively Parallelized DNA Motif Search on FPGA**

Yasmeen Farouk<sup>1</sup>, Tarek ElDeeb<sup>2</sup> and Hossam Faheem<sup>1</sup> *<sup>1</sup>Faculty of Computer and Information Sciences, Ain Shams University, <sup>2</sup>Faculty of Engineering, Cairo University, Egypt*

#### **1. Introduction**

Understanding the mechanisms that regulate gene expression is a major challenge in biology, and motif finding is an important task within it. The computational complexity of the problem, together with its data-intensive nature, has encouraged the use of field programmable gate arrays (FPGAs), which are well suited to such computationally intensive tasks.

Many algorithms have been introduced to solve this problem. They can be categorized into pattern-based and profile-based algorithms [1]. Pattern-based algorithms include PROJECTION [4], MULTIPROFILER [6] and MITRA [3]. Profile-based algorithms include CONSENSUS [7], MEME [2] and Gibbs sampling [5]. Although these algorithms perform well, they can still fail to identify all the possible motifs in the sequences. They also perform poorly on the challenge problem presented by Pevzner and Sze [8]. Some fail because they rely on local search; others, which are based on statistical measures, fail to separate the motif from the background sequences.

Motif finding algorithms can also be categorized by the kind of solution they provide: some provide an exact solution, others an approximate one. The Brute Force algorithm is exact, but its running time is intractable: it increases exponentially with the length of the required motif, which makes Brute Force unsuitable for long motifs.

Our enhanced Brute Force algorithm, skip Brute Force, can predict the quality of the computed motif. The algorithm skips those iterations that would lead to a poorly scored motif, and thus achieves a better running time than the original Brute Force while guaranteeing the same exactness. However, it still suffers from an intractable running time for long motifs.

Many hardware-based approaches can be applied to speed up an algorithm; examples include chip multiprocessors, graphics processing units (GPUs) and FPGAs. GPUs are inexpensive, commodity parallel devices and have already been employed as powerful coprocessors for a large number of applications. However, GPUs have limited instructions and limited parallelism relative to the configurability of FPGAs. The research in [10] employed GPU acceleration. Another approach uses clusters of workstations [12]. However, clusters typically have high maintenance and energy costs when compared to single node solutions. Others use special hardware [9][11], where a cost performance ratio would be fairer for comparison [9].

The repetitive nature of the algorithm and the locality of the data encourage the use of FPGAs. Many operations can be done concurrently to improve the running time. FPGAs have been shown to accelerate sequential algorithms by at least one or two orders of magnitude. They have also been widely used to accelerate bioinformatics algorithms such as Smith-Waterman and BLAST. This research offers an enhanced Brute Force algorithm that is hardware accelerated using FPGAs. Our research achieves a speed up of 1.5MX, improving the running time without sacrificing accuracy.

The rest of this chapter is organized as follows. In Section 2 we describe the motif finding problem and present our enhanced Brute Force algorithm, skip Brute Force. Section 3 presents the hardware implementation of our approach with a detailed view of its components. Performance evaluation is presented in Section 4. Finally, Section 5 concludes our work and presents future enhancements.

#### **2. Skip brute force algorithm**

Brute-force search or exhaustive search, also known as generate and test, is a very general problem solving technique that consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem's statement.

The motif finding problem can be summarized as follows:

**Planted** *(l,d)***- Motif Problem:** Find the motif consensus *M* which is a fixed but unknown nucleotide sequence of length *l*. Suppose that *M* occurs once in each of *t* background sequences of common length *n*. Each occurrence of *M* is mutated by exactly *d* point substitutions in positions chosen independently at random. Given the *t* sequences, recover the motif occurrences and the consensus *M*.
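To make the problem statement concrete, the following is a minimal sketch of how a planted (*l*,*d*) instance could be generated for testing. The function name `plant_motif`, the `seed` parameter and the uniform background model are assumptions made for illustration; they are not taken from the chapter.

```python
import random

ALPHABET = "ACGT"

def plant_motif(l, d, t, n, seed=0):
    """Build a planted (l, d) instance: t sequences of length n, each holding
    one copy of a random motif mutated in exactly d positions.
    Hypothetical helper; uniform random background is an assumption."""
    rng = random.Random(seed)
    motif = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, positions = [], []
    for _ in range(t):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        occurrence = list(motif)
        # Substitute exactly d distinct positions with a different nucleotide.
        for i in rng.sample(range(l), d):
            occurrence[i] = rng.choice([c for c in ALPHABET if c != occurrence[i]])
        pos = rng.randrange(n - l + 1)
        background[pos:pos + l] = occurrence
        sequences.append("".join(background))
        positions.append(pos)
    return motif, sequences, positions
```

Calling `plant_motif(15, 4, 20, 600)` would produce an instance with the dimensions of the challenge problem discussed below.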

Pevzner and Sze [8] presented the challenge problem (15,4), which is a particular parameterization of the planted motif problem. The motif we are searching for is of length *l*=15, the number of allowed mutations is *d*=4, and the number of sequences we are searching in is *t*=20, each of size *n*=600. The parameters of the challenge problem are typical values for finding transcription factor binding sites in co-regulated gene promoter regions in yeast [4].

The Brute Force algorithm solves the motif finding problem by considering the set of all 4*<sup>l</sup>* possible *l*-mers. It computes the total distance of each *l*-mer in that set to the *l*-mers in all *t* sequences. The correct motif is the one that has the smallest total distance. The run time of this algorithm is O(4*<sup>l</sup> nt*). Finding a motif of length *l*=11 takes about 5 hours, and the algorithm fails to handle longer motifs in reasonable time. The Brute Force algorithm would therefore be far too slow to solve the challenge problem.
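The exhaustive search can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: it assumes that the "total distance" of a candidate is the sum, over all sequences, of the smallest Hamming distance between the candidate and any window of that sequence (the usual median-string formulation, which the text does not spell out). The function names are hypothetical.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def total_distance(lmer, sequences):
    """Sum over sequences of the best (smallest) window distance.
    Assumed scoring; see the lead-in."""
    l = len(lmer)
    return sum(
        min(hamming(lmer, seq[i:i + l]) for i in range(len(seq) - l + 1))
        for seq in sequences
    )

def brute_force_motif(sequences, l):
    """Enumerate all 4**l candidates and keep the one with the lowest score."""
    best, best_score = None, None
    for candidate in product("ACGT", repeat=l):
        lmer = "".join(candidate)
        score = total_distance(lmer, sequences)
        if best_score is None or score < best_score:
            best, best_score = lmer, score
    return best, best_score
```

The three nested loops (candidates, sequences, windows) are what give the O(4*<sup>l</sup> nt*) behaviour, so this sketch is only practical for very small *l*.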

The idea behind our skip Brute Force algorithm is that it skips all the iterations that will not lead to a correct solution. The algorithm generates all possible 4*<sup>l</sup> l*-mers. For each generated *l*-mer it iterates over all the sequences, comparing the *l*-mer with every window in each sequence. For each sequence iteration, the current score is initialized with the allowed mutation, and then the score of each window is computed, i.e. the Hamming distance between that window and the current *l*-mer. If this distance beats the current score, we suspect the current window to be an implanted motif until another window in the same sequence with a better score (fewer mutations) beats it. The algorithm is forced to skip over the remaining iterations in two cases, described next.


The planted motif problem guarantees that the motif occurs in each sequence. Based on this fact, the skip algorithm skips the iterations over the remaining sequences if it reaches the end of the current sequence without finding any window that matches the current *l*-mer (this *l*-mer cannot be the motif) and jumps to the next *l*-mer. Assuming a single solution, the algorithm also skips the iterations over the remaining *l*-mers if it reaches the last sequence (*t*=20) without skipping any iteration (the solution is found).

Pseudo code of the skip Brute Force algorithm is shown in Figure 1.

Fig. 1. Pseudo Code of the skip Brute Force Algorithm. If the commented break command is applied, then the algorithm becomes skip-more.
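The two skipping rules can be illustrated with the following Python sketch. It is written from the prose description and is not a transcription of Figure 1; the function name, the `skip_more` flag and the returned data layout are assumptions. The per-sequence score is initialized to *d*+1 here so that a window with exactly *d* mutations is still accepted.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def skip_brute_force(sequences, l, d, skip_more=False):
    """Return the first l-mer that occurs within d mutations in every
    sequence, with its best (position, distance) per sequence.
    Hypothetical sketch of the skipping idea, assuming a single solution."""
    for candidate in product("ACGT", repeat=l):
        lmer = "".join(candidate)
        occurrences = []
        for seq in sequences:
            best_dist, best_pos = d + 1, None
            for pos in range(len(seq) - l + 1):
                dist = hamming(lmer, seq[pos:pos + l])
                if dist < best_dist:
                    best_dist, best_pos = dist, pos
                    if skip_more:
                        break  # skip-more: accept the first window within d
            if best_pos is None:
                break  # no match in this sequence: skip remaining sequences
            occurrences.append((best_pos, best_dist))
        else:
            # Last sequence reached without skipping: skip remaining l-mers.
            return lmer, occurrences
    return None, []
```

With `skip_more=True` the inner loop stops at the first acceptable window, which is faster but may miss a better-scoring occurrence later in the same sequence.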

#### **2.1 Skip-more brute force**

In our early implementation of the skip algorithm, we did not consider scores for the motifs found. We forced the algorithm to skip the rest of the current sequence as soon as a single window was found whose number of mutations was within the allowed *d* (line 15). In this case the algorithm can fail to find the best motif, because later windows in the current sequence may reveal occurrences with fewer mutations. The complexity of this algorithm is O(4*<sup>l</sup> nt*) in the worst case, just as for the Brute Force.

#### **3. Hardware implementation of skip brute force**

Our design benefits from the concurrent nature of FPGAs as a hardware platform: control, multiplexing, matching and decision making all occur on the same clock edge. We used VHDL to model our design, preserving its extendibility to more complex challenge problems in the future. Figure 2 shows the system block diagram.

Fig. 2. Block diagram of the skip Brute Force - running on an FPGA with one matching unit.

All *t* sequences are first loaded into an on-chip read-only memory (ROM) as shown in Figure 3. In contrast, the set of all 4*<sup>l</sup> l*-mers is not stored but generated locally. Because each nucleotide is encoded as a 2-bit symbol, the 4*<sup>l</sup>* Motif Generator is a simple controlled binary counter. The shifter block is fed with the currently needed sequence and reveals only a sliding *l*-sized window of it at a time. The matching block compares the revealed window to the generated *l*-mer and outputs the Hamming distance as the mutation score. The logical control unit synchronizes the system so that it properly implements the skip Brute Force algorithm. More details are given in the following subsections.

Fig. 3. The ROM Block holding the challenging problem sequence.

#### **3.1 Sequence multiplexor**

The sequence multiplexor gets one sequence at a time. The Logical control issues the signal to the multiplexor to load the sequence from the ROM and feed the shifter.

#### **3.2 Sequence shifter**

The sequence shifter block has the following inputs: clk, reset and the sequence to be shifted. The shifter outputs an *l*-sized motif each clock cycle through a windowing approach.


The shifter outputs (*n*-*l*+1) windows for each sequence unless it is interrupted by a reset. Our skip Brute Force resets the shifter in one case: when the shifter has generated all the (*n*-*l*+1) *l*-mers for this sequence. The shifter is then reset and fed with a new sequence, from which it generates the newly suspected motifs (*l*-mers).

Fig. 4. Sequence Shifter block diagram.

The skip-more algorithm resets the shifter in two cases. The first case is the one previously explained. The second case happens when the matching unit finds an *l*-mer to be within the *d* allowed mutations. In this case the system resets the shifter, as the motif is considered to be found. The block diagram of the sequence shifter is shown in Figure 4.
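As a software analogue of the windowing approach, the sketch below yields one *l*-sized window per step, mirroring one window per clock cycle. The generator name `windows` is hypothetical, and the sketch models only the windowing, not the reset logic.

```python
def windows(sequence, l):
    """Yield the n - l + 1 overlapping l-sized windows of a sequence,
    one per iteration (a software stand-in for one per clock cycle)."""
    for start in range(len(sequence) - l + 1):
        yield sequence[start:start + l]
```

For the challenge problem (*n*=600, *l*=15) this gives 600 - 15 + 1 = 586 windows per sequence.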

#### **3.3 Motif generator**

The set of all 4*<sup>l</sup> l*-mers, from AA...A to TT...T, is not stored in the system. The four DNA nucleotides {A,C,G,T} are easily encoded into the 2-bit symbols 00, 01, 10 and 11, respectively. The system locally generates all the possible *l*-mers with a simple controlled binary counter of size 2*l* bits.

That is, in a system with *l*=3 we would like to generate AAA, AAC, AAG, AAT, ..., TTT. According to the encoding mentioned above, we would like to generate a series of 6-bit values as follows: 000000, 000001, 000010, 000011, ..., 111111. This series is exactly the output of a simple binary counter of size 2*l* bits.
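The correspondence between counter values and *l*-mers can be checked with a short sketch. The function names are hypothetical; the 2-bit encoding is the one given above.

```python
NUCLEOTIDES = "ACGT"  # index 0..3 corresponds to the codes 00, 01, 10, 11

def decode(counter, l):
    """Read a 2l-bit counter value as l nucleotides, most significant first."""
    return "".join(
        NUCLEOTIDES[(counter >> (2 * (l - 1 - i))) & 0b11] for i in range(l)
    )

def generate_lmers(l):
    """Enumerate all 4**l l-mers by counting from 0 to 4**l - 1."""
    for counter in range(4 ** l):
        yield decode(counter, l)
```

For *l*=3, the counter values 000000, 000001, 000010 and 000011 decode to AAA, AAC, AAG and AAT, and 111111 decodes to TTT.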

#### **3.4 Matching block**

The matching block consists of several sub-blocks: XOR units, an *l*-bit adder and a comparison block. The matching block takes two *l*-sized sequences and compares them. If the two sequences differ in at most *d* nucleotides, i.e. within the allowed mutation, it outputs a match signal.

The matching block uses a series of XOR gates to determine whether the nucleotides of the two *l*-mers are identical. The *l*-bit adder is used to count the differences between them. Finally, a comparison block is used to compare the value obtained from the adder with the *d* allowed mutations.

The matching block also outputs the score of the matching process. This score is used by the logical control to determine the quality of the motif obtained. The matching block diagram is shown in Figure 5. A detailed matching block diagram is shown in Figure 6.
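The XOR, count and compare steps can be mimicked on packed 2-bit encodings as follows. This is a software sketch under assumptions: the two XOR bits of each nucleotide are combined with an OR to obtain one mismatch bit per position, which the text implies but does not state, and all function names are hypothetical.

```python
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode(lmer):
    """Pack an l-mer into an integer, 2 bits per nucleotide."""
    value = 0
    for nucleotide in lmer:
        value = (value << 2) | ENCODING[nucleotide]
    return value

def mismatches(a, b, l):
    """Hamming distance between two packed l-mers."""
    x = a ^ b                      # 2l XOR outputs
    low_bits = int("01" * l, 2)    # selects one bit per nucleotide
    diff = (x | (x >> 1)) & low_bits   # 1 where either bit of a pair differs
    return bin(diff).count("1")    # role of the l-bit adder

def match(a, b, l, d):
    """Match signal and score, as produced by the comparison block."""
    score = mismatches(a, b, l)
    return score <= d, score
```

The 2*l* XOR outputs correspond to the XOR units being double the size of the motif, as noted in the caption of Figure 6.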

Fig. 5. Matching block diagram.

Fig. 6. Matching block components - XOR units are twice the size of the motif.

#### **3.5 Adder tree**

Our design is meant to be extensible by instantiating more matching units. Thus, the circuit implementation of the matching unit has to be highly optimized. Classical Hamming distance circuits start with an array of XOR gates to determine matching nucleotides, followed by *l* sequential adders to compute the required distance. This approach leads to long circuit delays that lower the maximum system frequency and degrade performance.

Our design replaces the sequential adders with a specially designed adder tree. For the (15, 4) problem, the proposed design shortens the critical path from fifteen 4-bit adders to only four full adders and two half adders. Figure 7 shows the optimized adder tree.

The *l*-bit adder takes a pattern of *l* bits, counts the ones in this pattern and outputs the count in ⌈log<sub>2</sub>(*l*+1)⌉ bits. For *l*=15, the adder accepts a 15-bit input signal and outputs a 4-bit signal. A 15-bit input needs five full adders; this is stage 0, which outputs 5 sum signals and 5 carry signals. Stage 1 needs 1 full adder and 1 half adder for the sum signals and the same for the carry signals. Accordingly, stage 2 needs only 4 half adders, stage 3 needs 2 full adders and stage 4 needs 1 half adder. The final stage needs 1 full adder.

Fig. 7. The six-stage adder tree - The critical path involves 4 full adders and 2 half adders.
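The idea of counting ones with full and half adders can be illustrated in software. The sketch below is a generic column-compression counter, assuming only the standard full-adder and half-adder truth tables; it does not reproduce the exact six-stage layout of Figure 7, and the function names are hypothetical.

```python
def full_adder(a, b, c):
    """Compress three bits of equal weight into (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def half_adder(a, b):
    """Compress two bits of equal weight into (sum, carry)."""
    return a ^ b, a & b


def count_ones(bits):
    """Count the ones in a list of 0/1 values using only adders.

    Returns (count, full_adders_used, half_adders_used).
    """
    columns = [list(bits)]          # columns[w] holds bits of weight 2**w
    n_full = n_half = 0
    w = 0
    while w < len(columns):
        col = columns[w]
        while len(col) > 1:
            if w + 1 == len(columns):
                columns.append([])  # open the next, heavier column
            if len(col) >= 3:
                s, c = full_adder(col.pop(), col.pop(), col.pop())
                n_full += 1
            else:
                s, c = half_adder(col.pop(), col.pop())
                n_half += 1
            col.append(s)           # the sum bit keeps its weight
            columns[w + 1].append(c)  # the carry moves one weight up
        w += 1
    # Each column now holds at most one bit: the output bit of that weight.
    count = sum(col[0] << w for w, col in enumerate(columns) if col)
    return count, n_full, n_half
```

Every adder preserves the weighted sum of its inputs, so the final bits are the binary representation of the number of ones. A hardware tree such as the one in Figure 7 arranges the same kind of compressors in parallel stages to keep the critical path short.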

#### **3.6 Logical control**

The system is managed by the logical control. Reset signals are issued to the motif generator and to the sequence shifter to control the flow of the sequences to be compared. As explained earlier, the logical control issues these signals under certain events. The logical control also outputs the best motif, which is determined by the scoring function.

#### **3.7 Multiple matching units**

Scaling up the design by using more matching units in parallel speeds up the overall performance by nearly the number of units used. Slight modifications and some logic duplication are needed for proper functionality and synchronization. The only limiting factor to the performance boost is the available FPGA resources.

| *l* | *d* | Expected matching operations |
|---|---|---|
| 9 | 2 | 7.7699 × 10<sup>7</sup> |
| 11 | 3 | 1.2388 × 10<sup>9</sup> |
| 12 | 3 | 4.9428 × 10<sup>9</sup> |
| 13 | 4 | 1.9750 × 10<sup>10</sup> |
| 14 | 4 | 7.8813 × 10<sup>10</sup> |
| 15 | 4 | 3.1464 × 10<sup>11</sup> |
| 17 | 5 | 5.0170 × 10<sup>12</sup> |

Table 1. Expected matching operations for different (*l*,*d*) problems.

Figure 8 shows the block diagram of skip Brute Force running on an FPGA with multiple matching units. As in the previous architecture, all *t* sequences are loaded into an on-chip read-only memory (ROM). The sequence multiplexor feeds *t* parallel series, each consisting of a sequence shifter followed by a matching unit. Each matching unit takes two *l*-sized sequences, one from its shifter and the other from the logical unit, which contains the motif generator. The outputs of the matching units are ANDed to determine whether a solution has been found. The number of series equals *t*, the number of examined sequences. In the previous architecture, the system has to loop over all the sequences for each generated motif, which corresponds to *n*·*t*·4<sup>*l*</sup> loops. In this enhanced architecture, the system loops only *n*·4<sup>*l*</sup> times.

Fig. 8. Block diagram of the skip Brute Force - running on an FPGA with multiple matching units.
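The search that this architecture accelerates can be sketched sequentially in software. The code below is plain brute force over all 4<sup>*l*</sup> candidate motifs, without the skip rule, and the function names are hypothetical. The `all(...)` call stands in for the AND of the matching unit outputs; in hardware the per-sequence checks run in parallel, whereas here they run one after another.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif, sequence, d):
    """True if some window of the sequence is within d mismatches."""
    l = len(motif)
    return any(
        hamming(motif, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def brute_force_motifs(sequences, l, d):
    """Return every l-mer found in all sequences within d mismatches."""
    found = []
    for candidate in product("ACGT", repeat=l):   # 4**l candidates
        motif = "".join(candidate)
        # AND over the per-sequence match signals.
        if all(occurs_within(motif, s, d) for s in sequences):
            found.append(motif)
    return found
```

With the toy input `["TTACGTT", "GGACGAG", "CACGTCC"]`, `l=4` and `d=1`, the motif `ACGT` is among the results; with `d=0` it is not, because the second sequence only contains the mutated copy `ACGA`.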

#### **4. Performance evaluation and results**

We tested the performance of the Brute Force algorithm and skip Brute Force on synthetic problem instances generated according to the planted (*l*,*d*)-motif model. We followed the FM model described by Pevzner and Sze [8] to generate the synthetic data. We produced problem instances as follows:

First, a motif consensus M of length *l* is chosen by picking *l* bases at random. Second, *t*= 20 occurrences of the motif are created by randomly choosing *d* positions per occurrence (without replacement) and mutating the base at each chosen position to a different, randomly chosen base. Third, we construct *t* background sequences of length *n*=600 using *n*\**t* bases chosen at random. Finally, we assign each motif occurrence to a random position in a background sequence, one occurrence per sequence. All random choices are made uniformly and independently with equal base frequencies.
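The generation procedure above can be sketched as follows. This is an illustrative reading of the four steps, not the authors' Matlab code: the function name, the seed parameter and the choice to overwrite the background at the planted position (so that each sequence keeps length *n*) are assumptions.

```python
import random

BASES = "ACGT"


def planted_instance(l, d, t=20, n=600, seed=None):
    """Generate one planted (l, d)-motif instance.

    Returns (consensus, sequences, positions).
    """
    rng = random.Random(seed)
    # Step 1: random consensus of length l.
    consensus = [rng.choice(BASES) for _ in range(l)]
    sequences, positions = [], []
    for _ in range(t):
        # Step 2: mutate exactly d distinct positions to a different base.
        occurrence = list(consensus)
        for pos in rng.sample(range(l), d):
            others = [b for b in BASES if b != occurrence[pos]]
            occurrence[pos] = rng.choice(others)
        # Step 3: random background sequence of length n.
        background = [rng.choice(BASES) for _ in range(n)]
        # Step 4: plant the occurrence at a random position
        # (assumption: the occurrence overwrites the background).
        start = rng.randrange(n - l + 1)
        background[start:start + l] = occurrence
        sequences.append("".join(background))
        positions.append(start)
    return "".join(consensus), sequences, positions
```

Calling `planted_instance(15, 4, seed=1)` yields one instance of the (15,4) challenge problem with 20 sequences of length 600.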

The skip Brute Force achieves an average speedup of 9.11X over Brute Force. Both algorithms were modelled and implemented in Matlab R2006b [15]. All the experiments ran on an AMD 5500 X2+ processor with 2GB RAM. For a fair comparison, note that the literature reports the Matlab platform to be about 5 to 6 times slower than an optimized C program.

To evaluate the hardware implementation, we need to define the expected number of matching operations. First, we define the probability of finding a random *l*-mer in a given sequence with up to *d* mutations as:

$$P_d = \sum_{i=0}^{d} \binom{l}{i} \left(\frac{3}{4}\right)^{i} \left(\frac{1}{4}\right)^{l-i}$$

Additionally, we define the expected number of required matching operations to find the correct implanted motif as:

$$E(l,d) = \frac{4^l}{2}(n-l+1)\left(1+\sum_{i=1}^{t-1}P_d^{i}\right)$$

For a problem of size *n*=600 and *t*=20, the expected numbers of matching operations are shown in Table 1.
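The two formulas above are straightforward to evaluate numerically. The sketch below is a direct transcription of them; the function names are hypothetical, and the defaults *n*=600 and *t*=20 are the values used in the text. The printed values should land close to the entries of Table 1.

```python
from math import comb


def p_d(l, d):
    """Probability that a random l-mer is within d mismatches."""
    return sum(comb(l, i) * 0.75 ** i * 0.25 ** (l - i) for i in range(d + 1))


def expected_matches(l, d, n=600, t=20):
    """Expected number of matching operations E(l, d)."""
    p = p_d(l, d)
    geometric_part = 1 + sum(p ** i for i in range(1, t))
    return 4 ** l / 2 * (n - l + 1) * geometric_part


if __name__ == "__main__":
    problems = [(9, 2), (11, 3), (12, 3), (13, 4), (14, 4), (15, 4), (17, 5)]
    for l, d in problems:
        print(f"({l},{d}): {expected_matches(l, d):.4e}")
```

Because *P<sub>d</sub>* is small for these problem sizes, the sum contributes very little, and *E*(*l*,*d*) is dominated by the factor 4<sup>*l*</sup>/2 · (*n*−*l*+1), which grows exponentially with the motif length.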

We synthesized our design for multiple matching units (MUs). The synthesis results for one, five, ten and twenty matching units warrant a closer look. Figure 9 shows the area utilization of the FPGA, which increases almost linearly with the number of MUs.

Fig. 9. FPGA area utilization - increases almost linearly.

The multi-MU design is inherently parallel, which means the system critical path remains the same even as the number of MUs increases. Unfortunately, the maximum system frequency still decreases as MUs are added. This is due to the increased complexity of the FPGA interconnects. Over 80% of the transistors inside an FPGA are dedicated to the programmable routing network, as programmable switches and buffers. The increased complexity of the interconnects leads to FPGA resource starvation.

Fig. 10. Maximum system frequency - decreases due to interconnects complexity.

Furthermore, it is well known that interconnects in an FPGA dominate system performance and power consumption.

Depending on the architecture, 60% to 80% of the FPGA critical path delay is due to the routing between logic blocks. Long interconnects exhibit substantial delay, often lead to timing violations and require further optimization. A recent study [13] found that FPGA interconnects scale poorly. Extrapolating future device performance, that study projects that interconnects will become the performance bottleneck, slowing the clock rate to 17 MHz in a 13 nm process. Figure 10 shows the degradation in the maximum system frequency as the number of matching units increases.

We define the system throughput as the number of matching operations per second. Figure 11 shows the throughput curve. The throughput increases with the number of MUs. The curve tends to be linear, but the degradation in the maximum frequency disturbs this linearity.

Fig. 11. System throughput - increases almost linearly.

Figure 12 compares the running times of Brute Force, skip Brute Force, and skip Brute Force running on an FPGA with one matching unit and with 20 matching units, for different challenge problems. Brute Force has the longest running time in all challenge problems. Our skip Brute Force algorithm running on an FPGA has the best running time.

Fig. 12. Running time of various challenge problems - skip Brute Force running on an FPGA-based architecture with 20 matching units has the fastest running time.

Utilizing one matching unit gives a speedup of 9800X over the pure software running time of skip Brute Force. Scaling up the design with more matching units in parallel speeds up the overall performance by nearly the number of units used. We used 20 matching units and achieved a speedup factor of 16.88X over one matching unit.

Thus, applying skip Brute Force (9.11X) on 20 matching units (16.88X) running on an FPGA-based architecture (9800X) offers a boost of about 1.5MX in performance.

Figure 13 illustrates these observations.

Fig. 13. Speedup factors of our accelerating designs - Total speedup is 1.5MX.
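As a quick check of the arithmetic, and assuming the three reported factors are independent and compose multiplicatively (which is how the text combines them), the total speedup works out as:

$$9.11 \times 16.88 \times 9800 \approx 1.507 \times 10^{6}$$

The same figures give the scaling efficiency of the 20-unit design relative to ideal linear scaling:

$$\frac{16.88}{20} \approx 0.84$$

The shortfall from 1.0 is consistent with the drop in maximum frequency shown in Figure 10.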

RTL synthesis and place and route were accomplished using the Quartus tool on the Stratix III FPGA technology, a product from Altera [14]. The skip Brute Force FPGA design does not use any of the FPGA memory blocks. The PowerPlay tool reported a total power consumption of 400mW.

#### **5. Conclusion and future work**

This chapter presents a proof-of-concept parallelization of motif finding on FPGA to achieve high performance at low cost. Among all motif finding algorithms, Brute Force is known to be the most accurate, mainly because it searches the space of all possible motifs. The major drawback of Brute Force is the intractability of its running time, which grows exponentially with the length of the motif. This makes Brute Force unsuitable for long motifs; the algorithm cannot solve the (15,4) challenge problem in a reasonable time.

To find the correct solution to the planted motif problem, we have to overcome two main problems. We have to be able to distinguish the motif from the background sequences by applying an exact algorithm such as Brute Force, which is guaranteed to always find the correct motif. We also have to overcome its running time and memory complexities through acceleration, both by enhancing the algorithm itself and by hardware implementation. The research presented here addresses these two issues.

We presented an enhanced Brute Force algorithm, skip Brute Force, which can predict the quality of the obtained motif. The algorithm skips those iterations that would lead to a poorly scored motif, and thus achieves a better running time. This enhancement preserves the exactness of Brute Force. Our enhanced algorithm showed an average speedup factor of 9.11X.

The repetitive nature of the algorithm and the locality of the data encourage the use of FPGAs. Many operations can be done concurrently to improve the running time. FPGAs have been shown to accelerate sequential algorithms by at least one or two orders of magnitude. They have also been widely used to accelerate bioinformatics algorithms such as Smith-Waterman and BLAST. This research offers an enhanced Brute Force algorithm that is hardware accelerated using Field Programmable Gate Arrays (FPGAs).

We designed an FPGA-based architecture to accelerate our skip Brute Force algorithm. The core of the skip Brute Force algorithm is its matching unit. Utilizing one matching unit gives a speedup of 9800X over the pure software running time of skip Brute Force. Scaling up the design with more matching units in parallel speeds up the overall performance by nearly the number of units used. We used 20 matching units and achieved a speedup factor of 16.88X over one matching unit.

Thus, applying skip Brute Force (9.11X) on 20 matching units (16.88X) running on an FPGA-based architecture (9800X) offers a boost of about 1.5MX in performance.

Clearly, the largest share of the performance boost (9800X) is achieved by moving the algorithm to the FPGA, rather than by enhancing the Brute Force algorithm or by applying more matching units.

Many motif finding algorithms achieve a better running time at the expense of the accuracy of the motif obtained. We succeeded in accelerating the motif finding problem without sacrificing accuracy by applying an exact algorithm, skip Brute Force.

Our work can be extended to accelerate other motif finding algorithms that have shown better performance in solving the motif finding problem. Algorithms such as Projection [4] and MEME [2] have proved to have high accuracy and much better running time. Introducing hardware acceleration to these algorithms would further improve their running time.

An embedded processor can be added on the FPGA to run the algorithm on chip. This approach would eliminate the communication overhead, which is the bottleneck in most hardware-software co-designs.

Furthermore, our approach can be applied to other biological applications. One of the most important problems in biological research is the prediction of the tertiary structure of a protein from amino acid information. This is particularly important in the context of designer proteins in the area of drug discovery. Graph analysis of biological networks is also computationally intensive.

#### **6. References**

[1] Rajasekaran, S., Balla, S. and Huang, C.H.: Exact algorithm for planted motif challenge problems. Proceedings of Asia-Pacific Bioinformatics Conference, 249–259 (2005)

[2] Bailey, T.L., Williams, N., Misleh, C., Li, W.W.: MEME: discovering and analyzing DNA and protein motifs. Nucleic Acids Research 34, W369–W373 (2006)

[3] Brazma, A., Jonassen, I., Vilo, J., Ukkonen, E.: Predicting gene regulatory elements in silico on a genomic scale. Genome Research 15, 1202–1215 (1998)

[4] Buhler, J., Tompa, M.: Finding motifs using random projections. J. Comput. Biol. 9, 225–242 (2002)

[5] Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., Wootton, J.C.: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208–214 (1993)

[6] Eskin, E., Pevzner, P.: Finding composite regulatory patterns in DNA sequences. Bioinformatics S1, 354–363 (2002)

[7] Hertz, G., Stormo, G.: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15(7-8), 563–577 (1999)

[8] Pevzner, P., Sze, S.-H.: Combinatorial approaches to finding subtle signals in DNA sequences. Proc. 8th Int. Conf. Intelligent Systems for Molecular Biology, 269–278 (2000)

[9] Schröder, J., Wienbrandt, L., Pfeiffer, G., Schimmler, M.: Massively Parallelized DNA Motif Search on the Reconfigurable Hardware Platform COPACOBANA. PRIB 2008, LNBI 5265, 436–447 (2008)

[10] Chen, C., Schmidt, B., Weiguo, L., Müller-Wittig, W.: GPU-MEME: Using Graphics Hardware to Accelerate Motif Finding in DNA Sequences. PRIB 2008, LNBI 5265, 448–459 (2008)

[11] Sandve, G.K., Nedland, M., Syrstad, B., Eidsheim, L.A., Abul, O., Drabløs, F.: Accelerating motif discovery: Motif matching on parallel hardware. In: Bücher, P., Moret, B.M.E. (eds.) WABI 2006. LNCS (LNBI), vol. 4175, pp. 197–206. Springer, Heidelberg (2006)

[12] Grundy, W.N., Bailey, T.L., Elkan, C.P.: ParaMEME: A parallel implementation and a web interface for a DNA and protein motif discovery tool. Computer Applications in the Biological Sciences (CABIOS) 12, 303–310 (1996)

[13] Mak, T.: The Future Looks Gloomy for FPGA Interconnects. Technical Report Series NCL-EECE-MSD-TR-2009-145 (2009)

[14] Altera Inc., http://www.altera.com/

### **A Pattern Search Method for Discovering Conserved Motifs in Bioactive Peptide Families**

Feng Liu1, Liliane Schoofs2, Geert Baggerman2, Geert Wets1 and Marleen Lindemans2

*1Data Analysis & Modeling Group, Transportation Research Institute, Hasselt University 2Functional Genomics and Proteomics, Department of Biology, K.U. Leuven Belgium* 

#### **1. Introduction**

Bioactive peptides play critical roles in regulating most biological processes in animals, and they have considerable biological, medical and industrial importance. Peptides belonging to the same family are often characterized by a typical short sequence motif (pattern) that is highly conserved functionally among the family members. In this chapter, we design a pattern search method to facilitate the detection of such conserved motifs. First, all known bioactive peptides annotated in Uniprot are collected and classified, and the program Pratt is used to search these unaligned peptide sequences in each family for conserved patterns. The obtained patterns are then refined by taking into account the information on amino acids at important functional sites collected from literature, and are further tested by scanning them against all the Uniprot proteins. The diagnostic power of the patterns is demonstrated by the fact that, while the number of false positives is kept at zero to ensure that the signatures are exclusive to peptides and their precursors, nearly 94% of all known peptide family members accommodate one or several of the identified patterns.

In total, we brought to light 155 novel peptide patterns in addition to the 56 established ones in the PROSITE database. Together, the patterns represent 110 peptide families, of which 55 are not characterized by PROSITE and 12 are also not covered by other existing motif databases, such as Pfam. Using the newly uncovered peptide patterns as a search tool, we predicted 95 hypothetical proteins as putative peptides or peptide precursors.

#### **2. Problem statement and background**

Whole genome sequencing projects have made available immense amounts of sequence data at a pace that far exceeds their rate of annotation. As a result, out of the 1.7 million protein sequences currently available for all the completely sequenced metazoan genomes, nearly 15% could not be assigned any putative function. Although several tools and algorithms are available to help with the putative functional assignment of proteins, large numbers of proteins remain uncharacterized. In most cases this is due to low sequence similarity with known proteins; alternatively, the existing similarities can be confined to only very small part(s) of the entire protein. The latter is especially true for precursor proteins coding for bioactive peptides. Consequently, there is still a need for

A Pattern Search Method for Discovering Conserved Motifs in Bioactive Peptide Families 123

Like BLAST, motif search methods are important tools to search for a protein in a database, nevertheless, they are also limited to detect all members from a characterized peptide family. Most of the motifs in the existing databases, e.g. PROSITE (Hulo et al., 2004) and Pfam (Finn et al, 2010), cover the entire precursor sequences or sequence domains which are much longer than the conserved bioactive peptide regions. Therefore, the database motifs show their weakness when they are used to detect short mature peptides for which the precursors are unknown and the information on the sequences outside the peptide regions is thus missing. In addition, the construction of these motifs requires a good multiple protein sequence alignment in order to produce an accurate signature. This works well when the sequences are easy to align. However, for some peptide families for which the conserved regions are very short and the bulk of peptide precursor sequences is not very well preserved, the multiple alignment is very difficult to obtain or evaluate. The overall precursor protein sequence identity, especially in distantly related homologues, may be too low for an accurate alignment. In some cases, the short conserved regions are repeated within a precursor, making it even more challenging to build a unique alignment that truly

In this chapter, we have followed an alternative approach, taking unaligned sequences as a starting point. We then used a pattern search program to look for conserved patterns. We first collected all currently annotated peptides and peptide precursor proteins in Metazoa through a search in Uniprot and classified them into peptide families. Next, we extracted peptide sequences in each family and used the program Pratt to search the sequences for representative patterns. Such patterns consist of highly conserved positions that can be separated by fixed or variable spacing. The patterns are then refined by incorporating the information that is available in literature on the important amino acids contained within the biologically active site(s) of the peptides. The specificity of the generated patterns are further verified by scanning them against Uniprot in order to ascertain that proteins picked up by the patterns are either annotated as peptides or peptide precursor proteins or have an

A protein was collected into a peptide-precursor database if it is annotated in the Uniprot protein database (release 6.6) consisting of Swiss-Prot (release 48.6) and TrEMBL (release 31.6) with one of the following keywords: hormone, antimicrobial, toxin. The hormone includes bombesin, bradykinin, cytokine, glucagon, growth factor, hormone, hypotensive agent, insulin, neuropeptide, neurotransmitter, opioid peptide, pyrokinin, tachykinin, thyroid hormone, vasoactive, vasoconstrictor and vasodilator (the definition of the keywords can be referred to in this database). The antimicrobial consists of antibiotic, antiviral defense, defensin and fungicide; while the toxin includes naturally produced and secreted poisonous proteins that damage or kill other cells. However, when the protein is also characterized by non-peptide keywords, such as receptor, signal-anchor, transmembrane, binding protein, DNA binding, nuclear protein, transport, collagen, enzyme or words ending in 'ase' (excluding 'disease'), it is excluded, in order to avoid the

Stand-alone PSI-BLAST (ftp://ftp.ncbi.nih.gov/blast/executables/) is then used to align all the assembled sequences with all the Uniprot proteins except the ones which are already in

reflects the evolutionary relationship.

**3.1 Peptide precursor collection and classification** 

selection of proteins which are not peptides or peptide precursors.

unknown function.

**3. Data collection** 

bioinformatic tools to predict the function of the enormously large number of the unknown protein sequences.

Bioactive peptides occur throughout the animal kingdom, from the least evolved phyla to the highest vertebrates (Filipsson et al., 2001; Masashi et al., 2001). They play key roles as signaling molecules in many, if not all, physiological processes, for instance as peptidergic neurotransmitters or neurohormones, as peptidergic toxins, or as growth factors (Boonen et al., 2007; Boonen et al., 2010). They are synthesized in the cell in the form of large preproproteins (precursors), which are a special class of proteins as they undergo extensive post-translational processing before the final mature bioactive peptides are produced (Schoofs & Baggerman, 2003). Peptides and their precursors that are structurally and functionally related have been classified into peptide families; each family of proteins is assumed to be derived from a common ancestor (Husson et al., 2009). During evolution, the protein sequences may have diverged considerably, but the essential amino acids involved in the biologically important activities are still present. These conserved amino acids, along with their particular sequential order, form the functional foundation and represent the motif (pattern) of a peptide family.

However, over the course of natural adaptation, different peptide families have diverged at different rates. While for some peptide families the similarity extends over a much longer region, or even over the entire peptide precursor sequence, for many others a short, highly conserved motif is responsible for the function of the precursor proteins throughout the family, and the sequence fragments outside the conserved regions often display no significant similarity (Baggerman et al., 2005). This pattern of conservation is further illustrated by the many short but biologically important functional peptides released from known large precursors as annotated in Uniprot, such as the 3-amino-acid thyroliberin peptide 'QHPamide' (Vandenborne et al., 2005) and the 4-amino-acid neuropeptide 'FMRFamide' (Baggerman et al., 2002). For some mature peptides, the precursor proteins (genes) are unknown, such as the 2-amino-acid neuropeptide 'GWamide' (P83570) from *Sepia officinalis* (Henry et al., 1997) and the human growth-modulating peptide 'GHK' (P01157) (Schlesinger et al., 1977). The existence of numerous short bioactive peptides within the precursor proteins implies that only a very small conserved peptide motif may be a biologically important functional portion of the precursors.

Because only short sequence regions are conserved, peptides or their precursors are sometimes not identified by existing sequence alignment algorithms, e.g. BLAST, or by motif search methods. While BLAST programs (Altschul et al., 1997) are very suitable for scanning databases for homologous proteins, they are far less efficient at finding similarities to short conserved regions, which can be only a few amino acids in length, when a whole genome sequence is scanned. For large precursors, which are usually a few hundred amino acids in length and in which the biologically conserved regions are limited, the important domains are often masked by long, unrelated sequence regions. This is because, for any two random large protein sequences, BLAST can usually find a relatively long local alignment, at least longer than the short conserved peptide motif, and BLAST tends to assign a higher score to a longer alignment (Durbin et al., 1998). In addition, if a pair of homologues involves a short independent peptide molecule, which may be either an unknown peptide sequence used as query or a known mature peptide from a protein database as target, it is difficult for BLAST to detect the pair, because the short sequence makes the pairwise alignment less likely to obtain a significant BLAST score (e.g., e-value < 0.01).

Like BLAST, motif search methods are important tools for searching a database for a protein, but they too are limited in their ability to detect all members of a characterized peptide family. Most of the motifs in the existing databases, e.g. PROSITE (Hulo et al., 2004) and Pfam (Finn et al., 2010), cover entire precursor sequences or sequence domains that are much longer than the conserved bioactive peptide regions. The database motifs therefore show their weakness when they are used to detect short mature peptides for which the precursors are unknown, so that information on the sequences outside the peptide regions is missing. In addition, the construction of these motifs requires a good multiple protein sequence alignment in order to produce an accurate signature. This works well when the sequences are easy to align. However, for peptide families in which the conserved regions are very short and the bulk of the precursor sequence is poorly preserved, the multiple alignment is very difficult to obtain or evaluate. The overall precursor protein sequence identity, especially in distantly related homologues, may be too low for an accurate alignment. In some cases, the short conserved regions are repeated within a precursor, making it even more challenging to build a unique alignment that truly reflects the evolutionary relationship.

In this chapter, we have followed an alternative approach, taking unaligned sequences as a starting point and using a pattern search program to look for conserved patterns. We first collected all currently annotated peptides and peptide precursor proteins in Metazoa through a search in Uniprot and classified them into peptide families. Next, we extracted the peptide sequences in each family and used the program Pratt to search the sequences for representative patterns. Such patterns consist of highly conserved positions that can be separated by fixed or variable spacing. The patterns are then refined by incorporating the information available in the literature on the important amino acids contained within the biologically active site(s) of the peptides. The specificity of the generated patterns is further verified by scanning them against Uniprot in order to ascertain that proteins picked up by the patterns are either annotated as peptides or peptide precursor proteins or have an unknown function.

#### **3. Data collection**


#### **3.1 Peptide precursor collection and classification**

A protein is collected into a peptide-precursor database if it is annotated in the Uniprot protein database (release 6.6), consisting of Swiss-Prot (release 48.6) and TrEMBL (release 31.6), with one of the following keywords: hormone, antimicrobial, toxin. The hormone category includes bombesin, bradykinin, cytokine, glucagon, growth factor, hormone, hypotensive agent, insulin, neuropeptide, neurotransmitter, opioid peptide, pyrokinin, tachykinin, thyroid hormone, vasoactive, vasoconstrictor and vasodilator (the definitions of the keywords can be found in the database). The antimicrobial category consists of antibiotic, antiviral defense, defensin and fungicide, while the toxin category includes naturally produced and secreted poisonous proteins that damage or kill other cells. However, a protein that is also characterized by non-peptide keywords, such as receptor, signal-anchor, transmembrane, binding protein, DNA binding, nuclear protein, transport, collagen, enzyme or words ending in 'ase' (excluding 'disease'), is excluded in order to avoid selecting proteins that are not peptides or peptide precursors.
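The keyword rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `is_peptide_candidate`, the exact-match comparison after lowercasing, and the representation of an entry as a plain list of keyword strings are not from the original work, and real Uniprot keyword spellings may differ from the terms listed in the text.

```python
import re

# Inclusion keywords: the three top-level categories and the sub-keywords
# listed in the text (spellings taken from the text, lowercased).
INCLUDE_KEYWORDS = {
    "hormone", "antimicrobial", "toxin",
    "bombesin", "bradykinin", "cytokine", "glucagon", "growth factor",
    "hypotensive agent", "insulin", "neuropeptide", "neurotransmitter",
    "opioid peptide", "pyrokinin", "tachykinin", "thyroid hormone",
    "vasoactive", "vasoconstrictor", "vasodilator",
    "antibiotic", "antiviral defense", "defensin", "fungicide",
}

# Non-peptide keywords that cause an entry to be rejected.
EXCLUDE_KEYWORDS = {
    "receptor", "signal-anchor", "transmembrane", "binding protein",
    "dna binding", "nuclear protein", "transport", "collagen", "enzyme",
}


def is_peptide_candidate(keywords):
    """Return True if a list of keywords passes the inclusion/exclusion rule."""
    kws = [k.strip().lower() for k in keywords]
    if not any(k in INCLUDE_KEYWORDS for k in kws):
        return False
    for k in kws:
        if k in EXCLUDE_KEYWORDS:
            return False
        # Reject any word ending in 'ase' (e.g. 'protease'), except 'disease'.
        for word in re.findall(r"[a-z]+", k):
            if word.endswith("ase") and word != "disease":
                return False
    return True
```

For example, an entry tagged with "Hormone" and "Signal" passes, whereas one tagged with "Toxin" and "Protease" is rejected by the 'ase' rule.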

Stand-alone PSI-BLAST (ftp://ftp.ncbi.nih.gov/blast/executables/) is then used to align all the assembled sequences with all the Uniprot proteins except the ones that are already in the peptide-precursor database. Based on the conserved sequence characteristics of peptide families, the score matrix PAM30 is used and the word size is set to 2, allowing for the search for short but strong similarities. The proteins that show significant similarities (e-value < 0.01) with the known peptides or precursors are retained. The list obtained is then checked manually in terms of each protein's cellular location, molecular function and biological process as stated by GO (gene ontology) terms or in the literature. As a result, 1345 more proteins that have not yet been annotated in Uniprot are added to the peptide-precursor database.

Proteins collected in this database are automatically classified into peptide families if family classification information, based on a significant match to an existing motif or on sequence similarity, is available in Uniprot. Otherwise, proteins that display sequence similarities with a significant BLAST score are clustered into the same family. A protein can also be assigned to a particular family based on its molecular function as described in the literature.

#### **3.2 In silico extraction of peptides**

From each precursor protein in a peptide family, the bioactive peptide sequences are extracted in silico using the beginning and ending positions of the subsequences that are annotated as 'peptide' or 'chain' in the 'feature' lines of the corresponding protein file in Uniprot. The conserved basic cleavage sites flanking the peptides, which contribute to the endoproteolytic cleavage of the peptides from their precursors, such as the monobasic site (G)R or (G)K, the dibasic sites (G)KR, (G)RR, (G)KK or (G)RK, or a combination of consecutive K or R, are also extracted along with the subsequences (Liu & Wets, 2005; Rouille et al., 1995).
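As a rough illustration of this extraction step, the sketch below cuts a peptide out of a precursor using 1-based, inclusive feature coordinates and extends the slice over adjacent basic cleavage residues. The function name `extract_with_cleavage_sites`, the example precursor, and the exact extension rule (a run of K/R on the N-terminal side; an optional G followed by one or more K/R on the C-terminal side) are assumptions made for this example, not the authors' implementation.

```python
import re


def extract_with_cleavage_sites(precursor, start, end):
    """Return precursor[start..end] (1-based, inclusive) plus flanking basic sites."""
    left, right = start - 1, end  # convert to a 0-based, half-open slice
    # N-terminal side: absorb a run of K/R immediately before the peptide.
    while left > 0 and precursor[left - 1] in "KR":
        left -= 1
    # C-terminal side: absorb an optional G followed by one or more K/R.
    m = re.match(r"G?[KR]+", precursor[right:])
    if m:
        right += m.end()
    return precursor[left:right]


# Hypothetical precursor: FMRF at positions 7-10, flanked by KR and GKR.
print(extract_with_cleavage_sites("MAAAKRFMRFGKRSSS", 7, 10))  # KRFMRFGKR
```

A glycine that is not followed by a basic residue is left out, since on its own it does not form one of the cleavage sites listed above.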

Entries in the family that consist only of the peptide sequence, i.e. those cases where the precursor is unknown, are also retained. Proteins less than 200aa (amino acids) in length, which contain an N-terminal signal peptide and for which no mature peptides have yet been identified, presumably contain a single peptide and are therefore also deposited after in silico removal of the N-terminal signal peptide. According to the statistics on all annotated bioactive peptide sequences in Uniprot, 97% are no longer than the 200aa threshold value. The presence of a signal peptide is assumed when it is indicated in Uniprot; in other cases, it is predicted by the signal peptide prediction program SignalP (http://www.cbs.dtu.dk/services/SignalP/).

In total, 110 peptide family datasets are formed, each including at least 10 peptide sequences. The extracted peptide sequences in each family are scanned independently for patterns conserved in that family.

#### **4. Method**

Various software tools available on the internet allow users to search for patterns conserved in a set of unaligned protein sequences. Pratt (http://www.ebi.ac.uk/pratt/#) (Jonassen et al., 1995) is a pattern search tool that is flexible in the number of parameters users can control. It allows searching for patterns of conserved positions with limited variable-length spacing, which is important because variable loop sizes can occur even in well-conserved peptide regions. Pratt is run on each of the peptide family datasets, and the search parameters are set based on the maximum pattern length and pattern flexibilities found in the existing peptide patterns in PROSITE.
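To make the notion of conserved positions separated by fixed or variable spacing concrete, the following sketch translates a pattern written in the common PROSITE notation into a Python regular expression. It is an illustrative helper only: the function name `prosite_to_regex` and the example patterns are hypothetical, and the sketch handles only the basic elements (single residues, `x` wildcards, `[..]` and `{..}` sets, repeat counts, and the `<` and `>` terminus anchors), not the full PROSITE syntax.

```python
import re

# One pattern element: a residue, wildcard or set, with an optional repeat count.
_ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        if residue in ("x", "X"):
            core = "."                               # wildcard: any amino acid
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"        # forbidden residues
        else:
            core = residue                           # letter or [..] set as-is
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


print(prosite_to_regex("F-x(2,4)-[KR]-{P}-G>"))  # F.{2,4}[KR][^P]G$
```

Here `x(2,4)` is a flexible wildcard of two to four residues, and the trailing `>` restricts matches to the C-terminus of the sequence.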


Each Pratt run starts with the minimum percentage of sequences that must match the pattern (the parameter C%) set to 90%, and the most significant pattern, i.e. the one with the highest fitness in the Pratt output list, is retained. The pattern obtained is then refined by integrating the information on important functional sites in the matched peptide sequences as described in the literature. The amino acids occurring at these sites are added to the pattern if they are absent from the corresponding positions in the pattern.

The pattern is further verified by scanning it against all the Uniprot proteins using the ScanProsite tool (http://www.expasy.org/tools/scanprosite/). Two cases can occur: (1) If the pattern is not contained in any known non-peptide protein, it is retained as a conserved peptide pattern. (2) Otherwise, if the pattern is matched by both peptide and non-peptide proteins (further referred to as true and false positive hits, respectively), it is processed as follows. (2a) If the pattern does not include any wildcard region where any amino acid is accepted, the positions at which the pattern occurs in all matching protein sequences are checked. If the pattern occurs exclusively at the N- or C-terminus of the true positive hits, or if the peptide proteins are all small molecules, the pattern is retained with a constraint ('<' or '>') imposed at its N- or C-terminus to limit the maximum distance between the conserved pattern region and the N- or C-terminus of the peptide or precursor protein. If the pattern with such a restriction still cannot distinguish the true positives from the false ones, it is eliminated. (2b) If the pattern has wildcard regions, the sequence fragments corresponding to the pattern in all the matching sequences are extracted and aligned. If, in this alignment, the amino acids found in a wildcard region X of the true positive hits have different physicochemical properties from those found in the false positive hits, the region X is replaced by the group of amino acids occurring distinctively in the true positive proteins. If instead the two groups of amino acids share identical physicochemical properties, the pattern is discarded. For this comparison, the amino acid symbol sets DE, KRH, NQ, ST, ILV, FWY, AG, C, M and P, which are classified according to the physicochemical nature of the side groups (Smith & Smith), are used.
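Case (2b) can be illustrated with a small sketch for a single wildcard column. It reflects one possible reading of the rule, namely that the wildcard is replaced only when the residues seen in true and false positive hits fall into disjoint physicochemical groups; the function name `refine_wildcard` and this disjointness criterion are assumptions, not a statement of the authors' exact procedure.

```python
# Physicochemical residue groups as listed in the text.
GROUPS = ["DE", "KRH", "NQ", "ST", "ILV", "FWY", "AG", "C", "M", "P"]
GROUP_OF = {aa: group for group in GROUPS for aa in group}


def refine_wildcard(true_residues, false_residues):
    """Return a residue set to replace wildcard X, or None to discard the pattern.

    true_residues / false_residues: residues observed at the wildcard column
    in the aligned true positive and false positive hits, respectively.
    """
    true_groups = {GROUP_OF[aa] for aa in true_residues}
    false_groups = {GROUP_OF[aa] for aa in false_residues}
    if true_groups & false_groups:
        # Shared physicochemical properties: the column cannot separate the hits.
        return None
    return "[" + "".join(sorted(set(true_residues))) + "]"


print(refine_wildcard("DEE", "KR"))  # [DE]
print(refine_wildcard("IL", "V"))    # None
```

In the first call the acidic residues in the true hits are distinct from the basic residues in the false hits, so the wildcard is narrowed to `[DE]`; in the second, all residues belong to the ILV group and the pattern would be discarded.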

If a conserved pattern cannot be obtained, the parameter C% is reduced by 10%, and Pratt is re-run against the same dataset. As the percentage of sequences required to match the pattern decreases, a pattern emerges that is usually longer and contains more sites than the previous one, and it is processed by the same refinement and verification. The procedure is repeated until a pattern is discovered that represents the majority of a group of related peptide sequences and rules out any known non-peptide proteins.
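The stepwise relaxation of C% can be summarized as a simple loop. In the sketch below, `run_pratt` and `is_specific` are hypothetical callables standing in for a Pratt run (including literature-based refinement) and for the ScanProsite verification, respectively; neither corresponds to a real API, and the lower bound of 50% follows the parameter values used in this chapter.

```python
def search_pattern(dataset, run_pratt, is_specific, start=90, stop=50, step=10):
    """Lower C% from `start` to `stop` until a specific pattern is found.

    run_pratt(dataset, c_percent) -> best pattern or None   (hypothetical)
    is_specific(pattern) -> True if no non-peptide protein matches (hypothetical)
    """
    for c_percent in range(start, stop - 1, -step):
        pattern = run_pratt(dataset, c_percent)
        if pattern is not None and is_specific(pattern):
            return pattern, c_percent
    # No pattern covering at least `stop` percent of the sequences was found.
    return None, None
```

With the default arguments the loop tries C% values of 90, 80, 70, 60 and 50 in turn and stops at the first pattern that passes verification.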

Once a conserved pattern is identified in the peptide family dataset, the program ps-scan (ftp://ftp.expasy.org/databases/prosite/tools/ps\_scan/sources/) is run locally on the pattern against this dataset. The sequence regions which match the pattern are removed from the original peptides. Each of the two remaining parts of the peptide sequences at their N- and C-terminus is left to form an independent sequence if it is not less than 4aa in length, given the assumption that the minimum length of the peptide pattern we search for is not less than this value. Thus, a reduced dataset is created including not only the peptides which are not covered by the identified pattern, but also the remaining sequences of the original peptides that match the pattern. This methodology is based on the fact that a peptide precursor protein may contain several conserved regions, and that our extracted peptide sequences include long peptide chains which may contain a few shorter, unrelated, bioactive peptides. The reduced peptide family dataset is then scanned by Pratt to discover the next pattern. The search procedure is repeated until the parameter C% is less than 50%.

A Pattern Search Method for Discovering Conserved Motifs in Bioactive Peptide Families 127

[Flowchart; box labels as extracted: Begin; A peptide family dataset; Minimum Percentage (C%) = 90%; Run Pratt and select the most significant pattern; Refine by literature; Run ScanProsite to the pattern against Uniprot; If the pattern is contained in nonpeptide proteins; If the pattern is at the terminus of precursors, or if the precursors are small molecules; A constraint is imposed at the N- or C-terminus of the pattern; If the pattern has wild regions; Align all matched sequences in the peptide and nonpeptide proteins; If the letters at X have different properties between peptides and nonpeptides; New constraints are imposed at the wild regions in the pattern; A conserved pattern is obtained; Run ps-scan locally against the peptide family dataset; Remove the matched sequence parts, a reduced dataset is created; C% is reduced by 10%; If C% >= 50%?; End]

Fig. 1. Procedure for searching patterns in peptide sequences.

Note: The parameters are set as follows: the maximum pattern length (PL) is 52, the maximum length of a wildcard (PX) is 15, the maximum number of flexible wildcards (FN) is 3, the maximum flexibility of a flexible wildcard (FL) is 8, the upper limit on the product of flexibilities for a pattern (FP) is 48, the minimum percentage of sequences to match the pattern (C%) is 90, 80, 70, 60 and 50%, respectively, and all other parameters are at default.

At that point, the remaining dataset contains no further pattern that represents the majority of its sequences.

Fig. 1 shows the scheme of the pattern search procedure described above, which aims to examine short bioactive peptide sequences rather than their large precursor molecules, and to take into account not only the biologically functional sites of each individual peptide discussed in the literature, but also the general information that the computational tool Pratt extracts from all related peptides in a family.
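The dataset reduction and the stepwise relaxation of C% described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `find_pattern` is a hypothetical stand-in for a Pratt run followed by manual refinement, `toy_finder` is an invented example of such a callable, the pattern is assumed to be available as a Python regular expression, only the first match in each sequence is removed, and C% is assumed to be lowered in steps of 10 without being reset after a pattern is found.

```python
import re

MIN_LEN = 4  # minimum fragment length in aa, as stated in the text


def reduce_dataset(sequences, regex):
    """Cut the first region matching `regex` out of each sequence and keep
    the N- and C-terminal remainders that are at least MIN_LEN long.
    Sequences without a match are carried over unchanged."""
    reduced = []
    for seq in sequences:
        m = regex.search(seq)
        if m is None:
            reduced.append(seq)
            continue
        for flank in (seq[:m.start()], seq[m.end():]):
            if len(flank) >= MIN_LEN:
                reduced.append(flank)
    return reduced


def iterative_search(sequences, find_pattern, start=90, stop=50, step=10):
    """Repeat pattern discovery on ever smaller datasets.

    find_pattern(seqs, min_percent) is a hypothetical callable that returns
    a compiled regex matching at least min_percent of seqs, or None."""
    patterns, data, c = [], list(sequences), start
    while c >= stop and data:
        regex = find_pattern(data, c)
        reduced = reduce_dataset(data, regex) if regex is not None else data
        if regex is None or reduced == data:
            c -= step  # nothing new at this threshold: relax C%
            continue
        patterns.append(regex.pattern)
        data = reduced
    return patterns, data


def toy_finder(seqs, min_percent, motif="FMRF"):
    """Invented stand-in for Pratt: report `motif` only if enough
    sequences contain it."""
    hits = sum(motif in s for s in seqs)
    if seqs and 100.0 * hits / len(seqs) >= min_percent:
        return re.compile(motif)
    return None


if __name__ == "__main__":
    toy = ["AAAAFMRFGGGGG", "FMRFKKKK", "CCFMRF", "QQQQQQ"]
    print(iterative_search(toy, toy_finder))
```

In the toy run the motif is found once C% has dropped to 70, the two-residue remainder `CC` is discarded because it is shorter than 4 aa, and the loop stops when no pattern is found down to C% = 50.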

#### **5. Results**

#### **5.1 'PeptideMotif' database**

We have built a peptide-precursor database consisting of 11,688 peptides and precursor proteins originating from 1420 metazoan organisms, of which 11,437 proteins (98%) are categorized into 110 distinct peptide families. From the bioactive peptide sequences drawn from these families, we uncovered a total of 211 conserved patterns, which are assembled into the peptide motif database 'PeptideMotif'.

The patterns range from 4 to 52 amino acids (columns) in length, and 78 of them (37%) are no longer than 10 aa. Each pattern covers most of the peptides or precursors belonging to the corresponding family, while false positives are kept at zero: a pattern is accepted only if every known protein that matches it is indeed a peptide or precursor protein from this family.
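The acceptance criterion above, high coverage within the family and no matches outside it, can be tallied as in the following sketch. The function name `evaluate_pattern` and the short sequences are invented for illustration, and the pattern is assumed to have been translated into a Python regular expression beforehand.

```python
import re


def evaluate_pattern(regex, family_seqs, other_seqs):
    """Return (coverage in percent, number of false positives).

    family_seqs: sequences known to belong to the peptide family
    other_seqs:  known sequences that do not belong to it"""
    pattern = re.compile(regex)
    true_hits = sum(1 for s in family_seqs if pattern.search(s))
    false_hits = sum(1 for s in other_seqs if pattern.search(s))
    coverage = 100.0 * true_hits / len(family_seqs) if family_seqs else 0.0
    return coverage, false_hits


if __name__ == "__main__":
    # Toy sequences, invented for this example only.
    family = ["AACFWKYCV", "CFWKYC", "GGCFWKYC", "AAAA"]
    others = ["MKTLLLC", "CFWKAC"]
    print(evaluate_pattern("CFWKYC", family, others))
```

Under the criterion in the text, a pattern would be kept only if the second value is zero.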

#### **5.2 Comparison with the other motif databases**

The PROSITE database (http://ca.expasy.org/prosite) is a motif database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify the known protein family (if any) to which a new sequence belongs. Release 19.9 contains 56 entries (patterns) describing 55 peptide families in Metazoa (the omega-atracotoxin family has two patterns), belonging to the categories of cytokines and growth factors, hormones and active peptides, and toxins. All 55 families are also covered by patterns in the 'PeptideMotif' database, and these peptide patterns (Table 1) are similar in length to their PROSITE counterparts. However, comparing the conserved sequence characteristics revealed by the motifs in the two databases, more amino acids are allowed at the conserved sites or wildcard regions of the 'PeptideMotif' patterns. This is because the identified peptide patterns are trained not only against the Swiss-Prot protein database, which PROSITE also uses as its test dataset, but also against the TrEMBL database, in which many proteins are likewise annotated by keywords or literature. In addition, for 25 of the 55 families we have found 34 additional novel patterns, which are marked as 'new' in Table 1.

The remaining 121 'PeptideMotif' patterns, presented in Table 2, allow the identification of 55 peptide families that are not covered by PROSITE signatures; they cover 3866 bioactive peptide sequences cleaved from 3572 precursors. Of these patterns, 28, representing 12 families, are also not characterized by any other motif database, such as Pfam (Bateman et al., 2004) and CDD (Marchler-Bauer et al., 2005). The conserved sequence in these families is short and often occurs repeatedly within the same precursor protein. The sequences outside the conserved region are not well preserved, and thus a probabilistic model based on protein sequence alignments cannot efficiently characterize such peptide families.
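PROSITE-style patterns such as those listed in Tables 1 and 2 can be applied to a sequence by translating them into regular expressions. The sketch below illustrates one way to do this; `prosite_to_regex` is a hypothetical helper and is not part of ps-scan or ScanProsite. It covers only the syntax elements that appear in the tables (single residues, `x`, `[...]`, `{...}`, repeat counts, the `<` and `>` anchors, and `>` inside square brackets), and the decapeptide used in the usage lines is an illustrative input only.

```python
import re

# One pattern element, optionally followed by a repeat count: x(4), {C}(7,9)
_ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().split("-"):
        element = element.strip()
        prefix = suffix = ""
        if element.startswith("<"):      # N-terminal anchor
            prefix, element = "^", element[1:].strip()
        if element.endswith(">"):        # C-terminal anchor
            suffix, element = "$", element[:-1].strip()
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("cannot parse element: %r" % element)
        core, low, high = match.groups()
        if core == "x":                  # any residue
            body = "."
        elif len(core) == 1 and core.isupper():
            body = core                  # one specific residue
        elif core.startswith("[") and core.endswith("]"):
            residues = core[1:-1]
            body = "[" + residues.replace(">", "") + "]"
            if ">" in residues:          # a residue or the C-terminus
                body = "(?:" + body + "|$)"
        elif core.startswith("{") and core.endswith("}"):
            body = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            raise ValueError("unsupported element: %r" % element)
        if low is not None:
            body += "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(prefix + body + suffix)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("Q-[HY]-[FYW]-S-x(4)-P-G>")
    print(regex)
    print(bool(re.search(regex, "QHWSYGLRPG")))
```

A dedicated scanner such as ps-scan remains the appropriate tool for database-scale searches; the translation above is only meant to make the pattern notation concrete.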




**Cytokines and growth factors**

**(1) Granulins;** (1) C-x-D-x(2)-H-C-C-{LIVM}-x(4)-C; {42, 241, 2}; {Q616A1, Q7JKP2, Q9U362}

**(2) HBGF/FGF;** (1) G-x-[LIVM]-{AGNP}-[STAGP]-{AGC}-{C}-x-{KRHNDE}-{WPC}-x- [STAGDENKRHQ](0,1)-[AGST](0,1)-[DENA]-C-{QP}-[FYLIVM]-{C}-[EQH]-x-{P}-{C}-{LIVM}- [DENKRHL]-{PLIVMDE}-[YHF]; (2) [GR]-[LIVM]-[LIVM]-{CWPDE}-[LIVM]-{PST}-{QLIVM}-x- [KRDEVIAGQFYNCS]-[STAGLMHQ]-{CP}-{AGDEN}-[FY]-[LIVM]-[AGSC]-[MLIV]-[NSTDEK]- [GAKRSTNDEQ]-[EDNKRHSTQA]-G (new); (3) G-S-[RHKQ]-[LIVM]-{CWPDE}-[LIVM]-{PST}- {QLIVM}-x-[KRDEVIAGQFYNCS]-[STAGLMHQ]-{CP}-{AGDEN}-[FY]-[LIVM]-[AGSC]-[MLIV]-[NSTDEK]-[GAKRSTNDEQ]-[EDNKRHSTQA]-G (new); {300, 530, 44}

**(3) PTN/MK heparin-binding;** (1) S-[DE]-C-x-[DE]-W-x-W-x(2)-C-x-P-x-[SN]-x-D-C-G-[LIVMA]-G-x-R-E-G (identical); (2) C-[KR]-[YF]-x-[KRFY]-x(2)-W-[AGST]-x-C-[DENST] (new); {51, 84, 1}

**(4) Nerve growth factor;** (1) [GSRAED]-[CR]-[KRLIVM]-G-[LIVAT]-[DE]-{C}-x(2)-[YW]-{P}-S-x-[CR]; (2) [SAP]-[LIVA]-C-[DEY]-[SAG]-{WM}-[STDENC]-x-W-[VE]-[AGSTNI] (new); {321, 471, 12}

**(5) Platelet-derived growth factor (PDGF);** (1) P-[PSRAKQGL]-C-[LIVMFYAGST]-x(3)-[RQ]-C-[AGSTMLIVN]-G-S(0,1)-[CN]-C; {158, 158, 23}

**(6) Small cytokines C-x-C;** (1) C-x-C-{CFYW}-{CW}-x(3)-{P}-x(2)-{C}(8)-x(5,8)-C-x(2,3)-[EQMA]- [LIVMTE]-[LIVMF]-x(9,14)-C-[LIVMRK]-[DENH]; {206, 206, 18}; {Q6DUZ6, Q6GLX8, Q4T8B9}

**(7) Small cytokines (intercrine/chemokine) C-C;** (1) C-C-[LIVMFYSTQRKHDE]-{P}(2)-{CDE}-{C}(7)-x(2,5)-{P}-[FYWAC]-{C}(2)-x(3,6)-C-{KM}-{C}(1,3)-[SAG]-[LIVMTS]-[LIVMRTDE]-[FYLIVDE]-{C}(7,10)-C-[STAGVILM]; {234, 234, 27}; {Q3ZBN3, Q32L58}

**(8) TGF-beta;** (1) [WFYSTKRHL]-[LIVM]-[LIVMKRHF]-{CPNL}-P-{FY}-{PCW}-[FYILVA]-{C}-{QCWKRH}-{PA}-{PAGC}-C-{C}-[GE]-{C}-C; {766, 766, 59}

**(9) Interferon alpha, beta and delta;** (1) [FYH]-[FY]-{CP}-[GNRKCDSTI]-[LIVM]-{W}-{AGC}- [KRN](0,1)-[FYLVIMN]-L-{PAG}-{C}-{PST}-{PFYW}-[FYHDEN]-x-{QY}-[CYQE]-[AT]-W; (2) L- {QKR}-x(0,4)-[GAEDVI]-[LVI]-[QHNDEFY]-[RQ]-[QH]-[LMIV]-[DENQVSTR]-x-L-[DENKRQ]-x-C-[LIVMKRQG] (new); {272, 442, 29}

**(10) Granulocyte-macrophage colony-stimulating factor;** (1) C-P-[LP]-T-{ST}-E-x-{QLIVMT}-C; {25, 25, 1}; {Q4G094}

**(11) Interleukin-1;** (1) [LIVSTNDEFH]-[YESTMVIR]-[LFC]-{AGCFYL}-[SA]-[ASLV]-{CFY}- [CFYWH]-[PKRST]-{FYLC}-[WHLIVM]-[FYL]-[LI]-[SCA]-[TSVG]-x(6)-[PKRHCLIVMT]-x(0,2)-[LIVM]-[AGSTCVINDE]; {128, 128, 24}

**(12) Interleukin\_2;** (1) [ST]-E-[LF]-x(2)-L-x-C-L-x-[EDN]-E-L; {74, 74, 14}

**(13) Interleukin\_4\_13;** (1) [LI]-x-E-[LIVM](2)-{Q}(4)-x(0,1)-[LIVM]-[TL]-x(5,7)-C-x(2)-[LMIVST]-x- [IV]-x-[DNS]-[LIVMA]; (2) [KREV]-N-[STA]-[STED]-[DEAG]-{C}(3,4)-C-[RKT]-[AV]-x(11,17)-C (new); {73, 119, 4}

**(14) Interleukin\_6;** (1) C-x(9)-C-[FYLIVM]-x(5)-G-L-x(2)-[FY]-x(3)-L; {69, 69, 8}

**(15) Interleukin\_7\_9;** (1) N-[DAT]-[LAPS]-[SCT]-F-L-K-{AGDE}-L-L; {20, 20, 2}

**(16) Interleukin\_10;** (1) [KQSN]-{C}(4)-C-[QYCH]-x(4)-[LIVM](2)-x-[FL]-[FYT]-[LMVRT]-x-[DERST]-[IV]-[LMF]; {75, 75, 12}

**(17) LIF / OSM;** (1) [PSTA]-x(4)-F-[NQ]-x-K-x(3)-[CG]-x-[LF]-L-x(2)-Y-[HK]; {24, 24, 4}

**(18) Osteopontin;** (1) P-x(1,5)-[KQ]-x-[TA]-x(2)-[GA]-S-S-E-E-K; {27, 27, 0}

**Hormones**

**(19) Adipokinetic** (1) [AGC]-Q-[LVI]-[NT]-[FY]-[ST]-[PASTKR]-[AGWSDEN]-W-[AGNDEST]; (2) <Q-[LVI]-[NT]-[FY]-[ST]-[PASTKR]-[AGWSDEN]-W-[AGNDEST>] (new); {45, 45, 0} {Q5TTQ9}

**(20) Bombesin-like peptides** (1) [HLIVMQ]-W-A-[STIVRK]-G-[SH]-[LF]-M; {42, 42, 1}

**(21) Calcitonin/CGRP/IAPP** (1) [KR]-R-x(0,1)-C-[SAGDNT]-[STNG]-x(0,1)-[STAGVIL]-[TS]-C- [VMALI]-x(3)-[LYF]-x(3)-[LYFVI]; (2) <x(0,1)-C-[SAGDNT]-[STNG]-x(0,1)-[STAGVIL]-[TS]-C-[VMALI]-x(3)-[LYF]-x(3)-[LYFVI] (new); {83, 84, 7}

**(22) Corticotropin-releasing factor** (1) [KR]-R-x(0,28)-[PQASLVIG]-[STPI]-[LIVM]-S-[LIVM]-x- [LIVMNAG]-[PST]-[LIVMFT]-x-[LIVM]-[LM]-[RN]-x(2)-[LIVMWF]; (2) <x(0,8)-[PQASLVIG]-[STPI]- [LIVM]-S-[LIVM]-x-[LIVMNAG]-[PST]-[LIVMFT]-x-[LIVM]-[LM]-[RN]-x(2)-[LIVMWF] (new); (3) T-R-[PQASLVIG]-[STPI]-[LIVM]-S-[LIVM]-x-[LIVMNAG]-[PST]-[LIVMFT]-x-[LIVM]-[LM]-[RN]-x(2)- [LIVMWF] (new); {64, 64, 9}; {Q4RWF4}

**(23) Arthropod CHH/MIH/GIH neurohormones** (1) [LIVM]-{C}-x(2)-C-[KR]-{FY}-[DENGRKHQ]-C- [FY]-{C}-{AGKRC}-{C}(2)-[FYILVM]-{C}-{CP}-C; {135, 135, 5} {Q23247}

**(24) Erythropoietin/thrombopoietin** (1) P-x(4)-C-D-x-R-[LIVM](2)-x-[KRH]-x(14)-C; {34, 34, 8, 0}

**(25)Granins** (1){DEF}-[DE]-[SN]-L-[SAN]-[AD]-[LIMVKR]-[DE]-[AGLSTQ]-E-L; (2) [LIVM]-x- [KHR]-C-[LIVM](2)-[ED]-[LIVM](2)-x(5)-[KRH]-[STP]-x(3)-[PST]-x(4)-C (new); (3) K-R-[STAG]- [NDEST]-[ED]-x(2)-[DE]-[DEGA]-[QKR]-Y-[AGST]-P-Q (new); {63, 96, 5}; {Q86T07, Q4RYY8, Q566G8}

**(26) Galanin** (1) G-W-[ST]-L-N-[ST]-[AG]-[AG]-[FY]-[LIVM]-[LIVM]-G-P; (2) <L-N-[ST]-[AG]-[AG]- [FY]-[LIVM]-[LIVM]-G-P (new); {31, 31, 1}

**(27) Gastrin/cholecystokinin** (1) [FY]-x(0,2)-[GADN]-[AS](0,1)-[WH]-[MFLIV]-[DR]-F-G-[KR]-[RS]; (2) Y-x(0,2)-[GA]-[AS](0,1)-[WH]-[MFL]-[DR]-F> (new); {88, 102, 4}

**(28)Glucagon/GIP/secretin/VIP** (1) [YH]-[STAIVGD]-[DENQ]-[AGF]-[LIVMSTE]-[FY]- {QLPAGDEKR}-[DENSTAK]-[DENSTA]-[LIVMFYG]-[RKSTDEN]-x(3)-{P}-{P}-x(2)-[AGSTLIVMQ]- [KREQL]-[KRDENQL]-[LVFYWG]-[LIVQ]; {202, 305, 8}

**(29) Glycoprotein hormones alpha chain** (1) C-x-G-C-C-[FY]-S-x-A-[FY]-P-T-P; {109, 109, 4}

**(30) Glycoprotein hormones beta chain** (1)C-{C}(2)-[CW]-{C}(7,9)-C-[STAGMLIVED]-G-[HFYLRS]- C-{C}-[STA]; (2) <x(0,8)-C-[STAGMDEVLI]-G-[HFYL]-C-{CKRH}-[ST] (new); (3) <x-[CW]-{C}(7,9)-C- [STAGDEVLIM]-G-[HFYL]-C-{C}-[ST] (new); {341, 341, 13}

**(31) Gonadotropin-releasing hormones** (1) Q-[HY]-[FYW]-S-x(4)-P-G-G-[KR]-R; (2) Q-[HY]-[FYW]- S-x(4)-P-G> (new); {178, 188, 4}

**(32) Insulin** (1){C}(2)-[IVLMPSTAFYR]-{CNE}-x-{C}-C-C-{CPM}-{P}-{CHW}-C-[STDNEKIGQ]-{C}(2)- {CPAG}-[LIVMFSQ]-{CD}-{CPW}-{CHDEP}-C; (2) <x(0,205)-C-G-{FYILVMQW}-{CWPSTLIVM}- [LIVFY]-[VILMASTPH]-{AGHCFYPQW}-{CPQSW}-[LIVMRKHQWF]-{CNP}-{WCQP}-[LVIMATC]- C-{LM}-x(0,204)> (new);{507, 877, 52} {Q32L79, Q621L6, Q61VN2, Q61GN7, Q4T1R8}

**(33) Natriuretic peptides** (1) C-F-G-x(3)-[DEA]-[RH]-I-x(3)-[ST]-x(2)-G-C; {155, 155,10}

**(34) Neurohypophysial hormones** (1) C-[LIFY]-[LIFYV]-x-N-C-P-x-G; (2) C-x(2,6)-[CW]-G-x(4,6)-C- [FYAGLIVM]-x(3)-[LIVFY]-C-C (new); {112, 259, 4}

**(35) Neuromedin U and S** (1) [FY]-[LIVMF]-[FY]-R-P-R-N-G-[KR]; (2) [FY]-[LIVMF]-[FY]-R-P-R-N> (new); {24, 24, 3}

**(36) Pancreatic** (1) [FY]-x(2)-{LIVM}-[LIVM]-x(2)-[YK]-x(3)-[LIVMFYRHK]-x-R-[PQVH]-R-[YF]- [GD]-[KR]-[RS]; (2) [FY]-x(3)-[LIVM]-x(2)-[YK]-x(3)-[LIVMFYRHK]-x-R-[PQVH]-R-[YF]-x(0,1)> (new); {118, 118, 7}

**(37) Parathyroid hormone** (1) [KR]-R-x-[VI]-[STAGFYN]-[EH]-x-Q-x(2)-H-[DEN]-x-[GR]; {54, 54, 3}

**(38) Pyrokinins** (1) [AGHNQDEST]-{FYST}-[PQVIWFYED]-[FY]-[AGST]-P-R-[LI]-G-[KR]-R; (2) [AGHNQDEST]-{FYST}-[PQVIWFYED]-[FY]-[AGST]-P-R-[LI]> (new); {72, 89, 4} {Q7PTL2, Q5TV14}

**(39) Somatotropin** (1) C-{KRAG}-[STNRAC]-x(2)-[LIVMFYSRNW]-x-[LIVMSTAGY]-P-x(2)-{FYW}-x(2)-[TALIVMSHN]-x(7)-[LIVMFYP]-x(2)-{QHKR}-{KRHP}-{NW}-x-[LIVMFYR]-[LIVMSTC]-x- [STACVLMIG]-W; (2) C-[LIVMFG]-x-[KHRSNDEQVI]-[DEN]-{CNDEPQ}-{AGLMVI}-[KRMT]- {DENKRHPQ}-x-[STNALIVMF]-[FYLIVMKS]-[LIMVT]-x-{NDEKRH}-[LIVMATE]-[KRNEQTA]-C (new); (3) [ED]-K-L-L-[DE]-R-[VIA]-[IV]-x-H-[AT]-E-L (new); (4) C-F-[KRH](2)-[DEN]-[LIVMAG]- [HKR](2)-[LIVM]-[DEQ]-[ST]-[FYLIVM]-x(0,1)> (new); {633, 1093, 45}

**(40) Tachykinin** (1) [AGSTQKRFY]-[SF]-[IVFYTHQ]-G-[LVIM]-M-G-[KR]-[RS]; (2) [AGSTQKRFY]- F-[IVLMFYSHQ]-G-[LVIMS]-R-G-K-R (new); (3) <x(0,9)-F-[IVLMFYTHQ]-G-[LVIMSTAG]-[RM]> (new); {104, 124,6}


**(41)Urotensin II** (1) C-F-W-K-Y-C (identical); {30, 30,1}

**(42) Endothelin** (1) C-{C}-C-{C}(4)-D-{C}(2)-C-{C}(2)-[FY]-C; {50, 104, 2}

**(43) Agouti** (1) C-{C}(6)-C-{C}(6)-C-C-{C}(2)-C-{C}(2)-C-{C}-C-{C}(5,6)-C-{C}-C-{C}(6,9)-C; (2) C- {C}(6)-C-{C}(6)-C-C-{C}(2)-C-{C}(2)-C-{C}-C-{C}(5,6)-C-{C}-C-{C}(0,8)> (new); (3) C-{C}(6)-C-{C}(6)-C-C-{C}(2)-C-{C}(2)-C> (new); (4) C-{C}(6)-C-{C}(6)-C-C-{C}(2)-C-{C}(2)-C-{C}-C-{C}(5,6)-C(0,1)> (new); {37, 37, 7}

#### **Antimicrobial**

**(44) Cecropin** (1) W-[KDN]-{QNDEGAKRW}-[FYGA]-K-[KRE]-[LIVM]-E-[RKHAGN]-x-[AGVI]; (2) [GS]-[WRKHG]-[LIVMST]-[KRST]-K-{QNDEGAKRW}-[FYGA]-K-[KRED]-[LIVM]-E-[RKHAGN]-x- [AGVI] (new); {96, 96, 3} {Q5TWE5}

**(45) Mammalian defensins** (1) C-{C}-C-{C}(3,5)-C-{C}(6)-{CP}-[GARKSTW]-x-[SC]-{C}(6,10)-C-C; (2) C-[PR]-x-C-x(2,5)-C-x(2)-C-[PQ]-x-C-[PQ]-x-C (new); {119, 145, 5}

**(46) Arthropod defensins** (1) [CG]-x(0,1)-{C}-{CQ}-[HNSEDRY]-C-x(3)-{C}(0,1)-[GR]-{A}-x- [GRQAY]-[GAL]-x-C-{FY}-x(3,4)-C-{C}-C; (2) [CG]-x(0,1)-{C}(2)-[HNSEDRY]-C-x(3)-{C}(0,1)-[GR]- {A}-x-[GRQAY]-[GAL]-x-C-{FY}-x(6)-C-{C}-C (new); {103, 105, 7}; {Q6XD83}

**(47) Cathelicidins** (1) Y-{LIVM}-[EDQN]-[AVI]-[LMVI]-{HKRG}-[RKHQ]-A-[LIVMA]-[DQGEN]-x- [LIVMFY]-N-[DEQ]; {58, 58, 0}

#### **Toxin**

**(48) Snake toxins** (1) C-{CKRPL}-x(0,2)-C-[PRTFG]- {C}(5)-x(0,6)-C-C-{P}-x-[PDEN]-x-C-[NDEY]; {352, 352, 20}

**(49) Myotoxins** (1) K-x-C-H-x-K-x(2)-H-C-x(2)-K-x(3)-C-x(8)-K-x(2)-C; {15, 15, 0}

**(50) Scorpion short toxin 1** (1) C-{C}(4,5)-C-{PC}-{CQ}-{C}-C-x(3)-{C}-{CPWA}-x(1,4)-[GASEDN]- [KRAVISNDE]-C-[VIMQTDK]-[NG]-x(1,2)-{P}-C-[HKRDENVI]-C; {77, 77, 6}

**(51) Alpha-conotoxin** (1) < x(0,35)-{C}(15)-C-C-[SHYNDE]-{C}(2,3)-C-{C}(3,7)-C-{C}(0,12)>; (2) <{C}(0,14)-C-C-[SHYNDE]-{C}(2,3)-C-{C}(3,7)-C-[G>]> (new); {34, 34, 1}

**(52) I-superfamily conotoxin** (1) C-{C}(6)-C-{C}(5)-C-C-{C}(1,3)-C-C-{C}(2,4)-C-{C}(3,10)-C (identical); {37, 37, 0}

**(53) Mu-agatoxin and spider toxin SFI** (1) C-{C}(2)-[DEKR]-{C}(3)-C-{C}(4,7)-C-C-{C}(2,4)-C-{C}-C- {C}(4,15)-C-{C}-C-x(0,10)>; {36, 36, 2}

**(54) Omega-atracotoxin (ACTX)** (1)C-[IT]-P-S-G-Q-P-C (identical); (2)C-C-[GE]-[ML]-T-P-x-C (identical); {13, 13, 0}

**(55) Ergtoxin** (1) C-{C}(5)-C-x(8)-C-{C}(2)-C-C-x(9)-C-x(4)-C-{C}-C {25, 25, 0}

Table 1. The conserved peptide patterns similar to PROSITE signatures.

#### **Cytokines and growth factors**

**(1) Interferon gamma** (1) [RHSG]-[KRQ]-A-[AGFYLIVM]-x-[DE]-[LIVFY]-{QPAG}-x-[VI]-[VMLIY]- {LVIM}-x(1,4)-L-[STAGPKRLIVM]-{Q}-x(1,9)-[AGKR]-[KR]-R; (2) [RHSG]-[KRQ]-A-[AGFYLIVM]-x- [DE]-[LIVFY]-{QPAG}-x-[VI]-[VMLIY]-{LVIM}-x(1,4)-L-S-P-x(1,7)>; {91, 91, 44}

**(2) Interleukin\_3** (1) [CVLIM]-[LIVM]-P-x-[AGPST]-x(2)-[STAGDENRKH]-x(12,14)-[DE]-F-[RKQ]- {NDEAGQST}-K-L; {20, 20, 0}

**(3) Interleukin\_5** (1) [HDE]-x(2)-C-x(3)-[IVLM]-F-x-G-[LIVMST]-x(2)-L-x-[NST]; {23, 23, 1}

**(4) Interleukin\_12 alpha** (1) [KRHE]-[LM]-C-x(2)-[LM]-[KRHQ]-[AG]-x(3)-R-x(2)-T-x(2)-[KR]-x(3)-Y- [LMIV]; {34, 34, 7}

**(5) Interleukin\_15**(1)C-{C}(4)-[LM]-{C}-C-[FY]-[LIVFYQ]-x-[DE]-[LIVM]-x(2)-[LIVM]-x(2)-[ED]; {44, 44, 1}


**(6) Interleukin\_17** (1) [RLM]-{QKR}-[PS]-{P}-x-[LIVMFY]-{RKH}-{CP}-[AS]-x-C-x-[CHKRNDESTFY]-x-[GRKHFY]-C-[LIVM]; {47, 47, 4}

**(7) Interleukin\_18** (1) [EQ]-[SY]-S-[SL]-x(2)-[GS]-x-[FY]-L-[AST]-[CF]; {41, 41, 3}

**(8) Receptivity factor** (1) L-[LIVMPAG]-x(2)-[YF]-[LIVM]-x(2)-[QLIVM]-[GA]-x-P-[LIVMFY]-x- [DENHKRLIVM]-[PAG]-[DEAGST]-[FY]; {204, 204, 0}

**(9) GMF-beta** (1) [FY]-[LIVM](2)-x-[STAG]-[FYWH]-x(5)-[DE]-x(5)-P-[LIVM]-x(2)-[LIVM]-[FYWN]-x(2)-P; {29, 29, 1}; {Q9VJL6, Q29NM1}

**Hormones**

**(10) ACTH\_domain and opioid neuropeptides** (1) K-R-[YF]-G-G-F-[LIVMT]-[STGKRIV]- [AGKRSTLIVMPY]; (2) K-R-[YF]-G-G-F-[LIVMT]>; (3) K-[KN]-[YF]-G-G-F-M-[KR]; (4) <[YF]-G-G-F- [LIVMT]-[STGKRIV]-[AGKRSTLIVMPY]; (5){CFYWHM}-Y-x-[MIVSTFY]-{FY}-H-F-R-W; (6) <Y-x- [MIVSTFY]-{FY}-H-F-R-W; {397, 1045, 4}

**(11) FMRFamide and related neuropeptides** (1) {LCFY}-{LCFYQWST}-{LCFYQWH}- {LCDEFYKRQW}-[LVMI]-[MLIV]-R-F-G-K-R; (2) {LCFY}-{LCFYQWSTLIVM}-{LCFYQWHKR}- {LCDEFYKRQWLIVM}-[LM]-[MIV]-R-F-G-R-[ASPD]-{LCFYHKR}-{LCQST}; (3) <x(0,8)-[LVMI]- [MLIV]-R-F>; (4) {CLIVM}-{CAGLIVMW}-{QCFYLW}-[FY]-[MLIV]-R-F-G-K-R; (5) {CHIV}-x-{CQN}- {HIV}-{CLIVMY}-{CAGLIVMW}-{QCFYLWIV}-[FY]-[MLIV]-R-F-G-R-[DNESTAG]; (6) <x(0,9)-[FY]- [MLIV]-R-F>; (7) [AGED]-[LIVMFY]-Q-G-R-F-G-R-[DEN]; (8) P-[AGST]-[LIVM]-R-[MLIV]-R-F>; (9) N-Q-[VI]-R-F-G-K-R; (10) [STG]-[LVMI]-F-R-F-G-K-R; (11) [RD]-[QPH]-F-[FY]-R-F-G-[KR]-{FWYL}; (12) [RD]-[QPH]-F-[FY]-R-F>; (13) R-P-[VI]-G-R-F-G-[KR]-[RS]; (14) S-A-[LM]-A-R-F-G-[KR]-[RS]; (15) [PQ]-[HL]-[LMFY]-R-G-R-F-G-R; (16) [STNFYH]-[LQ]-P-Q-R-F-G-[KR]-{LC}; (17) F-M-[NH]-F-G-K-R; (18) [AGNQ]-[GLE]-P-[LI]-R-F-G-[KR]-{QLIVMAG}; (19) P-[RK]-P-L-R-F-G>; (20) [FL]-G-T-M-R-F-G-[KR]-[RS]; (21) Q-[WL]-[LMIV]-[AGKRST]-G-R-F-G-[KR]; (22) [GA]-[GA]-[FY]-[ST]-[FY]-R-F-G- [RK]; (23) [GA]-[GA]-F-[ST]-[FY]-R-F>; {214, 605, 2}; {Q7YWT6, Q622X3, Q61P51, Q616K2, Q613X6, Q21656, P34405, Q60ZQ9, Q618S3, Q620F8, Q620P9, Q7PUD4, Q618T6, Q705J7, Q3SXL4, Q3KNG4, Q60YH4, Q622X1, Q28Z02, Q297C5, Q28Z02}

**(12) Neuropeptide-like protein**\* (1) G-M-Y-G-G-[FYW]-G-R; (2) A-Q-[FW]-G-Y-G-[GY]-x(2)- [KRFYG]; (3) G-[FYW]-G-G-Y-G-G-Y-G-R-G; (4) P-L-Q-F-G-K-R; (5) [STRIV]-M-S-F-G-K-R; (6) [AGIV]-M-[AG]-F-G-K-R; (7) [DE]-K-R-G-G-A-R-A-[FYLIVM]; (8) R-x-G-[FML]-R-P-G-K-R; (9) [RFYM]-[AGTR]-F-A-F-A-K-R; {33, 84, 7}; {Q60NA1, Q619H9, Q624T4, Q61BN3, Q627I5, Q60MJ8, Q625G9, Q622L1, Q622L2}

**(13) Wamide neuropeptides**\* (1) [QRKED]-{P}-[KRPQN]-[IVP]-G-[LM]-W-G-R-[RDESA]; (2) [ANPRKQ]-x-[AGLQP]-[RHKLIVP]-G-[LM]-W-G-K-R; (3) K-[KR]-x(1,5)-W-x(6)-W-G-[KR]-R; {10, 86, 1} {Q7Q4X3, Q8T3G1, Q60TK2, Q2LZG9}

**(14) Thyroliberin** (1)[KR]-[HKR]-Q-H-P-G-[KR]-R; {12, 78, 1}

**(15) Neurotensin/neuromedin N** (1)[KR]-[IVTRK]-P-Y-I-L-K-R; (2) [KR]-[IVTRK]-P-Y-I-L>; {14, 24, 0}

**(16)Allatostatin**\* (1) [KR]-R-{NCKRFY}-x(0,11)-[FY]-[DENAGST]-[FY]-G-[LIVM]-G-[KR]-R; (2) <x(0,11)-[FY]-[DENAGST]-[FY]-G-[LIVM]>; (3) [KR]-R-x(0,3)-[FY]-[DENAGST]-[FY]-G-[LIVM]>; {52, 222, 3}; {Q7QAG2, Q29BZ8}

**(17) Egg-laying hormone** (1) K-R-R-[LIVM]-R-F-[HNY]-[KR]-R; (2) P-R-[LIVM]-R-F-[HNY]-[PSTDEN]-x-[KRG]-[KR]-[KR]; (3) P-R-[LIVM]-R-F-[HNY]-[PSTDEN]-x(1,2)>; {21, 32, 2}

**(18)Periviscerokinin** (1)<x(0,1)-[AG]-x(0,3)-[GS]-[LIVM]-[LIFY]-x-[FYAMV]-[AGPM]-R-x>;{59, 59, 0}

**(19) Somatostatin** (1) C-[KRM]-[NSIV]-[FY]-[FY]-W-[KRDE]-[STG]-x-[ST]-x-C; {71, 71, 2}

**(20) Orcokinin**\* (1) [KR]-R-N-F-[DE]-[DE]-[IV]-[DE]-[KR]; (2) <N-F-[DE]-[DE]-[IV]-[DE]-[KR]; {3, 22, 0}; {Q7Q025, Q7QNH4, Q9W1F8, Q292P8}

**(21) Allatotropin**\* (1)N-x(4)-[STIV]-A-R-G-[FY]-G-[KR]-R; (2)N-x(4)-[STIV]-A-R-G-[FY]>; {15, 18, 1}; {Q7QKW9, Q7PZX1}


**(22) Ghrelin and motilin-related peptide** (1) G-[STL]-[ST]-F-[LIVM]-[ST]-P-x(0,1)-[AGSTDE]-[FYQHM]-[QRK]; (2) [FY]-[VILM]-P-x-[FY]-[TS]-x(2)-[DE]-[LIVM]-[QRK]-[RK]-x-[QRK]-[ED]-[KR]; {68, 68, 12}

**(23) ADM** (1) [AG]-C-{P}-x-[AGFY]-[STMLIV]-C-[AGQIVT]-[VMLIFYHKR]-[QH]-x-[LIVM]; {23, 23, 1}; {Q4RDH7, Q6IFS9}

**(24) Hepcidin**\* (1) C-[CGW]-x-C-C-{C}(4,5)-[CG]-G-x-C-C; {44, 44, 1}; {Q4RUL1, Q4RUL2}

**(25) Achatin**\* (1) K-R-G-F-[AGF]-[DG]-K-R; (2) <G-F-[AGF]-[DG]>; {5, 20, 0}

**(26) Cocaine- and amphetamine-regulated transcript protein** (1) C-x-C-x(5)-C-x(3)-[LIVM]-L-K-[C>]; {11, 11, 2}; {Q4RMR3, Q568S2, Q68EU1, Q4SGG2, Q4T695, Q4TBI3}

**(27)Bradykinin** (1) P-[PAT]-G-[FW]-[ST]-P-[FL]-R; {58, 84, 7}; {Q5XJ76}

**(28) GBP/PSP1/paralytic** (1) N-[FY]-x(2)-[GA]-C-x(2)-[GA]-[FY]-x-[RK]-[TS]-x-[DE]-[GA]-[RK]-C-[KR]-x-[TS]; {18, 18, 0}

**(29) Stanniocalcin** (1) C-L-x(2,6)-[GA]-C-x(2,5)-F-x-C-x(4)-[ST]-[CS]; {45, 45, 1}

**(30) Resistin** (1) C-x-C-x(3)-C-x(2)-W-x(7)-C-x-C-x-C-x(4)-W-x(4)-C-C; {22, 22, 2}

**(31) Pro-MCH** (1) [RK]-R-x(2,6)-[LMIV]-x-C-[MLIV](2)-[GA]-[RK]-[VLIM]-[FY]-x(2)-C-W; (2) R-[ED]-x(2)-[DE](3)-N-[ST]-[AG]-x-[FY]-[PK]-[IV]-[GD]-[RK]-R; {29, 39, 4}

**(32) Pigment dispersing hormone** (1) K-R-N-[ST]-[DEGA]-[LIVM](2)-N-[STAG]-[LIVM](2); (2) <N-[ST]-[DEGA]-[LIVM](2)-N-[STAG]-[LIVM](2); {21, 21, 1}; {Q298P6}

**(33) Orexin** (1) [HQ]-A-A-G-[IV]-L-T-[LIVM]-G-[KR]-R; (2) [HQ]-A-A-G-[IV]-L-T-[LIVM]>; {11, 18, 0}

**(34) Leucokinin**\* (1) [PQAGSTKRH]-x-F-[HYN]-[AGSP]-W-[GA]-G-K-R; (2) <x-[PQAGSTKRH]-x-F-[HYN]-[AGSP]-W-[GA]>; {11, 11, 0}; {Q60MR3, Q8MNU5}

**(35)Myomodulin**\* (1) [LIVM]-[HQPST]-M-L-R-L-G-K-R; {3, 29, 0}

**(36)Nitrophorin** (1) C-[ST]-x(9,10)-[KRH]-x(2)-[FYW](2)-x(3,4)-[FYW](2)-x-[TS]-x-[FY]-x(4,5)-[PTS]; {11, 11, 1}

**(37) Prokineticin** (1) Q-C-x(4)-[CFY]-C-x(2)-[ST]-x(3)-[KR]-x-[LIVM]-[RK]-x-C-x-P-x-[GA]-x(2)-[GA]-x(2)-C-[HYF]-P; {35, 35, 1}

**(38) Leptin** (1)L-x-[VIT]-[FY]-[QRH]-[QKA]-[IV]-[LIVMH]-x-[SNG]-[LM]-[PHQS]; {68, 68, 13}

#### **Antimicrobial**

**(39) Bombinin** (1) K-R-[LIVM](2)-G-P-[LIVM](2)-x(2)-[VILM]-[STG]-x(2)-[LIVM]-x(2)-[LIVM](2); (2) <[LIVM](2)-G-P-[LIVM](2)-x(2)-[VILM]-[STG]-x(2)-[LIVM]-x(2)-[LIVM](2); (3) [SG]-I-G-x(0,3)-[LIV]-x(2,7)-K-[STAGIV]-[AGFYIV]-[LIVF]-[KR]-[GAC]-[AGFYL]-[AGLVIM]-[KRN]; {59, 110, 0}

**(40) Brevinin, Dermaseptin, Aurein, Caeridin, Caerin, Dahlein, Temporin, Ponericin and Uperin** (1) <x(7)-{C}(2)-x(0,68)-C-[KSTAGLVE]-[LIVA]-[STAKYD]-[KRYGN]-[KRDESTQLG]-C>; (2) C-[KSTAGLVE]-[LIVA]-[STAKYD]-[KRYGN]-[KRDESTQLG]-C-R-x>; (3) <[DGA]-[LIVF]-[LIVMFW]-[DNESAGQKPLM]-[STLIVMKFAGDN]-[LVIMAGTF]-[KRAGSTVIL]-[KRHDENGASTQ]-[LIVMAGKFYSTW]-[IVLMAGFKRH]-[AGKRHSTDENQLIV]-{W}-x(0,2)>; (4) <[DGA]-[LIVF]-[LIVMFW]-[DNESAGQKPLM]-[STLIVMKFAGDN]-[LVIMAGTF]-[KRAGSTVIL]-[KRHDENGASTQ]-[LIVMAGKFYSTW]-[IVLMAGFKRH]-[AGKRHSTDENQLIV]-{W}-{CP}(2)-x(0,35)>; (5) <x(0,45)-{QAGR}-{FYLQKRST}-K-R-[DGA]-[LIVFW]-[LIVMFW]-[DNESAGQKPLFM]-[STLIVMKFAGDN]-[LVIMAGTFY]-[KRAGSTVIL]-[KRHDENGASTQ]-[LIVMAGKFYSTW]-[IVLMAGFYKRH]-[AGKRHSTDENQLIV]-{W}-x(0,37)>; (6) <x(0,1)-[FIVLM]-[LIVMFYST]-[PGAQ]-x-[LIVMFY]-[AGSTIVLM]-[KRSTNDEMLIV]-[LIVMAGFY](0,1)-[LIVMAG](0,1)-x(0,2)-[GKRDEST]-[LIVM](2)>; (7) K-R-[FIVLM]-[LIVMFYST]-[PGAQ]-x-[LIVMFY]-[AGSTIVLM]-[KRSTNDEMLIV]-[LIVMAGFY](0,1)-[LIVMAG](0,1)-x(0,2)-[GKRDEST]-[LIVM](2)-G-K>; {278, 310, 25}

**(41) Dermorphin** (1) K-R-Y-A-F-x-[YVLI]-[PVILM]-x-[RG]; (2) <Y-A-F-x-[YVLI]-[PVILM]-x>; {6, 22, 0}

**(42)Termicin**\* (1) C-x(4)-C-W-x(2)-C-x(12)-C-x(4)-C-x-C; {21, 21, 0}

**(43)Liver-expressed antimicrobial** (1) [KR]-P-x(4)-C-x(5)-C-x(3)-[LIVM]-C-[KR]-x(2)-[RKHQ]-[CQ]; {15, 15, 0}; {Q4SXZ9, Q5M9I7}

**(44) Penaeidin** (1) [CR]-x(1,3)-C-{C}(2)-[LIVM]-{C}(7)-[CYF]-[CST]-{C}(3)-[GA]-x-C-C; {40, 40, 0}

**(45) Ceratotoxin**\* (1) [ST]-[LIVM]-[GA]-[ST]-[AG]-x-[KR]-[KR]-[AG]-[LIVM]-P-[LIVM]-[AG]-[KR](2); {10, 10, 3}

**(46) Attacin** (1) [GTS]-[AGVMLI]-[AGFYST](0,1)-[FYLIV]-[AGDEL]-{GMQWKRHNDE}-{PKR}-[NKG]-[ADENHIV](0,1)-[NDEKR](0,1)-[GSR]-[HFL]-[GAS]-[GAL]-[STAED]-[LIVM]-[TSMQ]-[KRHDNEGA]-[TSEAG]-[HKRQGT]; (2) Y-x-Q-[KRH]-L-[PG]-G-P-Y-G-N-S-x-P; {50, 50, 1}; {Q290V6, Q291C0, Q295K8, Q29QF8, Q29QG5}


**(47)Beta-defensin** (1) <x(0,79)-{WP}-x-C-{C}-{CP}-{CW}-{CA}-{C}(0,4)-C-{CP}-{C}-{CW}-{C}(0,2)-C- {C}(3)-{CP}(2)-{C}(2)-{CP}-{C}(1,5)-C-{C}(0,3)-{C}(4)-C-C-{CDENFWYP}-x(0,128)>; {326, 326, 13}; {Q32P86, Q2XXN6, Q2XXN7, Q2XXN8, Q2XXN9}

**(48) 4 kDa defensin**\* (1) G-[CGA]-P-x(2)-[HQP]-x(2)-[CRK]-[DE]-x-[HP]-[CRWK]-[KR]-G-[MLIVEDN]; {27, 27, 0}

#### **Toxin**

**(49)Conotoxin scaffold III/IV, muconotoxin and M conotoxin** (1) <x(0,62)-{C}-x(2)-{C}(10)-C-C- {C}(2,6)-C-{C}(2,5)-C-{C}(1,5)-C-{C}(0,3)-C-{C}(0,3)>; (2) <{C}(0,9)-C-C-{C}(2,6)-C-{C}(2,5)-C-{C}(1,5)- C-{C}(0,1)-C-{C}(0,3)>; {62, 62, 0}

**(50)Conotoxin scaffold IX and tau conotoxin** (1) <x(0,49)-{C}(12)-{CDEFY}-{C}(2)-C-C-{C}(4,7)-C- {C}(0,2)-C-{C}(0,9)>; (2) <{C}(0,14)-C-C-{C}(4,7)-C-{C}(0,2)-C-{C}(0,9)>; {80, 80, 1}

**(51)Conotoxin scaffold VI/VII, four-loop conotoxin, Spider potassium channel inhibitory toxin, O superfamily** (1) <x-{PA}-x(0,17)-{C}(0,21)-{C}(2)-{CQ}-{C}(11)-{CI}-{CP}-{C}-{CH}-C-{C}(3,6)-C-{QC}- {C}(3,9)-C-C-{C}(2,8)-C-{CQ}-{C}(2,9)-C-{C}(0,9)>;(2)<{C}(0,16)-{CQ}-C-{CI}-{C}(2,5)-C-{QPC}-{C}- {CY}(2)-{C}(0,6)-C-C-{C}(2,8)-C-{CQ}-{C}(2,9)-C-{C}(0,9)>; (3) <C-{CI}-{C}(2,5)-C-{QPC}-{C}-{CY}(2)- {C}(0,6)-C-C-{C}(2,8)-C-{CQ}-{C}(2,9)-C-{C}(0,9)>; {408, 408, 25}

**(52) Scorpion toxin** (1) [CKDEN]-{C}(3)-[CI]-{CDEN}-{C}(2)-C-{C}(3)-C-{C}(6,10)-G-{C}(1,2)-[CF]-x-{C}(3,11)-C-[WYF]-C; (2) [CKDEN]-{C}(3)-[CI]-{CDEN}-{C}(4,9)-C-{C}(3)-C-{C}(6,10)-G-{C}(1,2)-[CF]-x-{C}(3,11)-C-[WYF]-C; {223, 223, 14}; {Q2TSD9}

**(53) Scorpion short toxin 2** (1) C-x-P-C-x(10)-C-x(2)-C-C-x(5,7)-C-x(2,3)-Q-C-[LIVM]-C; {14, 14, 0}

**(54) Anemone neurotoxin** (1) C-x-C-{C}(4)-P-x(6,8)-G-x(5,13)-C-x(6,9)-C-x(6,9)-C-C; {25, 25, 0}

**(55) Melittin** (1) [LIVM]-[GA]-x(2)-[LIVM]-[KR]-[LIVM](2)-x(3)-[LIVM]-P-x-[LIVM](2)-x-W-[LIVM]; {11, 11, 0}

Table 2. The novel conserved peptide patterns.

Note: each family is described in the following items: (1) the name of the family; (2) all identified patterns; patterns marked with 'identical' are completely identical to their PROSITE counterpart and the ones marked as 'new' are novel to PROSITE in Table 1; (3) the number of true positive peptide or precursor proteins, the number of matches to the pattern, and the number of false negative hits, all these numbers are in a bracket; (4) if there are novel putative peptides or precursors predicted by the patterns of the family, they are listed in a second bracket.
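The patterns in Table 2 are written in PROSITE-style notation: `[..]` lists the accepted residues, `{..}` lists excluded residues, `x` is any residue, `(n)` or `(n,m)` is a repeat count, and `<` and `>` anchor a pattern to the N- or C-terminus. As a minimal sketch, assuming only these constructs occur, such a pattern can be translated into a Python regular expression. The function name `prosite_to_regex` is hypothetical and is not part of the authors' software.

```python
import re

# One pattern element: a residue set, an exclusion set, a single residue or x,
# optionally followed by a repeat count such as (2) or (0,8).
ELEMENT = re.compile(r"^(\[[A-Z>]+\]|\{[A-Z]+\}|[A-Z]|x)(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.replace(" ", "")          # tolerate stray spaces
    prefix = "^" if pattern.startswith("<") else ""
    suffix = "$" if pattern.endswith(">") else ""
    core = pattern.lstrip("<").rstrip(">")
    parts = []
    for element in core.split("-"):
        m = ELEMENT.match(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        symbol, low, high = m.groups()
        if symbol == "x":
            atom = "."                           # any residue
        elif symbol.startswith("{"):
            atom = "[^" + symbol[1:-1] + "]"     # excluded residues
        elif symbol.startswith("[") and ">" in symbol:
            # '>' inside brackets: the listed residues or the C-terminus
            atom = "(?:[" + symbol[1:-1].replace(">", "") + "]|$)"
        else:
            atom = symbol                        # residue set or single residue
        if low is not None:
            atom += "{" + low + ("," + high if high else "") + "}"
        parts.append(atom)
    return prefix + "".join(parts) + suffix


opioid = prosite_to_regex("[KR]-[KR]-Y-G-G-F-[ML]-[KR]-[KR]")
hit = re.search(opioid, "LAKRYGGFMKKDAEE")     # fragment taken from Fig. 2
print(opioid, hit.group(0))
```

The sketch raises an error on any element it does not recognise rather than guessing, so a malformed table entry is reported instead of being matched silently.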

#### **6. Case study**

Patterns representing the family of opioid and POMC-derived peptides and the family of FMRFamide and related neuropeptides (FARPs) are presented here as test cases. They provide insight into the conserved sequence characteristics of many known peptide families and show how the peptide patterns deduced from these characteristics perform.


#### **6.1 Opioid and POMC-derived peptides**

The family includes subfamilies of opioid peptides and pro-opiomelanocortin (POMC) proteins. Proteins in this family vary widely in length, from large precursors of a few hundred amino acids, e.g. Q805B5 in *Chimaera phantasma* (325 aa), to short peptides or partial sequence fragments, e.g. Q7M2Z6 in sheep (13 aa).

#### **6.1.1 The subfamily of opioid peptides**

Opioid peptides are neuropeptides involved in pain control mechanisms in vertebrates; the subfamily comprises proenkephalin (PENK), nociceptin (PNOC) and prodynorphin (PDYN) (Comb et al., 1982). The 41-column PROSITE pattern PS01252 'C-x(3)-C-x(2)-C-x(2)-[KRH]-x(6,7)-[LIF]-[DNS]-x(3)-C-x-[LIVM]-[EQ]-C-[EQ]-x(8)-W-x(2)-C' matches 39 UniProt proteins. However, it misses the remaining 92 sequences of the subfamily, including nine full peptide precursors, e.g. zebrafish Q7T3L0, and 83 peptides or sequence fragments, e.g. human Q9BYY3.

The subfamily is also described by the 71-column Pfam motif PF01160. When this motif is searched against all proteins in the subfamily using both the global (ls) and fragment (fs) search modes (http://www.sanger.ac.uk/Software/Pfam/search.shtml), 78 precursors are recognized. However, the other 53 opioid proteins, e.g. cat Q28409, zebrafish Q8AX66 and Q9W687 from *Acipenser transmontanus*, do not score above the gathering threshold of the Pfam motif.

The proteins missed by the Pfam motif were investigated further by comparing them with all proteins in the non-redundant protein sequence database nr using BLAST (http://www.ncbi.nlm.nih.gov/BLAST). The alignments with Q28409 (Fig. 2) reveal that, while the similarity between the two mammalian precursors Q28409 and P01210 extends along the entire sequences, the resemblance between Q28409 and Q8AX66/Q4RIZ7 from the distantly related Actinopterygii is confined to a limited region described by '[KR]-[KR]-Y-G-G-F-[ML]-[KR]-[KR]'. The same few highly conserved amino acids are also observed in the alignments of Q9W687 with Q5Y3C6 from Chondrichthyes and Q6SYA7 from Dipnoi (Fig. 3). However, this conserved region is too short to produce a significant score, so BLAST comparison alone fails to detect, at an acceptable confidence level, the limited similarity preserved among distant homologues.
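Why a short conserved region cannot reach significance can be illustrated with the standard BLAST relation between bit score and expectation value, E = m·n·2^(−S′), where m·n is the effective search space. Within a single search m·n is the same for every hit, so one reported hit can be used to rescale another. The sketch below is an illustration under that assumption (it ignores edge-effect corrections and the rounding of reported values). The function names and the 0.01 cut-off are hypothetical choices, not taken from the BLAST software.

```python
import math


def rescale_evalue(e_ref: float, bits_ref: float, bits_new: float) -> float:
    """Predict the E-value of a hit from a reference hit of the same search."""
    return e_ref * 2.0 ** (bits_ref - bits_new)


def bits_needed(e_ref: float, bits_ref: float, e_target: float) -> float:
    """Bit score required to reach e_target in the same search space."""
    return bits_ref + math.log2(e_ref / e_target)


# Fig. 3, query Q9W687: the 11-residue hit scores 33.7 bits (Expect = 1.5),
# the longer Q5Y3C6 alignment scores 39.2 bits (Expect = 0.032).
predicted = rescale_evalue(e_ref=1.5, bits_ref=33.7, bits_new=39.2)
required = bits_needed(e_ref=1.5, bits_ref=33.7, e_target=0.01)
print(f"{predicted:.3f}")   # prints 0.033, close to the reported 0.032
print(f"{required:.1f}")    # prints 40.9
```

Under these assumptions the alignment covering only 'EEVEKRYGGFM' would need roughly seven more bits to reach an illustrative cut-off of 0.01, which an 11-residue match cannot supply.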

The existing PROSITE pattern and the Pfam motif both characterize only the conserved N-terminal region of the peptide precursors. They are therefore not sufficient to identify all short bioactive opioid peptides or sequence fragments that are cleaved from their large precursors: these no longer carry the N-terminal part of the protein, yet retain the crucial conserved peptide sequence region and preserve the fundamental function of the peptide subfamily. Thus, although sequences such as Q28409, Q8AX66 and Q9W687 cannot be identified by the existing motifs, they all share the pattern '[KR]-[KR]-Y-G-G-F-[ML]-[KR]-[KR]' from our 'PeptideMotif' database. Because this pattern is derived from the bioactive peptide sequences themselves, it could be more functionally conserved and more effective in identifying opioid peptides or entire precursor proteins.
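One practical detail when counting hits of this pattern is that neighbouring copies can share their flanking basic residues, as in the 'KKYGGFMKRYGGFMKK' stretch of Q28409 (Fig. 2). A plain regular-expression scan consumes the shared residues and reports only the first copy. The sketch below, which is an illustration and not the authors' implementation, uses a lookahead so that overlapping occurrences are also reported.

```python
import re

# Regex equivalent of '[KR]-[KR]-Y-G-G-F-[ML]-[KR]-[KR]'; the lookahead lets
# the scan restart one residue after each match start, so overlaps are kept.
OPIOID = re.compile(r"(?=([KR][KR]YGGF[ML][KR][KR]))")


def overlapping_hits(sequence: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for every occurrence."""
    return [(m.start(1), m.group(1)) for m in OPIOID.finditer(sequence)]


fragment = "HALRKKYGGFMKRYGGFMKKMDELYPQE"   # part of Q28409, taken from Fig. 2
hits = overlapping_hits(fragment)
plain = re.findall(r"[KR][KR]YGGF[ML][KR][KR]", fragment)
print(hits)    # two overlapping copies
print(plain)   # the non-overlapping scan finds only the first copy
```

Whether overlapping copies were counted separately in the match totals of Table 2 is not stated in the text, so the choice between the two scans is left open here.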

#### **6.1.2 The subfamily of POMC-derived peptides**

The subfamily shares similar peptide sequences with opioid precursors, but also contains other non-opioid peptides such as ACTH and alpha-MSH, which are involved in the stress response and stimulate corticosteroid release (Arends et al., 1998).


```
Query=Q28409|PENK_FELCA Proenkephalin A-Felis silvestris catus(Mammalia) Length=187
> P01210|PENK_HUMAN Proenkephalin A precursor - Homo sapiens
(Mammalia) Length=267 Score = 429 bits (1004), Expect = 1e-118
Query WETCKEFLKLSQLEIPQDGTSALRESS-PEESHALRKKYGGFMKRYGGFMKKMDELYPQE
 WETCKE L+LS+ E+PQDGTS LRE+S PEESH L K+YGGFMKRYGGFMKKMDELYP E
Sbjct WETCKELLQLSKPELPQDGTSTLRENSKPEESHLLAKRYGGFMKRYGGFMKKMDELYPME
Query PEEEAP-AEILAKRYGGFMKKDAEEEEDALASSSDLLKELLGPGETETAAAPRGR-----
 PEEEA +EILAKRYGGFMKKDAEE+ D+LA+SSDLLKELL G+ R R
Sbjct PEEEANGSEILAKRYGGFMKKDAEED-DSLANSSDLLKELLETGDN------RERSHHQD
Query ---DDEDVSKSHGGFMRALKGSPQLAQEAKMLQKRYGGFMRRVGRPEWWMDYQKRYGGFL
 ++E+VSK +GGFMR LK SPQL EAK LQKRYGGFMRRVGRPEWWMDYQKRYGGFL
Sbjct GSDNEEEVSKRYGGFMRGLKRSPQLEDEAKELQKRYGGFMRRVGRPEWWMDYQKRYGGFL
Query KRFADSLPSDEEGESYS
 KRFA++LPSDEEGESYS
Sbjct KRFAEALPSDEEGESYS
> Q8AX66|Q8AX66_BRARE Proenkephalin (Fragment) - Brachydanio rerio
(Actinopterygii) Length=216 Score = 140 bits (324), Expect = 9e-32
Query KKYGGFMKRYGGFMKKMDELYPQEPEEEAPAEILAKRYGGFMKKDAE----EEED-----
 KKYGGFMKR +E L KRYGGFMKK AE E ED
Sbjct KKYGGFMKR---------------------SESLIKRYGGFMKKAAEFYGLESEDVDQGR
Query ALASSSDLLKELLGP-----GETETAAAPRGRDDED-VSKSHGGFMR-----ALKGSPQL
 A+ ++ D+ E+L GE E AA R + E+ +K +GGFMR AL
Sbjct AILTNHDV--EMLANQVEADGEREEAALTRSKGGEEGTAKRYGGFMRRGGLYAL------
Query AQEA--KMLQKRYGGFMRRVGRPEWWMDYQ--KRYGGFLKRFADSLPSDEEGE
 E+ + LQKRYGGFMRRVGRP+WW Q KRYGGFLKR S E+ E
Sbjct --ESGVRELQKRYGGFMRRVGRPDWW---QESKRYGGFLKR------SQEQDE
> Q4RIZ7|Q4RIZ7_TETNG Chromosome undetermined SCAF15040 - Tetraodon nigroviridis
(Actinopterygii) Length=246 Score = 123 bits (283), Expect = 2e-26
Query KKYGGFMKRYGGFMKKMD------ELYPQEPEEEA--PAEIL------------------
 KKYGGFMKRYGGFM + D E +P +P+EE EIL
Sbjct KKYGGFMKRYGGFMSRRDVPEGALE-HPSDPDEEENIRLEILKILNAAAVHGSEGGGKAG
Query --AKRYGGFMKKDAEEEEDALASSSDLLKELLGPGETETAAAPRGRDDEDVSKSHGGFMR
 KRYGGFM++ AEE A+ DLL+ +LG R
Sbjct EEGKRYGGFMRR-AEEG----AAQGDLLEAVLG--------------------------R
Query ALKGSPQLAQEAKMLQKRYGGFMRRVGRPEW--------------WM---DYQKRYGGFL
 LK KRYGGFMRRVGRPEW W D QKRYGGF+
Sbjct GLK-------------KRYGGFMRRVGRPEWLVDSSKRGGVLKRAWGSDNDLQKRYGGFM
```

Fig. 2. Sequence alignments between Q28409 and P01210/Q8AX66/Q4RIZ7 by BLAST. Notes: the conserved opioid peptide sequence similarities are in bold.

No PROSITE signature represents the subfamily. Three Pfam motifs describe the proteins: PF08384 (45 columns), PF00976 (41 columns) and PF08035 (31 columns). These motifs capture separate conserved regions located, respectively, at the N-terminus of the precursors after removal of the signal peptide, at the sequence coding for ACTH, and at the sequence coding for the 'beta-endorphin' peptides. However, the remaining parts of the precursors, which encode the gamma-MSH (12 aa) and beta-MSH (17 aa) peptides, are left uncovered. As a result, 27 mature peptides or sequence fragments, e.g. Q9PRN3 from the sea lamprey, horse P01202 and leech P41989, cannot be detected by any of these Pfam motifs.

```
Query= Q9PRN3|Q9PRN3_PETMA Melanotropin MSH-B - Petromyzon marinus
(Hyperoartia) Length=20
> Q2L6A9|Q2L6A9_MORMR Proopiomelanotropin (Fragment) - Mordacia mordax
(Hyperoartia) Length=154 Score = 51.5 bits (114), Expect = 5e-06
Query 2 QESADGYRMQHFRWGQPLP 20
 QE+ D YR+QHFRWG+PLP
Sbjct 11 QENPDAYRIQHFRWGEPLP 29
> P01193|COLI_MOUSE Corticotropin-lipotropin precursor(Proopiomelanocortin)
(POMC) - Mus musculus (Mammalia) Length=235
Score = 32.5 bits (69), Expect = 2.4
Query 3 ESADG-YRMQHFRWGQP 18
 E DG YR++HFRW P
Sbjct 183 EKDDGPYRVEHFRWSNP 199
Score = 30.8 bits (65), Expect = 7.7
Query 8 YRMQHFRWGQP 18
 Y M+HFRWG+P
Sbjct 125 YSMEHFRWGKP 135
Score = 22.3 bits (45), Expect = 2753
Query 8 YRMQHFRW 15
 Y M HFRW
Sbjct 77 YVMGHFRW 84
> Q32U15|Q32U15_9NEOB Proopiomelanocortin A (Fragment) - Trachycephalus jordani
(Amphibia) Length=82 Score = 23.1 bits (47), Expect = 1529
Query 8 YRMQHFRW 15
 Y M HFRW
Sbjct 23 YVMSHFRW 30
> Q53WY7|Q53WY7_HUMAN Proopiomelanocortin (Fragment) - Homo sapiens
(Mammalia) Length=30 Score = 22.3 bits (45), Expect = 2753
Query 8 YRMQHFRW 15
 Y M HFRW
Sbjct 3 YVMGHFRW 10
```
Fig. 4. Sequence alignments between Q9PRN3 and P01193/Q53WY7/Q32U15 by BLAST. Note: the conserved peptide sequence similarities are in bold.

The Clustal-W multi-alignment of all these FARP sequences together or within each of the seven phyla using default parameters (http://www.ebi.ac.uk/clustalw/) shows that the FARP precursors display sequence similarities within the mature peptide regions, particularly in the area containing the conserved peptide patterns, and that the remaining parts of the precursor sequences display rather low similarities. The FARP peptide precursors also differ from each other by the number of peptide repeat units within the sequences, which is thought to have arisen by unequal crossover events (Lee et al., 1998). In addition, we also observed that most of the mature FARP peptides share common C-terminal sequences but have much mutated N-terminal extensions. All these make it problematic to construct an accurate multiple alignment in order to derive a statistical

```
Query= Q9W687|Q9W687_ACITR Proenkephalin (Fragment)-Acipenser 
transmontanus (Actinopterygii) Length=45 
> Q5Y3C6|Q5Y3C6_HETPO Proenkephalin - Heterodontus portusjacksoni 
(Chondrichthyes) Length=264 Score = 39.2 bits (85), Expect = 0.032 
Query 14 RYDGFSKQ------PEHTDSKEITSEEV---EKRYGGFM 43 
 RY GF K+ P D EI S+EV EKRYGGFM
Sbjct 225 RYGGFMKRWNDILVPSDEDG-EIYSKEVPELEKRYGGFM 262 
Score = 31.2 bits (66), Expect = 8.7 
Query 14 RYDGFSKQPEHTDSKE--ITSEEVE----------KRYGGFM 43 
 RY GF K+ DS + I+ EV+ KRYGGFM
Sbjct 105 RYGGFMKK---ADSGDMYIS--EVDDENKGREILSKRYGGFM 141 
> Q6SYA7|Q6SYA7_PROAN Prodynorphin (Fragment) - Protopterus annectens 
(Dipnoi) Length=191 Score = 33.7 bits (72), Expect = 1.5 
Query 33 EEVEKRYGGFM 43 
 EE++KRYGGFM
Sbjct 169 EELQKRYGGFM 179
```
Fig. 3. Sequence alignments between Q9W687 and Q5Y3C6/Q6SYA7 by BLAST. Note: the conserved opioid peptide sequence similarities are in bold.

The BLAST alignment between Q9PRN3 and all proteins in the nr database reveals that, although Q9PRN3 cannot be identified by the Pfam motifs, it shares the highly conserved 'PeptideMotif' pattern 'Y-x-[MV]-x-H-F-R-W' with other POMC subfamily members, e.g., Q2L6A9 from Hyperoartia, P01193 and Q53WY7 from Mammalia, and Q32U15 from Amphibia (Fig. 4). This 8-column peptide pattern is part of the 41-column Pfam motif PF00976. While the sequence region described by this Pfam motif may be an entire functional or structural domain, the peptide pattern contained within the longer domain is probably its functionally most essential part.

In total, our procedure identifies six novel peptide patterns in the combination of these two subfamilies. Among all the 397 proteins in this family, 113 were found to contain two of the peptide patterns, and the rest match one of them. These patterns characterize conserved domains located at different regions of a precursor sequence, and each of them can exclusively represent an opioid or POMC peptide or its precursor protein.

#### **6.2 FMRFamide and related neuropeptides (FARPs)**

FARPs are widely known to occur throughout the animal kingdom, which makes this family an ideal test case for checking whether the discovered patterns can retrieve FARPs from all metazoan species (Ubuka et al., 2009). In total, 23 conserved peptide patterns have been uncovered from the family. They match 214 FARP sequences with 605 hits, because some precursor proteins contain multiple copies of the conserved patterns. The identified FARPs are distributed across a wide range of phyla, including Nematoda (85), Arthropoda (50), Mollusca (24), Annelida (9), Platyhelminthes (1), Cnidaria (10) and Chordata (35).
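As a quick consistency check on these figures, the short sketch below sums the per-phylum counts and derives the average number of pattern hits per matched sequence. All numbers are taken from the paragraph above; the variable names are illustrative only.

```python
# Per-phylum numbers of identified FARP sequences, as quoted in the text.
phylum_counts = {
    "Nematoda": 85,
    "Arthropoda": 50,
    "Mollusca": 24,
    "Annelida": 9,
    "Platyhelminthes": 1,
    "Cnidaria": 10,
    "Chordata": 35,
}

total_sequences = sum(phylum_counts.values())  # should equal the 214 matched sequences
total_hits = 605                               # pattern hits quoted in the text

# Average number of conserved-pattern copies per matched sequence.
hits_per_sequence = total_hits / total_sequences

print(total_sequences, round(hits_per_sequence, 2))
```

The per-phylum counts add up to the 214 matched sequences, and the ratio of hits to sequences (just under three) reflects the repeated peptide units within many FARP precursors.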

The 11-column Pfam motif PF01581 characterizes FARPs from all of the above-mentioned phyla except Chordata, e.g., human Q9HCQ7 and mouse Q9WVA8. In addition, in contrast to the 'PeptideMotif' patterns, the Pfam motif fails to detect with a significant score (e-value < 0.01) 49 FARP peptides or precursor proteins in the phyla it does characterize, e.g., Q9TWD2 from *Lymnaea stagnalis* and Q95QP2 from *Caenorhabditis elegans*.


Query= Q9PRN3|Q9PRN3\_PETMA Melanotropin MSH-B - Petromyzon marinus (Hyperoartia) Length=20

\> Q2L6A9|Q2L6A9\_MORMR Proopiomelanotropin (Fragment) - Mordacia mordax (**Hyperoartia**) Length=154

Score = 51.5 bits (114), Expect = 5e-06
Query 2 QESADGYRMQHFRWGQPLP 20
QE+ D **YR+QHFRW**G+PLP
Sbjct 11 QENPDAYRIQHFRWGEPLP 29

\> P01193|COLI\_MOUSE Corticotropin-lipotropin precursor (Proopiomelanocortin) (POMC) - Mus musculus (**Mammalia**) Length=235

Score = 32.5 bits (69), Expect = 2.4
Query 8 YRMQHFRWGQP 18
**Y M+HFRW**G+P
Sbjct 125 YSMEHFRWGKP 135

Score = 30.8 bits (65), Expect = 7.7
Query 3 ESADG-YRMQHFRWGQP 18
E DG **YR++HFRW** P
Sbjct 183 EKDDGPYRVEHFRWSNP 199

Score = 22.3 bits (45), Expect = 2753
Query 8 YRMQHFRW 15
**Y M HFRW**
Sbjct 77 YVMGHFRW 84

\> Q53WY7|Q53WY7\_HUMAN Proopiomelanocortin (Fragment) - Homo sapiens (**Mammalia**) Length=30

Score = 22.3 bits (45), Expect = 2753
Query 8 YRMQHFRW 15
**Y M HFRW**
Sbjct 3 YVMGHFRW 10

\> Q32U15|Q32U15\_9NEOB Proopiomelanocortin A (Fragment) - Trachycephalus jordani (**Amphibia**) Length=82

Score = 23.1 bits (47), Expect = 1529
Query 8 YRMQHFRW 15
**Y M HFRW**
Sbjct 23 YVMSHFRW 30

Fig. 4. Sequence alignments between Q9PRN3 and P01193/Q53WY7/Q32U15 by BLAST. Note: the conserved peptide sequence similarities are in bold.

The Clustal-W multiple alignment of all these FARP sequences, either together or within each of the seven phyla, using default parameters (http://www.ebi.ac.uk/clustalw/) shows that the FARP precursors display sequence similarities within the mature peptide regions, particularly in the area containing the conserved peptide patterns, whereas the remaining parts of the precursor sequences display rather low similarity. The FARP peptide precursors also differ from each other in the number of peptide repeat units within the sequences, which is thought to have arisen by unequal crossover events (Lee et al., 1998). In addition, we observed that most of the mature FARP peptides share common C-terminal sequences but have highly variable N-terminal extensions. All of this makes it problematic to construct an accurate multiple alignment from which to derive a statistical model that represents distantly related proteins from various phyla throughout the evolutionary history of the FARP peptide family.

### **7. Conclusion**

Protein domains are highly conserved throughout evolution, and several databases are available that catalogue protein families and domains. Such motif and domain databases are very useful in assigning a putative function to an unknown protein. Peptide precursor proteins are a distinctive class of molecules because they undergo various post-translational modifications in order to ultimately yield stable and functional mature peptides, which makes the annotation of peptides and peptide precursor proteins challenging. This is illustrated by the fact that many metazoan peptides and peptide precursors are not represented by the motifs currently present in widely used motif databases such as PROSITE.

Because of the rapidly increasing number of protein sequences and the wide range of peptide families, a comprehensive database of conserved patterns typical of endogenously occurring mature peptides is of great value for identifying new peptides and precursor proteins at a pace that keeps up with their sequencing rate. We have therefore designed a search procedure to find conserved patterns within the known peptides and, as a result, have constructed a 'PeptideMotif' database that is representative of most currently known peptide families.

Many peptides have been isolated and sequenced as mature peptides, and their precursor proteins are often still unknown. These small peptides are therefore difficult to identify with other motif databases. Databases such as Pfam contain two Hidden Markov Models (HMMs) for each family, both based on a multiple protein sequence alignment: one built to find complete domains (ls mode) and the other to match fragments of domains (fs mode) (Durbin et al., 1998). These motifs are sensitive in identifying complete domains and can thus efficiently detect proteins whose similarities cover the full-length protein sequence or that at least contain a complete domain. However, they do not work very well when they encounter short peptides, which lack information on the amino acids at sites outside the peptide sequence, or when the conserved regions are limited, especially in distantly related proteins where the overall sequence similarity may not be well preserved. In contrast, the patterns derived directly from the mature peptide sequences capture the highly preserved region of the precursor proteins and are thus able to identify not only the peptide precursor molecules but also the fully processed peptides.

Conserved peptide sequence patterns correspond to functionally and structurally important parts of the peptides, e.g., the binding site for specific receptors or the disulphide bonds that provide stability and tertiary structure. The discovery of peptide motifs will undoubtedly be of great value for any peptide-related study, ranging from the identification of putative peptides and precursor proteins, to the annotation of critical functional residues (Husson et al., 2010), to complementing peptidomic research in detecting and verifying peptides in vitro (Baggerman et al., 2004; Boonen et al., 2008; Menschaert et al., 2010). For example, scanning the peptide patterns against UniProt revealed 95 proteins (listed in Tables 1 and 2) that are not yet annotated as putative peptides or precursor proteins.

When determining short functional patterns for peptide sequences, we have to evaluate how representative the peptide motifs are in the 110 characterized peptide families. Short motifs often have some degree of degeneracy, and the presence of a motif in a protein may reflect a conserved functional role, an as yet undiscovered structural or functional role, or a nonfunctional role. With the short peptide patterns currently identified, and with false positives kept to zero, we observe that 440 (3.8%) of the mature peptides or sequence fragments and 282 (2.5%) of the peptide precursor proteins in the described families cannot be recognized by the peptide patterns. Many of them could be identified by combining the peptide pattern search procedure with the structural hallmarks of bioactive peptides and their precursors (Liu et al., 2006), such as the length of a peptide precursor, which is usually not longer than 500 amino acids, the presence of a signal peptide, which directs a precursor protein into the secretory pathway of the cell, and the presence of typical cleavage sites flanking the mature peptides. To recover the remaining false negatives while still excluding the false positives that arise from the short length and degeneracy of most short motifs, it may be possible to make use of 3D structural patterns when they become available for peptide precursor proteins. Patterns that integrate 3D structural information of the sequences will be more sensitive in identifying peptides and peptide precursors (Gribskov et al., 1988; Taylor et al., 2004).
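The structural hallmarks listed above can be combined into a simple filter, sketched below under stated assumptions. The 500-residue limit comes from the text. The use of the dibasic motifs KR and RR as "typical" cleavage sites is an assumption made for illustration, and the signal-peptide flag is expected to come from an external prediction step that is not shown. The function name `has_precursor_hallmarks` is hypothetical.

```python
import re

MAX_PRECURSOR_LENGTH = 500           # upper length bound quoted in the text
DIBASIC_SITE = re.compile(r"[KR]R")  # assumption: KR or RR stands in for a typical cleavage site


def has_precursor_hallmarks(sequence: str, has_signal_peptide: bool) -> bool:
    """Check the three structural hallmarks for one candidate sequence.

    `has_signal_peptide` must be supplied by the caller, for example from a
    separate signal-peptide predictor; it is not computed here.
    """
    short_enough = len(sequence) <= MAX_PRECURSOR_LENGTH
    has_cleavage_site = DIBASIC_SITE.search(sequence) is not None
    return short_enough and has_signal_peptide and has_cleavage_site


# Fragment of Q5Y3C6 (residues 225-262) copied from Fig. 3, alignment gap removed.
fragment = "RYGGFMKRWNDILVPSDEDGEIYSKEVPELEKRYGGFM"

# The signal-peptide flags below are hypothetical inputs, not real predictions.
print(has_precursor_hallmarks(fragment, has_signal_peptide=True))
print(has_precursor_hallmarks(fragment, has_signal_peptide=False))
```

A filter of this kind is deliberately permissive: it is meant to shortlist candidates that the peptide patterns miss, so that they can then be examined more closely.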

While the majority of known peptide families have been profiled by the established peptide patterns, the remaining ones, accounting in total for 251 peptides and precursor proteins (2% of all the proteins in the peptide-precursor database), are not processed by the pattern search procedure. They belong to small peptide families, such as the eclosion hormones, ecdysis-triggering hormones and apelin, which have only a few known homologues so far. A pattern based on a small number of peptides usually cannot represent the family with sufficient confidence, nor can it sufficiently reflect the sequence divergence accumulated by the family members in the course of evolution. As more peptides and precursor proteins are sequenced, our pattern search procedure can be applied to the corresponding families and the 'PeptideMotif' database will be updated accordingly, keeping the peptide pattern database widely applicable for the identification of critical functional residues and for the annotation of hypothetical molecules in various peptide families.

#### **8. References**



Altschul, S.F.; Madden, T.L.; Schaffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W. & Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. *Nucleic Acids Res,* Vol. 25, No. 17, pp. 3389–3402, ISSN 0305-1048

Arends, R.J.; Vermeer, H.; Martens, G.J.; Leunissen, J.A.; Wendelaar Bonga, S.E. & Flik G. (1998). Cloning and expression of two proopiomelanocortin mRNAs in the common carp (Cyprinus carpio L.). *Mol Cell Endocrinol,* Vol. 143, No. 1-2, (August 1998), pp. 23–31, ISSN 0303-7207

Baggerman, G.; Cerstiaens, A.; De Loof, A. & Schoofs, L. (2002). Peptidomics of the larval Drosophila melanogaster central nervous system. *J. Biol. Chem,* Vol. 277, pp. 40368–40374, ISSN 0021-9258

Baggerman, G.; Verleyen, P.; Clynen, E.; Huybrechts, J.; De Loof, A. & Schoofs, L. (2004). Peptidomics. *J Chromatogr B Analyt Technol Biomed Life Sci,* Vol. 803, No. 1, pp. 3–16, ISSN 1570-0232

Baggerman, G.; Liu, F.; Wets, G. & Schoofs, L. (2005). Bioinformatic analysis of peptide precursor proteins. *Ann N Y Acad Sci,* Vol. 1040, pp. 59–65, ISSN 0077-8923

Bateman, A.; Coin, L.; Durbin, R.; Finn R.D.; Hollich, V.; Griffiths-Jones, S.; Khanna, A.; Marshall, M.; Moxon, S.; Sonnhammer, E.L.L.; Studholme, D.J.; Yeats, C. & Eddy, S.R. (2004). The Pfam protein families database. *Nucl Acids Res,* Vol. 32, No. suppl 1, pp. D138–41, ISSN 0305-1048

Boonen, K.; Baggerman, G.; D'Hertog, W.; Husson, S.J.; Overbergh, L.; Mathieu, C. & Schoofs, L. (2007). Neuropeptides of the islets of Langerhans: a peptidomics study. *Gen Comp Endocrinol,* Vol. 152, No. 2-3, pp. 231–241, ISSN 0016-6480

Boonen, K.; Landuyt, B.; Baggerman, G.; Husson, S.J.; Huybrechts, J. & Schoofs, L. (2008). Peptidomics: the integrated approach of MS, hyphenated techniques and bioinformatics for neuropeptide analysis. *J Sep Sci,* Vol. 31, No. 3, pp. 427–445, ISSN 1615-9306

Boonen, K.; Husson, S.J.; Landuyt, B.; Baggerman, G.; Hayakawa, E.; Luyten, W.H. & Schoofs, L. (2010). Identification and relative quantification of neuropeptides from the endocrine tissues. *Methods Mol Biol,* Vol. 615, pp. 191–206, ISSN 1940-6029

Comb, M.; Seeburg, P.H.; Adelman, J.; Eiden, L. & Herbert, E. (1982). Primary structure of the human Met- and Leu-enkephalin precursor and its mRNA. *Nature*, Vol. 295, pp. 663–666, ISSN 0028-0836

Durbin, R.; Eddy, S.; Krogh, A. & Mitchison, G. (1998). *Biological sequence analysis: probabilistic models of proteins and nucleic acids,* Cambridge University Press, ISBN 9780521629713, Cambridge, UK

Filipsson, K.; Kvist-Reimer, M. & Ahren, B. (2001). The neuropeptide pituitary adenylate cyclase-activating polypeptide and islet function. *Diabetes,* Vol. 50, No. 9, pp. 1959–1969, ISSN 0012-1797

Finn, R.D.; Mistry, J.; Tate, J.; Coggill, P.; Heger, A.; Pollington, J.E.; Gavin, O.L.; Gunasekaran, P.; Ceric, G.; Forslund, K.; Holm, L.; Sonnhammer, E. L. L.; Eddy S. R. & Bateman A. (2010). The Pfam protein families database. *Nucleic Acids Res,* Vol. 38, No. suppl 1, pp. D211–D222, ISSN 0305-1048

Gribskov, M.; Homyak, M.; Edenfield, J. & Eisenberg, D. (1988). Profile scanning for three-dimensional structural patterns in protein sequences. *Comput Appl Biosci,* Vol. 4, No. 1, pp. 61–66, ISSN 0266-7061

Henry, J.; Favrel, P. & Boucaud-Camou, E. (1997). Isolation and identification of a novel Ala-Pro-Gly-Trp-amide-related peptide inhibiting the motility of the mature oviduct in the cuttlefish, Sepia officinalis. *Peptides,* Vol. 18, No. 10, pp. 1469–1474, ISSN 0196-9781

Hulo, N.; Sigrist, C.J.; Le, S.V.; Langendijk-Genevaux, P.S.; Bordoli, L.; Gattiker, A.; De Castro, E.; Bucher, P. & Bairoch, A. (2004). Recent improvements to the PROSITE database. *Nucl Acids Res,* Vol. 32, pp. D134–D137, ISSN 0305-1048

Husson, S.J.; Landuyt, B.; Nys, T.; Baggerman, G.; Boonen, K.; Clynen, E.; Lindemans, M.; Janssen, T. & Schoofs, L. (2009). Comparative peptidomics of Caenorhabditis elegans versus C. briggsae by LC-MALDI-TOF MS. *Peptides,* Vol. 30, No. 3, pp. 449–457, ISSN 0196-9781

Husson, S.J.; Clynen, E.; Boonen, K.; Janssen, T.; Lindemans, M.; Baggerman, G. & Schoofs, L. (2010). Approaches to identify endogenous peptides in the soil nematode Caenorhabditis elegans. *Methods Mol Biol,* Vol. 615, pp. 29–47, ISSN 1940-6029

Jonassen, I.; Collins, J.F. & Higgins, D.G. (1995). Finding flexible patterns in unaligned protein sequences. *Protein Sci,* Vol. 4, No. 8, pp. 1587–1595, ISSN 0961-8368

Lee, H.S.; Simon, J.A. & Lis, J.T. (1998). Structure and expression of ubiquitin genes of Drosophila melanogaster. *Mol Cell Biol,* Vol. 8, No. 11, pp. 4727–4735, ISSN 0898-7750

Liu, F. & Wets, G. (2005). A Neural Network Method for Prediction of Proteolytic Cleavage Sites in Neuropeptide Precursors. *Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference*, Vol. 3, pp. 2805–2808, ISSN 1557-170X, ShangHai, China, September 1-4, 2005

Liu, F.; Baggerman, G.; D'Hertog, W.; Verleyen, P.; Schoofs, L. & Wets, G. (2006). In silico identification of new secretory peptide genes in Drosophila melanogaster. *Mol Cell Proteomics,* Vol. 5, No. 3, pp. 510–522, ISSN 1535-9476

Marchler-Bauer, A.; Anderson, J.B.; Cherukuri, P.F.; Weese-Scott, C.; Geer, L.Y.; Gwadz, M.; He, S.; Hurwitz, D.I.; Jackson, J.D.; Ke, Z.; Lanczycki, C.J.; Liebert, C.A.; Liu, C.; Lu, F.; Marchler, G.H.; Mullokandov, M.; Shoemaker, B.A.; Simonyan, V.; Song, J.S.; Thiessen, P.A.; Yamashita, R.A.; Yin, J.J.; Zhang, D. & Bryant, S.H. (2005). CDD: a conserved domain database for protein classification. *Nucl Acids Res,* Vol. 33, pp. D192–196, ISSN 0305-1048

Masashi, Y.; Watanobe, H. & Terano, A. (2001). Central regulation of hepatic function by neuropeptides. *J Gastroenterol*, Vol. 36, pp. 361–367, ISSN 0002-9270

Menschaert, G.; Vandekerckhove, T.T.; Baggerman, G.; Schoofs, L.; Luyten, W. & Van Criekinge, W. (2010). Peptidomics coming of age: a review of contributions from a bioinformatics angle. *J Proteome Res,* Vol. 9, No. 5, pp. 2051–2061, ISSN 1535-3893

Rouille, Y.; Duguay, S.J.; Lund, K.; Furuta, M.; Gong, Q.; Lipkind, G.; Oliva, A.A., Jr.; Chan, S.J. & Steiner, D.F. (1995). Proteolytic processing mechanisms in the biosynthesis of neuroendocrine peptides: the subtilisin-like proprotein convertases. *Front Neuroendocrinol,* Vol. 16, No. 4, pp. 322–361, ISSN 0091-3022

Schlesinger, D.H.; Pickart, L. & Thaler, M.M. (1977). Growth-modulating serum tripeptide is glycyl-histidyl-lysine. *Experientia,* Vol. 33, No. 3, pp. 324–325, ISSN 0014-4754

Schoofs, L. & Baggerman, G. (2003). Peptidomics in Drosophila melanogaster. *Brief Funct Genomic Proteomic,* Vol. 2, No. 2, pp. 114–120, ISSN 2041-2647

Smith, R.F. & Smith, T.F. (1990). Automatic generation of primary sequence patterns from sets of related protein sequences. *Proc Natl Acad Sci USA*, Vol. 87, No. 1, pp. 118–122, ISSN 1091-6490

Taylor, W.R. & Jonassen, I. (2004). A structural pattern-based method for protein fold recognition. *Proteins*, Vol. 56, No. 2, pp. 222–234, ISSN 0887-3585

Vandenborne, K.; Roelens, S.A.; Darras, V.M.; Kuhn, E.R. & Van der, G.S. (2005). Cloning and hypothalamic distribution of the chicken thyrotropin-releasing hormone precursor cDNA. *J Endocrinol*, Vol. 186, No. 2, pp. 387–396, ISSN 0022-0795


Ubuka, T.; Morgan, K.; Pawson, A.J.; Osugi, T.; Chowdhury, V.S.; Minakata, H.; Tsutsui, K.; Millar, R.P. & Bentley, G.E. (2009). Identification of human GnIH homologs, RFRP-1 and RFRP-3, and the cognate receptor, GPR147 in the human hypothalamic pituitary axis. *PLoS ONE*, Vol. 4, No. 12, e8400, ISSN 1932-6203

**7**

### **Database Mining: Defining the Pathogenesis of Inflammatory and Immunological Diseases**

Fan Yang, Irene Hwa Yang, Hong Wang and Xiao-Feng Yang *Department of Pharmacology, Cardiovascular Research Center, Temple University School of Medicine, Philadelphia, U.S.A.* 

#### **1. Introduction**


Cardiovascular disease (CVD) is a leading cause of mortality in developed countries (Jan et al., 2010; Yang et al., 2008). Despite a long-held understanding and strong characterization of the traditional and non-traditional risk factors for CVD, some mechanisms of CVD onset have only recently been uncovered. Atherosclerosis is a chronic inflammatory autoimmune disease, and its progression involves both the innate and adaptive immune systems. Using new concepts and technologies to improve the current understanding of the molecular pathogenesis of inflammatory and immune responses would lead to the future development of novel therapeutics for these diseases.

Biomedical literature and databases, available in electronic form, contain a vast amount of knowledge resulting from experimental research (Ishii et al., 2007; Palakal et al., 2007). In the past decade, both traditional hypothesis-driven research and discovery-driven "-omics" research, including genomics, transcriptomics (Liang et al., 2005), proteomics, metabolomics, glycomics, lipidomics, localizomics, protein-DNA interactomics, protein-protein interactomics, fluxomics, phenomics (Joyce & Palsson, 2006), and antigen-omics (http://www.cancerimmunity.org/links/databases.htm) (Houle et al., 2010; Shimokawa et al., 2010; Weinstein, 1998; 2002), have generated a tremendous amount of data and established many searchable databases based on experimental data. These databases include PubMed, the nucleotide and protein databases, and other databases generated by the National Institutes of Health (NIH)/National Center for Biotechnology Information (NCBI) (see the NCBI handbook at http://www.ncbi.nlm.nih.gov/books/NBK21101/) and other institutions. This development has not only provided resources, but also raised unprecedented challenges and opportunities for biomedical scientists to develop more systemic and panoramic approaches to analyze the data contained in the databases and generate new hypotheses. The mismatch between the vast amount of experimental data and the variety of searchable databases on the one hand, and the relatively small number of database-mining research papers on the other (fewer than 50 papers on database mining in inflammation and immune responses are listed in PubMed), indicates the challenges that experimental biomedical scientists face, which include both technical/methodological difficulties and out-of-date concepts.

Traditionally, searching the medical literature using the Index Medicus was the major approach for biomedical scientists to identify knowledge gaps and prepare new hypotheses. However, this approach has been significantly enhanced by more systemic approaches such as *1)* NCBI-PubMed search and Google Scholar search; *2)* experimentally screening cDNA libraries and various arrays (nucleic acid arrays, antibody arrays, protein arrays and metabolic arrays) (King et al., 2005; Loza et al., 2007; Pandey et al., 2004; Warner & Dieckgraefe, 2002); and *3)* mining experimental databases (Chen et al., 2010; Jan et al., 2010; Ng et al., 2004; Yang et al., 2006a; Yang et al., 2006b; Yin et al., 2009). The screening analysis of microarray data often requires bioinformatic methods, algorithms, and expertise. In comparison, database mining offers several advantages. First, database mining requires much less bioinformatic assistance in each laboratory than the generation of algorithms required in microarray analyses, since the purpose of generating databases is to use bioinformatic approaches to organize the experimental data so that biomedical scientists can mine them easily (Spasic et al., 2005). Second, database mining enables full-value extraction from costly experimental data, and third, it provides panoramic analyses of existing knowledge gaps and generates new hypotheses for further experimental research. However, database mining requires biomedical scientists to make conceptual advances more than it requires technical assistance. The purpose of database mining is to analyze experimental data deposited by various research projects, rather than to predict theoretical results based on purely theoretical bioinformatic studies. Thus, database mining is not limited to sequence comparisons of nucleic acids and proteins (Mount, 2004), sequence alignments, analysis of hydrophobicity index, and functional domain prediction of proteins. Additionally, database mining is not generally listed as a required course for graduate and postdoctoral studies, which makes it a challenge to properly train young biomedical scientists in essential database mining techniques.
On top of these challenges, reviewers of database mining manuscripts for peer-reviewed journals often mistakenly regard the experimental data deposited in electronic form in databases as "non-experimental or theoretical" and demand unreasonable additional verifying experiments, sometimes even requiring the use of outdated experimental techniques or methods. To overcome these difficulties, bioinformatic scientists will have to work together with biomedical colleagues and delve into the biological significance of database mining projects, rather than sticking to an argument of "no algorithms means no bioinformatics". Already, more and more database mining papers have been published as scientists put aside their differences. For example, the 2011 (18th) database issue of the journal "Nucleic Acids Research" features descriptions of 96 new and 83 updated online databases covering various areas of molecular biology (Galperin & Cochrane, 2011). The Nucleic Acids Research online Database Collection, available at: http://www.oxfordjournals.org/nar/database/a/, now lists 1330 carefully selected molecular biology databases. In addition, 32 databases and analysis resources of immunological interest have been established (Salimi et al., 2010). Moreover, our recent invited review lists 11 B cell antigen epitope databases and 13 T cell antigen epitope analysis resources (Jan et al., 2010). This progress suggests that data mining has gradually been accepted as a mainstream practice for analyzing experimental data and generating new hypotheses for various projects (Salimi et al., 2010).

Our lab has successfully pioneered major advances in database mining in the fields of adaptive immune reactions, innate immune responses, and inflammation (Chen et al., 2010; Jan et al., 2010; Ng et al., 2004; Virtue, 2011; Yang et al., 2006a; Yang et al., 2006b; Yin et al., 2009). In this chapter, we summarize the general approaches, principles, and databases used, and the new working models proposed, in our database mining research. This discussion should be useful for most biomedical scientists, since many are not involved in generating bioinformatic algorithms but may want to use database mining methods in their research, either as parts of existing experimental studies or as free-standing projects. Of note, the database mining concept is not "brand new". Medical research has a long history of full-value extraction from costly data. For example, a meta-analysis uses a statistical approach to combine the results of several epidemiological studies that address a set of related research hypotheses. This practice started well over 100 years ago and has been widely used in research on various diseases (http://en.wikipedia.org/wiki/Meta-analysis) (Egger & Smith, 1997; Egger et al., 1997). We believe that the practice of database mining will become a routine exercise to identify existing knowledge gaps and to generate new hypotheses.

#### **2. Principles of database mining**


In recent years, many databases regarding immune responses and inflammation have been established (Jan et al., 2010; Yang et al., 2006a), which has expanded the scope and depth of the publicly searchable online repertoire of tools. The results derived from database mining analyses have become parts of many research papers or free-standing papers. Although projects may vary in format, database mining approaches follow the same set of principles (Fig. 1): *1)* Hypothesis: As with experimental projects, a clearly presented hypothesis, based on a search of the current biomedical literature in a given field and on previous experimental data from the lab, is required to carry out a database mining project, as we reported (Ng et al., 2004; Yan et al., 2004). Of note, the database mining referred to here is a free-standing project rather than a part of experimental research; *2)* Scope: In terms of gene numbers, the scope of database mining is far larger than that examined in experimental approaches. For example, our own research examined mRNA transcript expression of about 30 genes, including all the reported toll-like receptors, NOD-like receptors, and inflammatory caspases, in more than ten tissues. This scope allows us to obtain a panoramic view of the expression of inflammatory pathways, rather than focusing on a single gene in many tissues (Yin et al., 2009); *3)* Suitable databases: Databases suitable for examining the hypothesis must be available for online analytic search, just as methods and reagents must be available for experimental projects; *4)* Sizable experimentally verified data for generating confidence intervals with statistical significance: To consolidate the results generated from database mining, experimentally verified data published by various laboratories can be used to generate statistically significant confidence intervals with the same online analysis tools, as we reported (Virtue, 2011). In that study, our analysis in TargetScan yielded 524 microRNAs, which were predicted to participate in 1368 unique interactions with the 33 inflammatory gene mRNAs. To ensure relevance, we examined the context value and context percentage of experimentally verified microRNAs. Confidence intervals were generated from 45 interactions between 28 experimentally verified human microRNAs and 36 genes found within TarBase, an online database of experimentally verified microRNAs (http://diana.cslab.ece.ntua.gr/tarbase/) (Papadopoulos et al., 2009; Sethupathy et al., 2006). These experimental interactions were also selected based on their confirmation by luciferase reporter assays and single site specificity. The 45 microRNA-mRNA interactions that met these criteria were then evaluated in TargetScan to determine the microRNA context values and percentages. Analysis of these data yielded a mean ± standard deviation (SD) of -0.25 ± 0.12 for context value and 76.07 ± 19.07 for context percentage. The 95% confidence intervals of the means were then constructed, and the limit on the less stringent side (the mean offset by 1.96 × SD/SQRT(n)) was calculated for context percentage (76.07 - 1.96 × (19.07/SQRT(46)) = 76.07 - 5.51 = 70.56) and for context value (-0.25 + 1.96 × (0.12/SQRT(46)) = -0.25 + 0.03 = -0.22). All predicted microRNA interactions with a context value ≤ -0.22 and a context percentage ≥ 70 were accepted. Using these thresholds for context value and percentage, 297 out of the 524 predicted microRNAs met the criteria and were considered equivalent to the experimentally verified microRNAs. In order to generate valid confidence intervals, sample sizes have to be estimated with statistical tools of sample size determination (Rosner, 2000), as we reported (Ng et al., 2004); *5)* Verifiable methods: Experimental methods are available to verify the data generated by database mining (Yan et al., 2004); and *6)* A new working model/hypothesis: Through database mining, a new knowledge gap is identified, and a new hypothesis is proposed to test fewer, much more focused genes in further experiments. The following sections illustrate these principles with our own publications (Chen et al., 2010; Jan et al., 2010; Ng et al., 2004; Virtue, 2011; Yang et al., 2006a; Yang et al., 2006b; Yin et al., 2009).

Develop a well-presented hypothesis based on experimental data and analysis of the current knowledge gap

Determine the scope, focus, and pathway for database mining

Find suitable databases that contain experimental data, which are organized with appropriate bioinformatic expertise

Find sizable experimental data for generating confidence intervals, which are consolidated with statistical tools of sample size determination

Identify verifiable experimental methods

Propose a new working hypothesis/working model for further experimental research

Fig. 1. Database mining flow-chart and principles.
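The threshold calculation described under principle 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the cited study: the function names and the three example predictions are hypothetical, n = 46 is taken from the formula quoted in the text, and the context-value limit is taken on the side of the interval that reproduces the reported cutoff of -0.22.

```python
import math


def ci_limit(mean, sd, n, side, z=1.96):
    """Return one limit of the z-based confidence interval of the mean."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width if side == "lower" else mean + half_width


# Summary statistics quoted in the text for the verified interactions.
pct_cutoff = ci_limit(76.07, 19.07, 46, side="lower")  # context percentage
val_cutoff = ci_limit(-0.25, 0.12, 46, side="upper")   # context value


def is_accepted(context_value, context_pct, value_max=-0.22, pct_min=70):
    """Apply the acceptance rule stated in the text (value <= -0.22, pct >= 70)."""
    return context_value <= value_max and context_pct >= pct_min


# Hypothetical predictions: (name, context value, context percentage).
predictions = [
    ("miR-A", -0.30, 85),
    ("miR-B", -0.10, 90),
    ("miR-C", -0.25, 60),
]
accepted = [name for name, value, pct in predictions if is_accepted(value, pct)]

print(round(pct_cutoff, 2), round(val_cutoff, 2), accepted)
```

Rounded to two decimals, the two limits match the 70.56 and -0.22 quoted above; only the first hypothetical prediction satisfies both criteria.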

#### **3. Database mining example 1: Stimulation-responsive alternative splicing is an important mechanism in generating self-antigen epitopes (Ng et al., 2004; Xiong et al., 2006; Yan et al., 2004; Yang et al., 2006a; Yang, 2007)**

In our invited review, we pointed out that the identification and molecular characterization of self-antigens that are expressed by human malignancies and are capable of eliciting anti-tumor immune responses in patients have been an active field in tumor immunology (Yang & Yang, 2005). More than 2,000 tumor antigens have been identified, and most of these antigens are self-antigens (Yang & Yang, 2005). Despite this, the important question of how non-mutated self-protein antigens, generated from normal cells and tumor cells, gain immunogenicity and trigger immune recognition remained unanswered (Yang & Yang, 2005). Mutations may be responsible for some aspects of elevated immunogenicity underlying certain tumor-specific antigens (p53 and Ras), while chromosome translocations and abnormalities, such as expression of the fusion oncogene Bcr-Abl in chronic myelogenous leukemia (Clark et al., 2001; Pinilla-Ibarz et al., 2000; Yotnda et al., 1998; Zorn, 2001; Yang et al., 2002; Yang et al., 2001), are responsible for other aspects. However, the mechanism underlying the immunogenicity of most non-mutated self-tumor antigens is their aberrant overexpression in tumors (Yang & Yang, 2005). Zinkernagel *et al* (Zinkernagel & Hengartner, 2001) suggested that the overexpression of self-antigens or novel antigenic structures overcomes the threshold of antigen concentration at which an immune response is initiated (Shlomchik et al., 2001). This threshold might be lower for certain untolerized regions of certain antigen epitopes. Genes encoding tumor antigens are often overexpressed by up to 100-fold. These genes are identified by SEREX, the serological identification of self-antigens by screening a cDNA library with patients' sera (Sahin et al., 1995), a method that may have an inherent bias toward the detection of abundant transcripts (Preuss et al., 2002). The overexpression of tumor antigens in tumors may result from transcriptional and post-transcriptional mechanisms.
We recently demonstrated that overexpression of tumor antigen CML66L in leukemia cells and tumor cells via alternative splicing is the mechanism for its immunogenicity in patients with tumors (Yan et al., 2004; Yang et al., 2001). This not only illustrates the principle of overexpression of tumor antigen, but also elucidated alternative splicing as its molecular mechanism (Yan et al., 2004). A significant proportion of the SEREX-defined self-tumor antigens are autoantigens (Chen, 2004), for example, CML28 that we identified is autoantigen Rrp46p (Yang et al., 2002). Using this information gathered from SEREX, we hypothesized that alternative splicing is a general mechanism for the overexpression of untolerized self-antigen epitopes in tumors and autoimmune diseases. In order to test this hypothesis, we database mined the NIH-NCBI AceView database to examine the potential mechanisms of how non-mutated self-proteins gain new untolerized structures that trigger immune recognition (Ng et al., 2004). The AceView database provides a curated, comprehensive, and non-redundant sequence representation of all public mRNA sequences (mRNAs from GenBank or RefSeq, and single pass cDNA sequences from dbEST and Trace). These experimental cDNA sequences are first

Database Mining: Defining the Pathogenesis of Inflammatory and Immunological Diseases 149

hyperlipidemia, oxidized low density lipoprotein, cigarette smoking, diabetes, hypertension, obesity (Ross, 1992), and hyperhomocysteinemia (HHcy), etc. Chronic vascular inflammation is an essential requirement for the progression of atherosclerosis in patients (Hansson, 2005). Recent progress in characterizing pathogen-associated molecular patterns' (PAMPs) receptor families (PAMP-Rs) and inflammasomes (the protein complex for activation of caspase-1) has further emphasized the importance of proinflammatory cytokine interleukin-1β (IL-1β) signaling in bridging proatherogenic risk factors to initiate inflammation (Yang et al., 2008). However, constitutive expression levels and expression readiness of PAMP-Rs, inflammasome components and proinflammtory caspases in tissues remained poorly defined. We hypothesized that PAMP-Rs, inflammasome components, proinflammatory caspases, IL-1, and IL-18 are differentially expressed in cardiovascular tissues. To examine this hypothesis, we mined the NCBI-UniGene database, analyzed cDNA cloning and DNA sequencing data from tissue cDNA libraries and studied expression profiles of Toll-like receptors (TLRs), cytosolic nucleotide binding and oligomerization domain (NOD)-like receptors (NLRs), inflammasome components, inflammatory caspases, and caspase-1 cleavable inflammatory cytokines. The UniGene database provides an organized view of the transcriptome with information on protein similarities, gene expression, cDNA clone reagents, and genomic location (http://www.ncbi.nlm.nih.gov/unigene), in which each UniGene entry is a set of transcript sequences that appear to come from the same transcription locus (gene or expressed pseudogene). 
After analyzing the data from the UniGene database, we made several important findings: (1) Among 11 tissues examined, vascular tissues and heart express fewer types of TLRs and NLRs than immune system tissues including blood, lymph nodes, thymus, and trachea; (2) Brain, lymph nodes, and thymus do not express proinflammatory cytokines IL-1β and IL-18 constitutively, suggesting that these two cytokines need to be upregulated in response to inflammatory stimuli in the tissues; and (3) based on the expression data of three characterized inflammasomes (NALP1, NALP3 and IPAF inflammasomes), the examined tissues can be classified into three tiers: the first tier tissues including brain, placenta, blood, and thymus express inflammasome(s) in constitutive status; the second tier tissues have inflammasome(s) in nearly-ready expression status (with the requirement of upregulation of one component); and the third tier tissues like heart and bone marrow, require upregulation of at least two components in order to assemble functional inflammasomes. Based on the expression readiness of inflammasomes in tissues, we propose a new working model of three-tier responsive expression of inflammasomes in tissues and suggest a new concept of third tier tissues' inflammatory privilege, which provides an insight on the differences of tissues in initiating acute inflammations. This model suggests that *(a)* first-tier tissues with constitutively expressed inflammasomes initiate inflammation quicker than second and third-tier tissues; and *(b)* second tier tissues (requiring one component of upregulation) including vascular tissue, and third tier tissues including heart (requiring more than one component upregulation) are in an inducible expression status of inflammasomes. 
The inducible expressions of inflammasomes are presumably mediated through various signal pathways that initiate inflammation, and the interplay between the signal pathways, may take a longer time and overcome a higher threshold than first tier tissues. Traditional concepts of immune privilege suggests a protective mechanism from autoimmune destruction based on the lack of expression of antigen-presenting self-major compatibility complex (MHC) molecules in tissues (Yang & Yang, 2005). The lack of expression of self-MHCs in immune privileged tissues including

co-aligned on the genome, and then clustered into a minimal number of alternative transcript variants and grouped into genes (http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/). Our results showed that alternative splicing occurs in 100% of autoantigen transcripts. This is significantly higher than the approximately 42% rate of alternative splicing observed in the 9554 randomly selected human gene transcripts (*p*<0.001). Within the isoform-specific regions of the autoantigens, 92% and 88% encoded MHC class I and class II-restricted T-cell antigen epitopes, respectively, and 70% encoded antibody binding domains. Alternative splicing can be canonical or non-canonical. Canonical splicing removes introns that have 5'GT and 3'AG consensus flanking sequences (GT-AG rule) (Lewin, 2000). Our results demonstrated that 80% of the autoantigen transcripts undergo non-canonical alternative splicing, which is significantly higher than the less than 1% rate in randomly selected gene transcripts (*p*<0.001). These studies suggest that non-canonical alternative splicing may be an important mechanism for the generation of untolerized epitopes that may lead to autoimmunity. Furthermore, the product of a transcript that does not undergo alternative splicing is unlikely to be a target antigen in autoimmunity (Ng et al., 2004). To consolidate this finding, we also examined the effect of the proinflammatory cytokine tumor necrosis factor-α (TNF-α) on the prototypic alternative splicing factor (ASF)/SF2 in the splicing machinery. Our results show that TNF-α downregulates ASF/SF2 expression in cultured muscle cells. This result correlates with our finding of reduced expression of ASF/SF2 in inflamed muscle cells from patients with autoimmune myositis (Xiong et al., 2006). Based on our and others' data, we recently proposed a new model of stimulation-responsive splicing for the selection of autoantigens and self-tumor antigens (Yang et al., 2006a) [also see Fig. 1 at (http://preview.ncbi.nlm.nih.gov/pubmed/16890493)]. Our new model theorizes that the significantly higher rates of alternative splicing of autoantigen and self-tumor antigen transcripts that occur in response to stimuli, such as proinflammatory cytokines, could induce extra-thymic expression of untolerized antigen epitopes to elicit autoimmune and anti-tumor responses. By using the B lymphocyte (B cell) antigen epitope analysis databases and T cell antigen epitope analysis databases listed in the tables of our recent invited review (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2858284/pdf/JBB2010-459798.pdf) (Jan et al., 2010), we showed that protein sequences encoded by alternatively spliced exons are sufficient to encode antibody-binding antigen epitopes and major histocompatibility complex (MHC) class I- and MHC class II-restricted T cell antigen epitopes that stimulate B lymphocytes and T lymphocytes, respectively (Ng et al., 2004). Of note, our model applies not only to nonmutated self-tumor antigens associated with tumors and autoantigens associated with various autoimmune diseases, but also to the composition and expansion of the self-antigen repertoire of stem cells. Our additional database mining study has generated a new model of differential epitope processing for MHC class I-restricted viral antigen epitopes and tumor antigen epitopes (Yang et al., 2006b). Our reports have demonstrated the principles of database mining in adaptive immune responses.
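
The GT-AG rule mentioned above lends itself to a simple check. The following Python sketch is illustrative only: the function names and the toy intron sequences are assumptions introduced here, not part of the original analysis, and the sketch classifies individual introns rather than whole transcripts.

```python
from typing import Sequence


def is_canonical_intron(intron: str) -> bool:
    """Return True if the intron starts with GT and ends with AG (GT-AG rule)."""
    seq = intron.upper().replace("U", "T")  # accept DNA or RNA alphabets
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


def noncanonical_fraction(introns: Sequence[str]) -> float:
    """Fraction of introns that do not follow the GT-AG rule."""
    if not introns:
        raise ValueError("need at least one intron")
    noncanonical = sum(not is_canonical_intron(i) for i in introns)
    return noncanonical / len(introns)


# Hypothetical toy introns, not taken from any database.
toy_introns = ["GTAAGTCCTTAG", "GCAAGTCCTTAG", "ATATCCTTAC", "gtaagtttcag"]
print(noncanonical_fraction(toy_introns))  # 0.5
```

In a real study the intron boundaries would come from transcript-to-genome alignments; here they are simply given as strings.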

#### **4. Database mining example 2: Three-tier model for inflammasome/caspase-1 activation and inflammation privilege of tissues are important mechanisms underlying the differences in the readiness of inflammation initiation in tissues**

Atherosclerosis is the leading cause of morbidity and mortality in industrialized society. Several "traditional" risk factors have been identified for atherosclerosis, including hyperlipidemia, oxidized low density lipoprotein, cigarette smoking, diabetes, hypertension, obesity (Ross, 1992), and hyperhomocysteinemia (HHcy). Chronic vascular inflammation is an essential requirement for the progression of atherosclerosis in patients (Hansson, 2005). Recent progress in characterizing the receptor families for pathogen-associated molecular patterns (PAMPs), termed PAMP-Rs, and inflammasomes (the protein complexes that activate caspase-1) has further emphasized the importance of proinflammatory cytokine interleukin-1β (IL-1β) signaling in bridging proatherogenic risk factors to the initiation of inflammation (Yang et al., 2008). However, the constitutive expression levels and expression readiness of PAMP-Rs, inflammasome components and proinflammatory caspases in tissues remained poorly defined. We hypothesized that PAMP-Rs, inflammasome components, proinflammatory caspases, IL-1, and IL-18 are differentially expressed in cardiovascular tissues. To examine this hypothesis, we mined the NCBI-UniGene database, analyzed cDNA cloning and DNA sequencing data from tissue cDNA libraries, and studied the expression profiles of Toll-like receptors (TLRs), cytosolic nucleotide binding and oligomerization domain (NOD)-like receptors (NLRs), inflammasome components, inflammatory caspases, and caspase-1 cleavable inflammatory cytokines. The UniGene database provides an organized view of the transcriptome with information on protein similarities, gene expression, cDNA clone reagents, and genomic location (http://www.ncbi.nlm.nih.gov/unigene); each UniGene entry is a set of transcript sequences that appear to come from the same transcription locus (gene or expressed pseudogene). 
After analyzing the data from the UniGene database, we made several important findings: (1) among the 11 tissues examined, vascular tissues and heart express fewer types of TLRs and NLRs than immune system tissues including blood, lymph nodes, thymus, and trachea; (2) brain, lymph nodes, and thymus do not express the proinflammatory cytokines IL-1β and IL-18 constitutively, suggesting that these two cytokines need to be upregulated in response to inflammatory stimuli in these tissues; and (3) based on the expression data of three characterized inflammasomes (NALP1, NALP3 and IPAF inflammasomes), the examined tissues can be classified into three tiers: first-tier tissues, including brain, placenta, blood, and thymus, express inflammasome(s) constitutively; second-tier tissues have inflammasome(s) in a nearly ready expression status (requiring upregulation of one component); and third-tier tissues, such as heart and bone marrow, require upregulation of at least two components in order to assemble functional inflammasomes. Based on the expression readiness of inflammasomes in tissues, we propose a new working model of three-tier responsive expression of inflammasomes in tissues and suggest a new concept of third-tier tissues' inflammatory privilege, which provides insight into the differences among tissues in initiating acute inflammation. This model suggests that *(a)* first-tier tissues with constitutively expressed inflammasomes initiate inflammation more quickly than second- and third-tier tissues; and *(b)* second-tier tissues, including vascular tissue (requiring upregulation of one component), and third-tier tissues, including heart (requiring upregulation of more than one component), are in an inducible expression status of inflammasomes. 
The inducible expression of inflammasomes is presumably mediated through various signaling pathways that initiate inflammation; because of the interplay between these pathways, it may take longer and must overcome a higher threshold than in first-tier tissues. The traditional concept of immune privilege suggests a protective mechanism against autoimmune destruction based on the lack of expression of antigen-presenting self-major histocompatibility complex (MHC) molecules in tissues (Yang & Yang, 2005). The lack of expression of self-MHCs in immune privileged tissues, including testis, results in the failure of the self-antigen presentation that stimulates the host's immune system, thereby protecting immune privileged tissues from autoimmune destruction. Similarly, we proposed a new concept of tissues' inflammatory privilege that emphasizes a protective mechanism against tissue destruction mediated by inflammasome/IL-1β-based innate immune responses. In our new concept of tissues' inflammatory privilege, vascular tissue and heart express disproportionately fewer types of TLRs and NLRs and may only inducibly express inflammasomes, thus protecting against uncontrolled inflammatory destruction mediated by inflammasome-based innate immune responses (Streilein & Stein-Streilein, 2000). Our new concept and model may also explain the potential differences between cardiovascular tissues and other tissues in initiating acute inflammation. The first-tier tissues may have a higher probability of experiencing acute inflammation than the second-tier and third-tier tissues.
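
The tier assignment described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the component lists for the NALP1, NALP3 and IPAF inflammasomes are simplified placeholders, the function name `tissue_tier` is hypothetical, and the example expression sets are invented rather than taken from UniGene.

```python
from typing import Mapping, Sequence, Set

# Simplified, illustrative component lists (an assumption for this sketch).
INFLAMMASOMES: Mapping[str, Sequence[str]] = {
    "NALP1": ("NALP1", "ASC", "CASP1", "CASP5"),
    "NALP3": ("NALP3", "ASC", "CASP1"),
    "IPAF": ("IPAF", "CASP1"),
}


def tissue_tier(expressed: Set[str],
                inflammasomes: Mapping[str, Sequence[str]] = INFLAMMASOMES) -> int:
    """Assign a tier from the inflammasome closest to complete expression.

    Tier 1: at least one inflammasome has all components expressed.
    Tier 2: the best inflammasome lacks exactly one component.
    Tier 3: every inflammasome lacks two or more components.
    """
    fewest_missing = min(
        sum(component not in expressed for component in components)
        for components in inflammasomes.values()
    )
    if fewest_missing == 0:
        return 1
    if fewest_missing == 1:
        return 2
    return 3


# Invented expression sets for three unnamed tissues (not UniGene data).
examples = {
    "tissue_A": {"NALP3", "ASC", "CASP1"},
    "tissue_B": {"NALP3", "ASC"},
    "tissue_C": {"ASC"},
}
for name, expressed in examples.items():
    print(name, "-> tier", tissue_tier(expressed))
```

Taking the minimum over inflammasomes reflects the idea that a tissue is as ready as its most nearly complete inflammasome; other readings of the model are possible.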

We and others have shown that an elevated level of plasma homocysteine (Hcy), termed hyperhomocysteinemia (HHcy), is an independent risk factor, equivalent to hyperlipidemia, for cardiovascular diseases (CVD) including coronary heart disease and stroke (Maron & Loscalzo, 2009; Wang et al., 2003; Zhang et al., 2009). Recently, we performed an additional database mining study to examine the expression of more than 20 homocysteine metabolic enzymes and methylation enzymes in more than 20 human and mouse tissues (Chen et al., 2010). We generated a new model of how hypomethylation (a post-translational protein modification) modulates the expression of homocysteine-metabolizing enzymes (Chen et al., 2010). Taken together, our studies have demonstrated the principles of database mining in innate immune reactions.

#### **5. Database mining example 3: A group of anti-inflammatory microRNAs may play critical roles in inhibiting the expression of proatherogenic molecules**

Previous research has established that numerous genes are upregulated in atherogenesis through epigenetic or genetic transcriptional mechanisms (Turunen et al., 2009). However, transcription-independent mechanisms have received far less scrutiny. Recent publications suggest that microRNAs, a newly characterized class of short (18-24 nucleotides long), endogenous, non-coding RNAs (Bartel, 2009), contribute to the development of particular disease states by regulating diverse biological processes such as cell growth, differentiation, proliferation, and apoptosis (Zhang, 2008). This biological control is accomplished by posttranscriptional gene silencing (Naeem et al., 2010) through Watson-Crick base-pairing, predominantly at the 3'-untranslated region (3'UTR) of messenger RNAs (mRNAs) (Cordes et al., 2009; Rasmussen et al., 2010). This pairing can be further characterized as "perfect" or "near perfect", leading to target mRNA cleavage and degradation, or "imperfect", causing the inhibition of mRNA translation (Naeem et al., 2010). With the identification and sequencing of more than 800 human microRNAs thus far, it is thought that up to 30% of human genes may be regulated by microRNAs (Cheng et al., 2010; Zhang, 2008). Supporting evidence suggests that microRNAs function as key players during critical stages of cellular development and finely tune gene expression in the maintenance of routine cellular functioning (Baek et al., 2008). Furthermore, microRNAs can act on transcription factors, which leads to a broad indirect cellular effect as a result of their widespread gene-modulating nature. In addition, recent research has demonstrated that changes in microRNA expression patterns are connected to several pathological conditions including cardiovascular disease and atherosclerosis. These studies primarily focused on characterizing microRNAs in atherosclerosis disease models, which had been previously reported to have elevated expression in disease conditions (Haver et al., 2010; Rink & Khanna, 2010). Thus, current microRNA research has failed to provide a panoramic view of how microRNAs regulate proatherogenic inflammatory genes and of whether upregulation of proatherogenic inflammatory genes is the result of anti-inflammatory microRNA downregulation.

To address these issues, we hypothesized that a group of anti-inflammatory microRNAs may regulate the expression of proatherogenic molecules (Virtue, 2011). We then developed a novel database mining approach using three types of databases: the online microRNA target prediction software TargetScan (http://www.targetscan.org/) (Dong et al., 2010; Rosero et al., 2010; Vickers & Remaley, 2010), TarBase, an online database of experimentally verified microRNA targets (http://diana.cslab.ece.ntua.gr/tarbase/) (Papadopoulos et al., 2009; Sethupathy et al., 2006), and the online microRNA.org expression database (http://www.microrna.org/microrna/home.do) (Betel et al., 2008), in concert with a statistical analysis strategy established in our previous database mining publications (Chen et al., 2010; Ng et al., 2004; Shen et al., 2010; Yang et al., 2006b; Yin et al., 2009). Our database mining yielded several key findings. First, we discovered that the expression of 33 inflammatory genes (mRNAs) is upregulated in atherosclerotic lesions. Second, the mRNAs of those genes contain structural features in their 3'UTRs for potential regulation by microRNAs; furthermore, these structural features are statistically identical to experimentally verified 3'UTR microRNA binding sites. Third, 21 out of the 33 inflammatory genes (64%) are targeted by highly expressed microRNAs while the remaining 12 inflammatory genes (36%) are targeted by normally expressed microRNAs. 
Fourth, 10 of the 21 inflammatory genes targeted by highly expressed microRNAs (48%) were targeted by a single microRNA, suggesting the specificity of microRNA regulation. Meanwhile, 12 out of the 25 highly expressed microRNAs (48%) targeted single inflammatory genes while the other 13 microRNAs targeted multiple inflammatory genes. Finally, the microRNAs targeting atherosclerotic inflammatory genes use statistically higher numbers of "poorly conserved" binding interactions than the control group of microRNAs, based on the confidence interval. These results suggest that the microRNAs regulating atherosclerotic inflammatory genes possess special features (Virtue, 2011).
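
The counts reported above are the kind of summary one obtains from a gene-to-microRNA targeting map. The Python sketch below shows one way to derive such counts and re-derives the stated percentages; the function names and the toy map are hypothetical, and no data from TargetScan, TarBase or microRNA.org is used.

```python
from collections import defaultdict
from typing import Dict, Mapping, Set


def summarize_targeting(gene_to_mirnas: Mapping[str, Set[str]]) -> Dict[str, int]:
    """Count targeted genes and the one-to-one cases in a gene -> microRNA map."""
    targeted = {g: m for g, m in gene_to_mirnas.items() if m}
    mirna_to_genes = defaultdict(set)
    for gene, mirnas in targeted.items():
        for mirna in mirnas:
            mirna_to_genes[mirna].add(gene)
    return {
        "genes_total": len(gene_to_mirnas),
        "genes_targeted": len(targeted),
        "genes_with_single_mirna": sum(len(m) == 1 for m in targeted.values()),
        "mirnas_total": len(mirna_to_genes),
        "mirnas_with_single_gene": sum(len(g) == 1 for g in mirna_to_genes.values()),
    }


def percent(part: int, whole: int) -> int:
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Hypothetical toy map: gene -> highly expressed microRNAs predicted to target it.
toy_map = {"gene1": {"miR-a"}, "gene2": {"miR-a", "miR-b"}, "gene3": set()}
print(summarize_targeting(toy_map))

# Percentages quoted in the text, recomputed from the stated counts.
print(percent(21, 33), percent(12, 33), percent(10, 21), percent(12, 25))  # 64 36 48 48
```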

Previous research has shown that microRNAs participate in modulating atherosclerosis-related processes including hyperlipidemia (microRNA-33, microRNA-125a-5p), hypertension (microRNA-155), plaque rupture (microRNA-222, microRNA-210), and atherosclerosis itself (microRNA-21, microRNA-126) (Rink & Khanna, 2010). However, whether certain microRNAs play a role in preventing disease development remains unknown. One of the most interesting findings from our study is that the 25 microRNAs that are highly expressed under normal untreated conditions target 21 out of the 33 (64%) atherosclerosis-upregulated inflammatory genes. This important result suggests a novel mechanism in which a group of highly expressed anti-inflammatory microRNAs suppresses the upregulation of proatherogenic inflammatory genes under normal physiological conditions. It has been well established that microRNAs play important roles in fine-tuning developmental processes and participate in the development of diseases such as cancer. Our results are the first to suggest that microRNAs may play a protective role by suppressing proatherogenic genes to maintain healthy arteries. Our conclusion is supported by other publications, which show that 7 out of the 20 microRNAs identified in this study were downregulated in experimental studies by various proatherogenic factors (Chen et al., 2009; Elia et al., 2009; Ji et al., 2007). Together, our studies have demonstrated the principles of database mining in inflammation.

#### **6. Conclusion**

Active research in human and mouse genomes, transcriptomes, microRNA transcriptomes, proteomes, and antigen-omes in the past decade has generated a tremendous amount of data and established many searchable databases based on experimental data. This provides unprecedented opportunities for biomedical scientists to develop more systemic and panoramic approaches to analyze the databases and generate new hypotheses. In this chapter, we briefly summarize our pioneering efforts in using our new database mining methods to address important questions in inflammatory and immunological diseases. The new principles and basic methodologies of database mining developed in our laboratories are elucidated in the following studies: 1) a stimulation-responsive alternative splicing model for the generation of untolerized autoantigen epitopes; 2) a three-tier model for inflammasome/caspase-1 activation and inflammatory privilege of tissues; and 3) a group of anti-inflammatory microRNAs that inhibit proatherogenic gene expression during atherogenesis. With recent technological breakthroughs, database mining has provided significant new insights and hypotheses that specify novel directions for experimental research.

#### **7. Acknowledgements**

This work was partially supported by the National Institutes of Health Grants HL094451 and HL108910 (XFY), HL67033, HL82774, and HL77288 (HW). FY and IHY contributed equally to this work. Correspondence: Prof. Yang at xfyang@temple.edu. **Disclosures:** none declared.

#### **8. References**

Baek, D., J. Villen, C. Shin, F. D. Camargo, S. P. Gygi, and D. P. Bartel. "The Impact of MicroRNAs on Protein Output." *Nature* 455, no. 7209 (2008): 64-71.

Bartel, D. P. "MicroRNAs: Target Recognition and Regulatory Functions." *Cell* 136, no. 2 (2009): 215-33.

Betel, D., M. Wilson, A. Gabow, D. S. Marks, and C. Sander. "The MicroRNA.Org Resource: Targets and Expression." *Nucleic Acids Res* 36, no. Database issue (2008): D149-53.

Chen, N. C., F. Yang, L. M. Capecci, Z. Gu, A. I. Schafer, W. Durante, X. F. Yang, and H. Wang. "Regulation of Homocysteine Metabolism and Methylation in Human and Mouse Tissues." *FASEB J* 24, no. 8 (2010): 2804-17.

Chen, T., Z. Huang, L. Wang, Y. Wang, F. Wu, S. Meng, and C. Wang. "MicroRNA-125a-5p Partly Regulates the Inflammatory Response, Lipid Uptake, and Orp9 Expression in oxLDL-Stimulated Monocyte/Macrophages." *Cardiovasc Res* 83, no. 1 (2009): 131-9.

Chen, YT. "Serex Review." *Cancer Immunity* http://www.cancerimmunity.org/SEREX/ (2004).

Cheng, Y., N. Tan, J. Yang, X. Liu, X. Cao, P. He, X. Dong, S. Qin, and C. Zhang. "A Translational Study of Circulating Cell-Free MicroRNA-1 in Acute Myocardial Infarction." *Clin Sci (Lond)* 119, no. 2 (2010): 87-95.

Clark, R. E., I. A. Dodi, S. C. Hill, J. R. Lill, G. Aubert, A. R. Macintyre, J. Rojas, A. Bourdon, P. L. Bonner, L. Wang, S. E. Christmas, P. J. Travers, C. S. Creaser, R. C. Rees, and J. A. Madrigal. "Direct Evidence That Leukemic Cells Present HLA-Associated Immunogenic Peptides Derived from the Bcr-Abl B3a2 Fusion Protein." *Blood* 98, no. 10 (2001): 2887-93.

Cordes, K. R., N. T. Sheehy, M. P. White, E. C. Berry, S. U. Morton, A. N. Muth, T. H. Lee, J. M. Miano, K. N. Ivey, and D. Srivastava. "MiR-145 and MiR-143 Regulate Smooth Muscle Cell Fate and Plasticity." *Nature* 460, no. 7256 (2009): 705-10.

Dong, H., M. Paquette, A. Williams, R. T. Zoeller, M. Wade, and C. Yauk. "Thyroid Hormone May Regulate mRNA Abundance in Liver by Acting on MicroRNAs." *PLoS One* 5, no. 8 (2010).

Egger, M., and G. D. Smith. "Meta-Analysis. Potentials and Promise." *BMJ* 315, no. 7119 (1997): 1371-4.

Egger, M., G. D. Smith, and A. N. Phillips. "Meta-Analysis: Principles and Procedures." *BMJ* 315, no. 7121 (1997): 1533-7.

Elia, L., M. Quintavalle, J. Zhang, R. Contu, L. Cossu, M. V. Latronico, K. L. Peterson, C. Indolfi, D. Catalucci, J. Chen, S. A. Courtneidge, and G. Condorelli. "The Knockout of MiR-143 and -145 Alters Smooth Muscle Cell Maintenance and Vascular Homeostasis in Mice: Correlates with Human Disease." *Cell Death Differ* 16, no. 12 (2009): 1590-8.

Galperin, M. Y., and G. R. Cochrane. "The 2011 Nucleic Acids Research Database Issue and the Online Molecular Biology Database Collection." *Nucleic Acids Res* 39, no. Database issue (2011): D1-6.

Hansson, G. K. "Inflammation, Atherosclerosis, and Coronary Artery Disease." *N Engl J Med* 352, no. 16 (2005): 1685-95.

Haver, V. G., R. H. Slart, C. J. Zeebregts, M. P. Peppelenbosch, and R. A. Tio. "Rupture of Vulnerable Atherosclerotic Plaques: MicroRNAs Conducting the Orchestra?" *Trends Cardiovasc Med* 20, no. 2 (2010): 65-71.

Houle, D., D. R. Govindaraju, and S. Omholt. "Phenomics: The Next Challenge." *Nat Rev Genet* 11, no. 12 (2010): 855-66.

Ishii, N., K. Nakahigashi, T. Baba, M. Robert, T. Soga, A. Kanai, T. Hirasawa, M. Naba, K. Hirai, A. Hoque, P. Y. Ho, Y. Kakazu, K. Sugawara, S. Igarashi, S. Harada, T. Masuda, N. Sugiyama, T. Togashi, M. Hasegawa, Y. Takai, K. Yugi, K. Arakawa, N. Iwata, Y. Toya, Y. Nakayama, T. Nishioka, K. Shimizu, H. Mori, and M. Tomita. "Multiple High-Throughput Analyses Monitor the Response of E. coli to Perturbations." *Science* 316, no. 5824 (2007): 593-7.

Jan, M., S. Meng, N. C. Chen, J. Mai, H. Wang, and X. F. Yang. "Inflammatory and Autoimmune Reactions in Atherosclerosis and Vaccine Design Informatics." *J Biomed Biotechnol* 2010 (2010): 459798.

Ji, R., Y. Cheng, J. Yue, J. Yang, X. Liu, H. Chen, D. B. Dean, and C. Zhang. "MicroRNA Expression Signature and Antisense-Mediated Depletion Reveal an Essential Role of MicroRNA in Vascular Neointimal Lesion Formation." *Circ Res* 100, no. 11 (2007): 1579-88.

Joyce, A. R., and B. O. Palsson. "The Model Organism as a System: Integrating 'Omics' Data Sets." *Nat Rev Mol Cell Biol* 7, no. 3 (2006): 198-210.

Database Mining: Defining the Pathogenesis of Inflammatory and Immunological Diseases 155

Rosner, B. "Estimation of Sample Size and Power for Comparing Two Means." In

Ross, R. "Atherosclerosis." In *Cecil Textbook of Medicine*, edited by JB Wyngaarden, Smith,

Sahin, U., O. Tureci, H. Schmitt, B. Cochlovius, T. Johannes, R. Schmits, F. Stenner, G. Luo, I.

Salimi, N., W. Fleri, B. Peters, and A. Sette. "Design and Utilization of Epitope-Based Databases and Predictive Tools." *Immunogenetics* 62, no. 4 (2010): 185-96. Sethupathy, P., B. Corda, and A. G. Hatzigeorgiou. "Tarbase: A Comprehensive Database of Experimentally Supported Animal Microrna Targets." *Rna* 12, no. 2 (2006): 192-7. Shen, J., Y. Yin, J. Mai, X. Xiong, M. Pansuria, J. Liu, E. Maley, N. U. Saqib, H. Wang, and X.

Shimokawa, K., K. Mogushi, S. Shoji, A. Hiraishi, K. Ido, H. Mizushima, and H. Tanaka.

Shlomchik, M. J., J. E. Craft, and M. J. Mamula. "From T to B and Back Again: Positive

Spasic, I., S. Ananiadou, J. McNaught, and A. Kumar. "Text Mining and Ontologies in Biomedicine: Making Sense of Raw Text." *Brief Bioinform* 6, no. 3 (2005): 239-51. Streilein, J. W., and J. Stein-Streilein. "Does Innate Immune Privilege Exist?" *J Leukoc Biol* 67,

Turunen, M. P., E. Aavik, and S. Yla-Herttuala. "Epigenetics and Atherosclerosis." *Biochim* 

Vickers, K. C., and A. T. Remaley. "MicroRNAs in Atherosclerosis and Lipoprotein Metabolism." *Curr Opin Endocrinol Diabetes Obes* 17, no. 2 (2010): 150-5. Virtue, A, J. Mai, Y. Yin, S. Meng, T. Tran, X. Jiang, H. Wang, and X-F Yang. "Structural

Wang, H., X. Jiang, F. Yang, J. W. Gaubatz, L. Ma, M. J. Magera, X. Yang, P. B. Berger, W.

Warner, E. E., and B. K. Dieckgraefe. "Application of Genome-Wide Gene Expression

———. "'Omic' and Hypothesis-Driven Research in the Molecular Pharmacology of Cancer."

Xiong, Z., A. Shaibani, Y. P. Li, Y. Yan, S. Zhang, Y. Yang, F. Yang, H. Wang, and X. F. Yang.

Inflammatory Bowel Disease." *Inflamm Bowel Dis* 8, no. 2 (2002): 140-57.

Weinstein, J. N. "Fishing Expeditions." *Science* 282, no. 5389 (1998): 628-9.

*Curr Opin Pharmacol* 2, no. 4 (2002): 361-5.

*Pathol* 59, no. 8 (2006): 855-61.

Evidence of Anti-Atherogenic MicroRNAs. " *Frontiers in Bioscience* 17 (2011): 3133-

Durante, H. J. Pownall, and A. I. Schafer. "Hyperhomocysteinemia Accelerates Atherosclerosis in Cystathionine Beta-Synthase and Apolipoprotein E Double Knock-out Mice with and without Dietary Perturbation." *Blood* 101, no. 10 (2003):

Profiling by High-Density DNA Arrays to the Treatment and Study of

"Alternative Splicing Factor Asf/Sf2 Is Down Regulated in Inflamed Muscle." *J Clin* 

Singapore, Spain, United Kingdom, United States, 2000.

View of Disease." *BMC Genomics* 11 Suppl 4 (2010): S19.

W.B. Saunders Company, 1992.

*Atherosclerosis* 210, no. 2 (2010): 422-29.

*Biophys Acta* 1790, no. 9 (2009): 886-91.

(1995): 11810-3.

53.

45.

3901-7.

no. 4 (2000): 479-87.

*Fundamentals of Biostatistics*, edited by B. Rosner, 307-29. Australia, Canada, Mexico,

LH, Bennett, JC., 293-98. Philadelphia, London, Toronto, Montreal, Sydney, Tokyo:

Schobert, and M. Pfreundschuh. "Human Neoplasms Elicit Multiple Specific Immune Responses in the Autologous Host." *Proc Natl Acad Sci U S A* 92, no. 25

F. Yang. "Caspase-1 Recognizes Extended Cleavage Sites in Its Natural Substrates."

"Icod: An Integrated Clinical Omics Database Based on the Systems-Pathology

Feedback in Systemic Autoimmune Disease." *Nat Rev Immunol* 1, no. 2 (2001): 147-


King, J. Y., R. Ferrara, R. Tabibiazar, J. M. Spin, M. M. Chen, A. Kuchinsky, A. Vailaya, R.

Lewin, Benjamin. "Nuclear Splicing." In *Genes Vii*, edited by Benjamin Lewin. Cambridge:

Liang, M., A. W. Cowley, Jr., M. J. Hessner, J. Lazar, D. P. Basile, and J. L. Pietrusz.

Loza, M. J., C. E. McCall, L. Li, W. B. Isaacs, J. Xu, and B. L. Chang. "Assembly of

Maron, B. A., and J. Loscalzo. "The Treatment of Hyperhomocysteinemia." *Annu Rev Med* 60

Mount, DW. "Historical Introduction and Overview." In *Bioinformatics. Sequence and Genome* 

Naeem, H., R. Kuffner, G. Csaba, and R. Zimmer. "Mirsel: Automated Extraction of

Ng, B., F. Yang, D. P. Huston, Y. Yan, Y. Yang, Z. Xiong, L. E. Peterson, H. Wang, and X. F.

Palakal, M., J. Bright, T. Sebastian, and S. Hartanto. "A Comparative Study of Cells in

Pandey, R., R. K. Guru, and D. W. Mount. "Pathway Miner: Extracting Gene Association

Gene Expression Microarray Data." *Bioinformatics* 20, no. 13 (2004): 2156-8. Papadopoulos, G. L., M. Reczko, V. A. Simossis, P. Sethupathy, and A. G. Hatzigeorgiou.

Pinilla-Ibarz, J., K. Cathcart, and D. A. Scheinberg. "CML Vaccines as a Paradigm of the Specific Immunotherapy of Cancer." *Blood Rev* 14, no. 2 (2000): 111-20. Preuss, K. D., C. Zwick, C. Bormann, F. Neumann, and M. Pfreundschuh. "Analysis of the B-

Rasmussen, K. D., S. Simmini, C. Abreu-Goodger, N. Bartonicek, M. Di Giacomo, D. Bilbao-

Rink, C., and S. Khanna. "MicroRNA in Ischemic Stroke Etiology and Pathology." *Physiol* 

Rosero, S., V. Bravo-Egana, Z. Jiang, S. Khuri, N. Tsinoremas, D. Klein, E. Sabates, M.

Tarbase." *Nucleic Acids Res* 37, no. Database issue (2009): D155-8.

Oxford University Press Inc., New York, 2000.

(2005): 103-18.

*Int* 67, no. 6 (2005): 2114-22.

Harbor Laboratory Press, 2004.

*Bioinformatics* 11 (2010): 135.

no. 6 (2004): 1463-70.

14, no. 1 (2007): 67-85.

188 (2002): 43-50.

*Genomics* (2010).

(2010): 509.

1351-8.

no. 10 (2007): e1035.

(2009): 39-54.

Kincaid, A. Tsalenko, D. X. Deng, A. Connolly, P. Zhang, E. Yang, C. Watt, Z. Yakhini, A. Ben-Dor, A. Adler, L. Bruhn, P. Tsao, T. Quertermous, and E. A. Ashley. "Pathway Analysis of Coronary Atherosclerosis." *Physiol Genomics* 23, no. 1

"Transcriptome Analysis and Kidney Research: Toward Systems Biology." *Kidney* 

Inflammation-Related Genes for Pathway-Focused Genetic Analysis." *PLoS One* 2,

*Analysis*, edited by DW Mount, 1-27. Cold Spring Harbor, New York: Cold Spring

Associations between Micrornas and Genes from the Biomedical Literature." *BMC* 

Yang. "Increased Noncanonical Splicing of Autoantigen Transcripts Provides the Structural Basis for Expression of Untolerized Epitopes." *J Allergy Clin Immunol* 114,

Inflammation, Eae and Ms Using Biomedical Literature Data Mining." *J Biomed Sci*

Networks from Molecular Pathways for Predicting the Biological Significance of

"The Database of Experimentally Supported Targets: A Functional Update of

Cell Repertoire against Antigens Expressed by Human Neoplasms." *Immunol Rev*

Cortes, R. Horos, M. Von Lindern, A. J. Enright, and D. O'Carroll. "The MiR-144/451 Locus Is Required for Erythroid Homeostasis." *J Exp Med* 207, no. 7 (2010):

Correa-Medina, C. Ricordi, J. Dominguez-Bendala, J. Diez, and R. L. Pastori. "MicroRNA Signature of the Human Developing Pancreas." *BMC Genomics* 11, no. 1



### **Data Mining Pubmed Identifies Core Signalings and miRNA Regulatory Module in Glioma**

Chunsheng Kang<sup>1</sup> et al.\*

<sup>1</sup>*Department of Neurosurgery, Laboratory of Neuro-Oncology, Tianjin Medical University General Hospital, Tianjin Key Laboratory of Nerve Injury, Variation and Regeneration, Tianjin, China* 

#### **1. Introduction**


Glioblastoma multiforme (GBM) is the most common form of malignant brain cancer and persists as a serious clinical and scientific problem. The current standard of therapy for GBM patients, which includes surgery, radiotherapy and chemotherapy with temozolomide, produces a median survival of only 14.6 months (Stupp et al., 2005). New interventions are increasingly being tested, particularly inhibitors of neo-angiogenesis and growth factor receptors, and high-throughput profiling studies are leading to the discovery of novel genetic alterations and signaling pathways. The Cancer Genome Atlas Network recently catalogued recurrent genomic abnormalities in GBM, proposed a molecular classification of GBM into Proneural, Neural, Classical, and Mesenchymal subtypes, and integrated multidimensional genomic data to establish patterns of somatic mutations and DNA copy number (Verhaak et al., 2010). In recent years, microRNAs (miRNAs), small noncoding RNA molecules, have been implicated in the progression of various human cancers and used as molecular markers of cancer. In glioma, miR-21, miR-221, miR-222, miR-181a and miR-125b have been shown to play critical roles in gliomagenesis and have been proposed as novel targets for antiglioma therapies (Shi et al., 2008; Shi et al., 2010; Zhang et al., 2009b; Zhang et al., 2010c; Zhou et al., 2010a; Zhou et al., 2010b). Thus, the molecular regulation of glioma is complex, remains unclear, and is still under investigation.

Biomedical literature is growing at a double-exponential pace, with approximately 20 million publications in MEDLINE. To date, more than 50,000 glioma-related publications have been indexed in MEDLINE (PubMed query: glioma). Thus, a wealth of information is embedded in the literature, waiting to be discovered and extracted. Literature mining is a promising strategy for using this untapped information for knowledge discovery and has been applied successfully to various biological problems, including the discovery and characterization of molecular interactions (protein-protein, gene-protein, gene-drug, protein sorting and molecular binding) (Friedman et al., 2001; Rindflesch et al., 2000; Sekimizu et al., 1998). As no searchable records are available to efficiently retrieve information relevant to the molecular network in glioma, we extracted glioma-related genes and miRNAs by data mining PubMed abstracts and established a glioma-associated network based on these genes and miRNAs to identify key signaling pathways and an miRNA regulatory module in glioma.

<sup>\*</sup> Junxia Zhang<sup>1</sup>, Yingyi Wang<sup>2</sup>, Ning Liu<sup>2</sup>, Jilong Liu<sup>3</sup>, Huazong Zeng<sup>3</sup>, Tao Jiang<sup>4</sup>, Yongping You<sup>2</sup> and Peiyu Pu<sup>1</sup>

<sup>2</sup>*Department of Neurosurgery, Jiangsu Provincial People's Hospital, Nanjing, China;* <sup>3</sup>*Shanghai Sensichip Co Ltd, Shanghai, China;* <sup>4</sup>*Department of Neurosurgery, Tiantan Hospital, Capital Medical University, Beijing, China*

#### **2. Results**

#### **2.1 Identification of glioma-related genes and miRNAs**

For glioma, we queried PubMed with: glioma[title] AND ("1980/01/01"[PDAT] : "2010/04/01"[PDAT]). This led to the identification of a total of 670 genes and 14 miRNAs associated with glioma. The top 10 glioma-related genes are listed in Table 1. The 14 glioma-related miRNAs were miR-21, miR-34a, miR-221, miR-222, miR-10b, miR-125b, miR-128, miR-146b, miR-15b, miR-181a, miR-196a, miR-26a, miR-451 and miR-9. Additionally, we scored the journals describing these genes and miRNAs; the top 10 journals are listed in Table 2.
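The retrieval and counting step can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: it assumes the query is submitted through the public NCBI E-utilities ESearch endpoint and that gene mentions are counted by whole-word matching. The function names `esearch_url` and `count_gene_mentions` and the toy abstracts are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlencode

# Query string quoted in the text above.
QUERY = 'glioma[title] AND ("1980/01/01"[PDAT] : "2010/04/01"[PDAT])'

# NCBI E-utilities ESearch endpoint (no request is sent in this sketch).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term: str, retmax: int = 100) -> str:
    """Build an ESearch URL that would return PubMed IDs matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


def count_gene_mentions(abstracts, gene_symbols):
    """Count, per symbol, the number of abstracts that mention it at least once."""
    counts = Counter()
    for text in abstracts:
        for gene in gene_symbols:
            # Whole-word match, so "ERK" does not match "ERK5" or "PERK".
            pattern = r"(?<![A-Za-z0-9])" + re.escape(gene) + r"(?![A-Za-z0-9])"
            if re.search(pattern, text):
                counts[gene] += 1
    return counts


# Hypothetical toy abstracts, for illustration only.
toy_abstracts = [
    "EGFR amplification and PTEN loss are frequent in glioma.",
    "VEGF drives angiogenesis; EGFR signaling is also active.",
]
toy_counts = count_gene_mentions(toy_abstracts, ["EGFR", "VEGF", "PTEN", "ERK"])
```

A real run would additionally need a gene-name dictionary with synonyms, since symbols such as TRAIL or CD95 appear in the literature under several aliases.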


| Gene | PubMed Count |
|---|---|
| EGFR | 130 |
| VEGF | 123 |
| GFAP | 87 |
| TRAIL | 71 |
| CD95 | 52 |
| JNK | 52 |
| ERK | 49 |
| IFN | 48 |
| PTEN | 47 |
| NGF | 47 |

Table 1. The top 10 glioma-related genes.

| Journal | Count |
|---|---|
| Cancer Res. | 133 |
| J. Neurooncol. | 103 |
| Oncogene | 53 |
| J. Neurochem. | 46 |
| J. Neurosurg. | 38 |
| Int. J. Cancer | 37 |
| J. Biol. Chem. | 36 |
| Biochem. Biophys. Res. Commun. | 33 |
| Clin. Cancer Res. | 33 |

Table 2. The top 10 journals describing glioma-related genes and miRNAs.

#### **2.2 Biological function of glioma-related genes**

To better understand the biological role of glioma-related genes, the catalogued genes were analyzed using Gene Ontology (GO) terms and pathway analysis. GO provides a structured and controlled ontology for describing gene products in terms of their associated molecular function, biological process, or cellular component in a species-independent manner. The molecular function enrichment revealed that 22 GO terms were significantly enriched and that most glioma-associated genes encode products with protein binding activity. In the biological process category, the genes mainly participated in signal transduction, response to stress, cell differentiation and regulation of cell proliferation. Finally, the cellular component category showed that the products of these genes were active mainly at the plasma membrane (Table 3).


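| Category | GO term |
|---|---|
| molecular function | protein binding, protein dimerization activity, signal transducer activity, cytokine activity, enzyme binding, growth factor activity, growth factor binding, receptor activity, protein kinase activity, glycosaminoglycan binding, kinase binding, enzyme regulator activity, G-protein-coupled receptor binding, carbohydrate binding, peptide receptor activity, collagen binding, polysaccharide binding, kinase activity, ATP binding, enzyme inhibitor activity, transmembrane receptor protein tyrosine kinase activity, adenyl nucleotide binding |
| biological process | response to stress, regulation of cell proliferation, cell differentiation, regulation of phosphorylation, regulation of immune system process, negative regulation of cell proliferation, cell proliferation, antiapoptosis, apoptosis, neurogenesis, response to hormone stimulus, hemopoiesis, regulation of protein kinase activity, immune response, signal transduction, inflammatory response, cell migration, angiogenesis, cell communication, phosphorylation, gliogenesis, cell cycle process |
| cellular component | intrinsic to plasma membrane, integral to plasma membrane, plasma membrane, extracellular region, cell projection, vesicle, nucleoplasm, cytosol, cell soma, secretory granule, platelet alpha granule, transcription factor complex, apical plasma membrane |

Over-represented GO terms were identified after multiple testing adjustments (P-value<0.05).

Table 3. Set of GO terms with highly enriched genes.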

To further explore the pathways in which these genes are involved, we searched the KEGG database for their pathway information. Sixteen pathways with a P-value below 0.01 were retained (Table 4). The most highly enriched pathways were the p53 signaling pathway (27 genes) and the Toll-like receptor signaling pathway (33 genes).
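The chapter does not state which statistical test or correction underlies the GO and KEGG enrichment P-values. As a hedged illustration, the sketch below assumes a one-sided hypergeometric test followed by Benjamini-Hochberg adjustment, a common choice for over-representation analysis. The function names and the background size in the usage note are assumptions, not values from this study.

```python
from math import comb


def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k): chance of seeing at least k annotated genes in a list of n
    genes drawn from a background of N genes, K of which carry the annotation."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

For example, `hypergeom_pvalue(27, 670, K, N)` would score a pathway with 27 hits among the 670 glioma-related genes, where the pathway size `K` and background size `N` must be supplied by the analyst; the adjusted values from `benjamini_hochberg` can then be compared against the chosen cutoff.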

#### **2.3 Interaction network of glioma-related genes**

To uncover the potential interaction networks or synergistic effects of these glioma-related genes, we used each gene set as a query and searched for interaction partners in the STRING database. STRING integrates different public databases containing information on direct and indirect functional protein–protein associations by benchmarking them against a common reference set, the KEGG pathway database. In total, 204 genes had interactions in STRING. We next connected these genes into a network to identify biologically informative linker genes that were statistically enriched for connections to members of the glioma-related gene list. As summarized in Figure 1A, three query genes, PIK3CA, PIK3CB and JAK2, served as "hubs" (labeled with red circles); a hub has high connectivity, which is an indicator of essentiality in a network. Notably, further analysis found that PIK3CA, PIK3CB and JAK2 were associated with signal transduction, the MAPK pathway, growth factors, cell apoptosis, cell proliferation, cell adhesion and cell migration (Figure 1B). Given that PIK3CA and PIK3CB encode the PI3K catalytic subunits p110α and p110β, respectively, these data suggest that PI3K and JAK2 signaling are promising biomarkers for glioma aggressiveness.


| Pathway | Count | Enrichment P-value |
|---|---|---|
| p53 signaling pathway | 27 | 0 |
| Toll-like receptor signaling pathway | 33 | 0 |
| Apoptosis | 30 | 9.80E-10 |
| Cytokine-cytokine receptor interaction | 65 | 2.48E-09 |
| Glioma | 23 | 8.80E-08 |
| MAPK signaling pathway | 51 | 1.24E-06 |
| ErbB signaling pathway | 24 | 1.02E-05 |
| Focal adhesion | 40 | 1.12E-05 |
| Cell cycle | 28 | 1.28E-05 |
| T cell receptor signaling pathway | 27 | 1.47E-05 |
| Adipocytokine signaling pathway | 20 | 3.11E-05 |
| Chemokine signaling pathway | 38 | 3.22E-05 |
| Neurotrophin signaling pathway | 29 | 4.10E-05 |
| VEGF signaling pathway | 18 | 0.003731 |
| Adherens junction | 18 | 0.003731 |

Over-represented KEGG pathways were identified after multiple testing adjustments (P-value<0.05).

Table 4. Set of signaling pathways with highly enriched genes.

Fig. 1. Visualization of glioma-related gene interaction network.

(A) Connectivity analysis was performed using the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) to generate the glioma-related, knowledge-driven gene network, as described in Methods. The analysis revealed that PIK3CA, PIK3CB and JAK2 are hub genes (P-value<0.001) with an influential role in network stability. (B) PI3K and JAK2 hub signaling occupies a key position in the glioma-related, knowledge-driven gene network and affects a wide range of biological functions and pathways, including signal transduction, the MAPK pathway, growth factors, cell apoptosis, cell proliferation, cell adhesion and cell migration. Purple lines correspond to activation, blue lines to inhibition, and yellow lines to association. Red circles mark the hub genes (PIK3CA, PIK3CB and JAK2).
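The notion of a hub as a highly connected node can be made concrete with a degree count over an interaction list. The sketch below is illustrative only: it assumes the interactions are available as an undirected edge list and ranks nodes by degree, whereas the statistical test behind the reported hub P-values is not specified here. The node names and the helper functions `degree_table` and `top_hubs` are hypothetical.

```python
from collections import defaultdict


def degree_table(edges):
    """Map each node to its number of distinct neighbours (undirected network)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {node: len(nbrs) for node, nbrs in neighbours.items()}


def top_hubs(edges, k=3):
    """Return the k nodes with the highest degree (ties broken alphabetically)."""
    degrees = degree_table(edges)
    return sorted(degrees, key=lambda node: (-degrees[node], node))[:k]


# Hypothetical edge list with placeholder node names; not taken from STRING.
toy_edges = [
    ("HUB", "A"), ("HUB", "B"), ("HUB", "C"),
    ("A", "B"), ("C", "D"),
]
toy_hubs = top_hubs(toy_edges, k=2)
```

Degree is the simplest connectivity measure; a fuller analysis might compare each node's degree with that expected in randomized networks before calling it a hub.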

#### **2.4 Glioma-related miRNA pathway**

Because each miRNA target prediction program uses a different computer-aided algorithm for prediction, encompassing all these methods will probably produce a more reliable model

Data Mining Pubmed Identifies Core Signalings and miRNA Regulatory Module in Glioma 163

The network was visualized with Medusa software. Blue quadrangles represent gliomarelated miRNAs. Red circles represent miRNA targets that were overlapped by glioma-

The overall utility of our data mining approach, including the strategy for constructing interaction networks, is to explore biological mechanisms involved in glioma progression. In this study, we obtained 670 genes and 14 miRNAs that interacted with glioma and generated interaction networks from abstract-based text mining. Importantly, our analysis


of target prediction. Thus, a union list of target genes for the 14 glioma-related miRNAs was generated from 3 target prediction programs (PicTar, TargetScan and miRanda). To further explore the signaling pathways among these target genes, pathway analysis was performed. Table 5 shows that the p53 signaling pathway, Apoptosis, Focal adhesion, MAPK signaling pathway, Toll-like receptor signaling pathway and Cell cycle pathways were significantly over-represented. All 6 of these pathways were also among the pathways of glioma-related genes. These findings imply that glioma-related genes and miRNAs act on a common set of signaling pathways.

| Pathway | Count | Enrichment P-value |
|---|---|---|
| p53 signaling pathway | 19 | 8.67E-09 |
| Apoptosis | 21 | 2.4E-08 |
| Focal adhesion | 26 | 0.000123 |
| MAPK signaling pathway | 30 | 0.000499 |
| Toll-like receptor signaling pathway | 15 | 0.005773 |
| Cell cycle | 16 | 0.006547 |

Over-represented KEGG pathways were identified after multiple testing adjustments (P-value<0.05).

Table 5. Set of signaling pathways with highly enriched microRNA targets.

In order to construct the network between glioma-related miRNAs and the signaling pathways, an integrated analysis of the targets of glioma-related miRNAs was performed. This procedure yielded 6 miRNA-pathway networks. For instance, the p53 signaling pathway network contained 12 miRNAs (miR-21, miR-34a, miR-221, miR-222 and others) and 19 genes (PTEN, CDK6, BBC3 and others) (Fig. 2).

Fig. 2. Visualization of miRNA-p53 pathway network in glioma.

The network was visualized with Medusa software. Blue quadrangles represent glioma-related miRNAs. Red circles represent miRNA targets that overlapped with glioma-related genes.

#### **3. Discussion**

The overall utility of our data mining approach, including the strategy for constructing interaction networks, is to explore the biological mechanisms involved in glioma progression. In this study, we obtained 670 genes and 14 miRNAs associated with glioma and generated interaction networks from abstract-based text mining. Importantly, our analysis identified PI3K and JAK2 as signaling hubs and revealed a miRNA regulatory module in glioma.

#### **3.1 Core signalings in glioma**

By integrating PubMed text mining, homology prediction, gene neighbor, protein-protein interaction, gene fusion and other data sources, we constructed a knowledge-driven network of glioma-related genes. Further analysis revealed that the PI3K and JAK2 signaling hubs, which had an influential role in network stability, occupied key positions in this network. These signaling hubs had wide-ranging effects on biological functions and pathways, including signal transduction, the MAPK pathway, growth factors, cell apoptosis, cell proliferation, cell adhesion and cell migration. Further, integrated GO and pathway analysis revealed that uncontrolled proliferation and invasive growth are the essential characteristics of glioma.

PI3Ks are heterodimers composed of a regulatory subunit (p85) and a catalytic subunit (p110). Activated receptor tyrosine kinases recruit the PI3K complex to the membrane via the p85 regulatory subunit, thereby activating the catalytic subunit p110, which then phosphorylates phosphatidylinositol-4,5-bisphosphate (PIP2) to phosphatidylinositol-3,4,5-trisphosphate (PIP3). PIP3 recruits AKT to the plasma membrane, where AKT is phosphorylated at Thr308 and Ser473 (Cheng et al., 2009). A high frequency of mutations in PIK3CA, the gene encoding the p110alpha subunit of PI3K, was found in glioblastoma (Gallia et al., 2006; Kita et al., 2007). Our recent data showed that PI3K activity increased markedly with ascending tumor grade and correlated positively with AKT2 expression (Wang et al., 2010). Activation of the PI3K/AKT signaling cascade results in cell survival and proliferation as well as inhibition of apoptosis through the regulation of downstream targets. AKT contributes to glioma cell migration and invasion by regulating cytoskeleton formation and by influencing adhesion and MMP2/9 expression (Pu et al., 2004; Zhang et al., 2009a; Zhang et al., 2010d). AKT promotes cell cycle progression by suppressing the cyclin-dependent kinase inhibitors p21 and p27 and by increasing Cyclin D1 (Guillard et al., 2009; Koul et al., 2010; Pu et al., 2006). AKT inhibits apoptosis by inactivating the caspase pathway and activating the BCL2, NFκB and mTOR signaling cascades (Jiang et al., 2009; Ruano et al., 2008; Zhang et al., 2010d). Further, prosurvival signaling by PI3K contributes to therapeutic resistance in the setting of established antiglioma therapies. Several studies have shown that PI3K inhibition sensitizes glioma cells to radiation and chemotherapy (Opel et al., 2008; Prevo et al., 2008). Additionally, our recent study showed that co-suppression of PI3K and AKT significantly inhibits the proliferation and invasion of glioma cells (Fu et al., 2009). In the current study, we found that PI3K is a molecular hub in the knowledge-driven network of glioma-related genes and is associated with a wide variety of cell biological functions and signaling pathways. Therefore, there is an urgent need to develop novel therapies targeting PI3K/AKT signaling for glioma treatment.


In our case, network analysis also identified JAK2 as a new candidate hub gene in glioblastoma. The JAKs, which have four members in mammals (JAK1, JAK2, JAK3 and tyrosine kinase 2 (Tyk2)), are non-receptor tyrosine kinases involved in upstream intracellular signaling pathways that become activated after extracellular ligand binding to a variety of cytokine and growth-factor receptors (Pesu et al., 2008). JAK2 phosphorylates members of the signal transducers and activators of transcription (STAT) protein family, which subsequently translocate to the nucleus and bind to specific DNA sequences in the promoters of multiple responsive genes (Ghoreschi et al., 2009; Rane and Reddy, 2000). The STAT family has been reported to be involved in the development of glioma. Of note, STAT3 is aberrantly activated in human glioblastoma tissues, and this activation is implicated in controlling critical cellular events thought to be involved in gliomagenesis, such as cell cycle progression, apoptosis and angiogenesis (Brantley and Benveniste, 2008). Recently, a glioma-specific regulatory network revealed the transcriptional module that activates expression of mesenchymal genes in malignant glioma, and STAT3 is one of the key transcription factors necessary in human glioma cells for mesenchymal transformation (Carro et al., 2010). Additionally, nuclear staining of phospho-STAT5 is elevated in glioma tissues, and cytoplasmic staining of STAT5b is markedly increased in glioblastoma multiforme compared with normal brain (Kondyli et al., 2010; Liang et al., 2009). Reduction of STAT5b inhibits glioma cell growth, cell cycle progression, invasion and migration through regulation of the expression of genes such as Bcl-2, p21, p27 and VEGF (Liang et al., 2009). As another member of the STAT family, STAT1 is upregulated in the majority of glioblastomas (Haybaeck et al., 2007). Little evidence exists regarding the mechanism by which JAK2 (an upstream regulator of the STAT family) is involved in gliomagenesis. However, data mining analysis shows that JAK2 occupies a core regulatory node of the knowledge-driven network of glioma-related genes. These data indicate that clarifying the role of JAK2 in glioma would help to elucidate the development of glioma, and that inhibition of JAK2/STAT signaling could be used as a new therapeutic strategy to treat glioma. The JAK/STAT pathway plays a central role in principal cell fate decisions, regulating the processes of innate immunity, adaptive immunity, cell proliferation, differentiation, and apoptosis.

In addition, we found that the gene CTNNB1 (encoding β-catenin), located at the lower right corner of Fig. 1, warrants further investigation. β-catenin and Tcf-4 are the core components of the canonical Wnt/β-catenin/Tcf pathway, which is a crucial factor in the development of many cancers (MacDonald et al., 2009; Ying and Tao, 2009). β-catenin accumulates in the nucleus, where it interacts with coregulators of transcription including Tcf-4 and Lef-1 to form a β-catenin/Tcf/Lef complex. This complex regulates transcription of multiple genes involved in cellular proliferation, differentiation, survival and apoptosis, including Fra-1, c-Myc and Cyclin D (Wang et al., 2002; Yochum et al., 2008). Recently, several reports have shown that aberrant activation of the Wnt/β-catenin/Tcf signaling pathway is an important contributing factor in gliomas (Liu et al., 2010; Pu et al., 2009; Sareddy et al., 2009). β-catenin and Tcf-4 were up-regulated in glioma tissues in comparison with normal brain tissues. Knockdown of β-catenin by siRNA in human glioma cells inhibited cell proliferation and invasive ability, induced apoptotic cell death and delayed tumor growth (Pu et al., 2009). However, up to now, little direct evidence exists regarding the mechanism by which β-catenin and Tcf-4 are involved in gliomagenesis.

Our data do not fully agree with the recent results of The Cancer Genome Atlas Network (TCGA) (Verhaak et al., 2010). TCGA catalogs recurrent genomic abnormalities in GBM and describes a gene expression-based molecular classification of GBM into Proneural, Neural, Classical, and Mesenchymal subtypes. Aberrations and gene expression of EGFR, NF1, and PDGFRA/IDH1 define the Classical, Mesenchymal, and Proneural subtypes, respectively. Despite the differences between the two studies, our data offer another approach to exploring the mechanisms involved in glioma using existing data.

#### **3.2 MiRNA regulatory module in glioma**


miRNAs are a new class of small, non-coding RNAs located in non-coding regions or introns of the genome that regulate gene expression by binding to the 3'-untranslated region (3'-UTR) of specific mRNAs. Extensive studies have indicated that miRNAs can function as oncogenic miRNAs or tumor suppressor miRNAs, playing crucial roles in carcinogenesis. Expression profiling of glioma has unveiled miRNA signatures that not only distinguish glioma from normal tissues, but can also differentiate histotypes or molecular subtypes with altered genetic pathways (Ciafre et al., 2005; Lavon et al., 2010). Our data mining analysis showed that 6 pathways were associated with the 14 glioma-related miRNAs, in line with the pathway analysis of glioma-related genes, indicating that glioma-related genes and miRNAs act on a common set of signaling pathways. Moreover, we found that the regulatory control mediated by miRNAs differs from pathway to pathway and that the targets of a specific miRNA are significantly enriched in multiple pathways. The p53 signaling pathway network involves 12 miRNAs and 19 genes. Among these miRNA-target relationships, MDM2, CDK6, CDKN2A and CCNE1 have been identified as direct targets of miR-221, miR-34a, miR-125b and miR-15b, respectively (Kim et al., 2010; Pogue et al., 2010; Sun et al., 2008; Xia et al., 2009). Our recent data showed that miR-221 and miR-222 directly modulate PTEN expression by targeting the PTEN 3'-UTR (Zhang et al., 2010a). In addition, we have also shown that BBC3, also named p53 upregulated modulator of apoptosis (PUMA), is a new target of miR-221, consistent with the bioinformatics analysis (Zhang et al., 2010b). Further, a recent publication revealed that miR-21 can impair p53-mediated apoptosis in response to chemotherapeutic (doxorubicin)-induced DNA damage, thereby contributing to drug resistance in glioblastoma cells (Papagiannakopoulos et al., 2008). Thus, modulation of these p53-related targets by miR-21 may explain the previous observation that the p53 signaling pathway was up-regulated in response to miR-21 knockdown (Frankel et al., 2008). These results prompt us to further elucidate the intricate interactions between miRNAs and the signaling pathway.

In conclusion, using data mining analysis, we constructed a knowledge-driven network of glioma-related genes, showed that the PI3K and JAK2 signaling hubs are key steps leading to oncogenesis in glioma, and further proposed a miRNA regulatory module in glioma. These data demonstrate the power of data mining strategies as tools for biological discovery and identify core signaling pathways and a miRNA regulatory module in glioma, suggesting that applying this strategy to consolidate all existing data for other diseases may yield important discoveries in disease pathogenesis.

#### **4. Experimental procedures**

#### **4.1 Natural language processing (NLP) system**

Medline/PubMed was used as the information source for bioinformatics text mining. Medline abstracts were retrieved using the National Center for Biotechnology Information (NCBI) PubMed portal. We queried PubMed with: glioma[title] AND ("1980/01/01"[PDAT] : "2010/04/01"[PDAT]). All abstracts were downloaded as HTML text without images and then converted into XML documents. Sentence tokenization was performed with LingPipe tools, and subsequent analysis used the sentence as the basic unit. Gene mentions were tagged using ABNER (Settles, 2005). To resolve the plethora of gene aliases, all gene mentions were normalized to Entrez Gene (http://www.ncbi.nlm.nih.gov/Entrez/) official gene symbols. Only sentences that mentioned both glioma and a gene were selected.
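For readers who wish to repeat the retrieval programmatically, the same query can also be sent to the NCBI E-utilities `esearch` endpoint instead of the web portal. The following is a minimal sketch and not the procedure used in this study: it only builds the request URL, and the helper name `build_esearch_url` and the default `retmax` value are illustrative assumptions.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# NCBI E-utilities search endpoint (an alternative to the web portal used in this study).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query string quoted from the text.
GLIOMA_QUERY = 'glioma[title] AND ("1980/01/01"[PDAT] : "2010/04/01"[PDAT])'


def build_esearch_url(term, retmax=100):
    """Return an esearch URL for PubMed; `retmax` caps the number of returned IDs."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


url = build_esearch_url(GLIOMA_QUERY)

# Round-trip check: decoding the URL recovers the original query term.
decoded_term = parse_qs(urlparse(url).query)["term"][0]
print(decoded_term == GLIOMA_QUERY)
```

Only the URL is constructed here; fetching it, paging through results and retrieving the abstracts are left out of the sketch.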

To test the null hypothesis that 'the relationship between glioma and the gene is random', a hypergeometric test was employed. Let N be the total number of PubMed abstracts, let m and n be the numbers of abstracts mentioning glioma and a given gene, respectively, and let k be the number of abstracts mentioning both. The P-value is

$$p = 1 - \sum_{i=0}^{k-1} p(i \mid n, m, N)$$

where

$$p(i \mid n, m, N) = \frac{n!\,(N-n)!\,m!\,(N-m)!}{(n-i)!\,i!\,(m-i)!\,(N-n-m+i)!\,N!}$$

The "glioma-gene" relations with P-value<0.05 were then collected and stored in a relational database for further analysis.
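The hypergeometric co-occurrence test can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: it writes the hypergeometric probability in its standard binomial-coefficient form, C(m, i) C(N-m, n-i) / C(N, n), assumes that k is the observed number of abstracts mentioning both glioma and the gene, and uses hypothetical function names and toy counts.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(i, n, m, N):
    """Probability of exactly i co-mentions: C(m, i) * C(N - m, n - i) / C(N, n)."""
    if i < 0 or i > min(n, m) or n - i > N - m:
        return Fraction(0)
    return Fraction(comb(m, i) * comb(N - m, n - i), comb(N, n))


def cooccurrence_p_value(k, n, m, N):
    """Upper-tail probability P(X >= k) = 1 - sum_{i=0}^{k-1} p(i | n, m, N)."""
    lower_tail = sum(hypergeom_pmf(i, n, m, N) for i in range(k))
    return float(1 - lower_tail)


# Toy counts (not from the study): 10 abstracts in total, 4 mention glioma,
# 3 mention the gene, and 2 mention both.
p = cooccurrence_p_value(k=2, n=3, m=4, N=10)
print(round(p, 4))  # 0.3333
```

Exact rational arithmetic keeps the subtraction from one free of rounding error for small toy inputs. For counts on the scale of the whole of PubMed, a dedicated statistics library would be the more practical choice.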

#### **4.2 Gene ontology analysis**

Gene ontology analysis was performed with the GSEABase package of Bioconductor (http://www.bioconductor.org/). A gene set enrichment analysis of the glioma-related genes was performed based on the gene ontology (GO) categories.

#### **4.3 Pathway analysis**

Expression Analysis Systematic Explorer (EASE) (Hosack et al., 2003) was used to analyze KEGG pathways. Over-representation of genes in a KEGG pathway is present if a larger fraction of genes within that pathway is differentially expressed compared with all genes in the genome. The "glioma-gene" relationships retrieved by our NLP system were filtered by pathway enrichment analysis. The links between glioma and related genes were then visualized in Cytoscape software (Cline et al., 2007) (http://www.cytoscape.org/). Genes were grouped according to pathways. Genes involved in multiple pathways were assigned to the single pathway with the smallest enrichment P-value.
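The rule for genes that belong to several pathways can be illustrated with a short sketch. The function name, gene labels, pathway labels and P-values below are hypothetical placeholders chosen for illustration; they are not results from this study.

```python
def assign_to_best_pathway(gene_pathways, pathway_p):
    """Map each gene to the one pathway, among its memberships, with the smallest P-value."""
    return {
        gene: min(pathways, key=pathway_p.__getitem__)
        for gene, pathways in gene_pathways.items()
        if pathways  # genes without any pathway membership are skipped
    }


# Hypothetical placeholder data, for illustration only.
pathway_p = {"pathway_A": 1e-8, "pathway_B": 5e-4, "pathway_C": 6e-3}
gene_pathways = {
    "GENE1": ["pathway_B", "pathway_A"],
    "GENE2": ["pathway_C"],
    "GENE3": ["pathway_C", "pathway_B"],
}

assignment = assign_to_best_pathway(gene_pathways, pathway_p)
print(assignment)
```

In this toy input, GENE1 is placed in pathway_A and GENE3 in pathway_B, because those are the most significantly enriched pathways among their memberships.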

#### **4.4 Gene network analysis**

Integrating PubMed text mining, homology prediction, gene neighbor, protein-protein interaction, gene fusion and other data sources through the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING), we created a knowledge-driven network of glioma-related genes (von Mering et al., 2005). Linker genes below a P-value threshold of 0.01 were identified as "hubs". The search results were saved in data files describing links between two genes and then handled in Medusa software.

#### **4.5 Target gene prediction and pathway analysis of miRNAs**

Three computational tools, TargetScan v5.1 (http://www.targetscan.org/), miRanda v5 (http://microrna.sanger.ac.uk/) and PicTar ver. March 26, 2007 (http://pictar.mdc-berlin.de/), were used to identify miRNA targets in the 3'-UTR of genes. The union of these results was listed for further analysis. These targets were used to analyze KEGG pathways.
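Taking the union of the predictions amounts to a per-miRNA set union across the three tools. The sketch below assumes that each tool's output has already been parsed into a mapping from miRNA name to a set of gene symbols; the function name and the toy predictions are hypothetical.

```python
def union_of_targets(*predictions):
    """Merge per-tool {miRNA: set of target genes} mappings by set union."""
    merged = {}
    for tool_result in predictions:
        for mirna, targets in tool_result.items():
            merged.setdefault(mirna, set()).update(targets)
    return merged


# Hypothetical toy predictions standing in for parsed tool output.
targetscan = {"miR-X": {"GENE1", "GENE2"}}
miranda = {"miR-X": {"GENE2", "GENE3"}, "miR-Y": {"GENE4"}}
pictar = {"miR-Y": {"GENE1"}}

union = union_of_targets(targetscan, miranda, pictar)
print({mirna: sorted(genes) for mirna, genes in union.items()})
```

A union favors sensitivity: a gene predicted by any one of the tools is retained. An intersection would be the stricter alternative.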

#### **4.6 MiRNA-target network analysis**

The overlap between the target genes of glioma-related miRNAs predicted by the computational tools and the glioma-related genes derived from NLP analysis was calculated. A bipartite network of miRNAs and their corresponding target genes was constructed. The network was displayed separately for each pathway.
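The overlap and the per-pathway bipartite networks can be sketched as edge lists. This is a minimal illustration under the assumption that predicted targets, glioma-related genes and pathway memberships are all available as Python sets; all names and data are hypothetical.

```python
def mirna_target_edges(mirna_targets, glioma_genes, pathway_genes):
    """Return {pathway: sorted (miRNA, gene) edges}, keeping only predicted targets
    that are also glioma-related genes and members of that pathway."""
    networks = {}
    for pathway, members in pathway_genes.items():
        edges = [
            (mirna, gene)
            for mirna, targets in mirna_targets.items()
            for gene in targets & glioma_genes & members
        ]
        networks[pathway] = sorted(edges)
    return networks


# Hypothetical toy inputs, for illustration only.
mirna_targets = {"miR-X": {"GENE1", "GENE2", "GENE5"}, "miR-Y": {"GENE2", "GENE4"}}
glioma_genes = {"GENE1", "GENE2", "GENE3", "GENE4"}
pathway_genes = {"pathway_A": {"GENE1", "GENE2"}, "pathway_B": {"GENE4", "GENE5"}}

networks = mirna_target_edges(mirna_targets, glioma_genes, pathway_genes)
for pathway, edges in networks.items():
    print(pathway, edges)
```

Each edge joins one miRNA node to one gene node, so every per-pathway edge list is bipartite by construction and can be written out for a network viewer.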

#### **5. Acknowledgments**

This work was supported by China National Natural Scientific Fund (30971136, 30872657, 81072078), Tianjin Science and Technology Committee (09JCZDJC17600), Program for New Century Excellent Talents in University (NCET-07-0615), Jiangsu Province's "333" Key Talent Foundation (0508RS08).

#### **6. References**

166 Bioinformatics – Trends and Methodologies

PubMed portal. We queried PubMed with: glioma[title] AND ("1980/01/01"[PDAT] : "2010/04/01"[PDAT]). All abstracts were downloaded as HTML text without images and then converted into XML documents. Sentence tokenization was performed with LingPipe tools, and all subsequent analysis uses the sentence as the basic unit. Gene mentions were tagged using ABNER (Settles, 2005). To resolve the plethora of gene aliases, all gene mentions were normalized to official Entrez Gene (http://www.ncbi.nlm.nih.gov/Entrez/) gene symbols. Only sentences mentioning both glioma and a gene were selected.

To test the null hypothesis 'the relationship between glioma and the gene is random', a hypergeometric distribution test was employed. Let N be the total number of PubMed abstracts, let m and n be the numbers of mentions in PubMed for glioma and a related gene, respectively, and let k be the observed number of abstracts mentioning both. The P-value is

$$p = 1 - \sum\_{i=0}^{k-1} p(i \mid n, m, N)$$

where

$$p(i \mid n, m, N) = \frac{n!\,(N-n)!\,m!\,(N-m)!}{(n-i)!\,i!\,(m-i)!\,(N-n-m+i)!\,N!}$$

The "glioma-gene" relations with P-value < 0.05 were then summarized and stored in a relational database for further analysis.

**4.2 Gene ontology analysis** 

Gene ontology analysis was performed with the GSEABase package of Bioconductor (http://www.bioconductor.org/). The glioma-related genes were subjected to a gene set enrichment analysis based on the gene ontology (GO) categories.

**4.3 Pathway analysis** 

Expression Analysis Systematic Explorer (EASE) (Hosack et al., 2003) was used to analyze KEGG pathways. Genes are over-represented in a KEGG pathway if a larger fraction of genes within that pathway is differentially expressed compared with all genes in the genome. The "glioma-gene" relationships retrieved by our NLP system were filtered by pathway enrichment analysis. The links between glioma and related genes were then visualized in Cytoscape software (Cline et al., 2007) (http://www.cytoscape.org/). Genes were grouped according to pathways. Genes involved in multiple pathways were assigned to the single pathway with the smallest enrichment P-value.

**4.4 Gene network analysis** 

By integrating PubMed text mining, homology prediction, gene neighbor, protein-protein interaction, gene fusion and other data sources through the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) (von Mering et al., 2005), we created a knowledge-driven network of glioma-related genes. Linker genes below a P-value threshold of 0.01 were identified as "hubs". The search results were saved in data files describing links between two genes and then handled in Medusa software.
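The hypergeometric test that scores each "glioma-gene" co-occurrence can be illustrated with a minimal Python sketch. The function names and the toy counts below are assumptions made for illustration only; they are not taken from the authors' pipeline, and real PubMed-scale counts would be much larger.

```python
from math import comb

def hypergeom_pmf(i, n, m, N):
    """P(exactly i co-mentions) = C(m, i) * C(N - m, n - i) / C(N, n).

    N: total abstracts, m: abstracts mentioning glioma,
    n: abstracts mentioning the gene (same roles as in the text).
    """
    if i < 0 or i > min(n, m):
        return 0.0
    # comb(a, b) returns 0 when b > a, which covers impossible configurations.
    return comb(m, i) * comb(N - m, n - i) / comb(N, n)

def cooccurrence_p_value(k, n, m, N):
    """Upper-tail P-value: p = 1 - sum_{i=0}^{k-1} p(i | n, m, N)."""
    return 1.0 - sum(hypergeom_pmf(i, n, m, N) for i in range(k))

# Toy numbers (hypothetical): 10 abstracts, 4 mention glioma, 3 mention the gene,
# and at least 1 abstract mentions both.
print(cooccurrence_p_value(k=1, n=3, m=4, N=10))
```

With exact integer binomials the sketch avoids factorial overflow, at the price of being slow for very large N.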


Gallia, GL, Rand, V, Siu, IM, Eberhart, CG, James, CD, Marie, SK, Oba-Shinjo, SM, Carlotti, CG, Caballero, OL, Simpson, AJ, Brock, MV, Massion, PP, Carson, BS, Sr., Riggins, GJ. (2006). PIK3CA gene mutations in pediatric and adult glioblastoma multiforme. *Mol Cancer Res*. Vol.4, NO.10, pp. 709-714.

Ghoreschi, K, Laurence, A, O'Shea, JJ. (2009). Janus kinases in immune cell signaling. *Immunol Rev*. Vol.228, NO.1, pp. 273-287.

Guillard, S, Clarke, PA, Te Poele, R, Mohri, Z, Bjerke, L, Valenti, M, Raynaud, F, Eccles, SA, Workman, P. (2009). Molecular pharmacology of phosphatidylinositol 3-kinase inhibition in human glioma. *Cell Cycle*. Vol.8, NO.3, pp. 443-453.

Haybaeck, J, Obrist, P, Schindler, CU, Spizzo, G, Doppler, W. (2007). STAT-1 expression in human glioblastoma and peritumoral tissue. *Anticancer Res*. Vol.27, NO.6B, pp. 3829-3835.

Hosack, DA, Dennis, G, Jr., Sherman, BT, Lane, HC, Lempicki, RA. (2003). Identifying biological themes within lists of genes with EASE. *Genome Biol*. Vol.4, NO.10, pp. R70.

Jiang, H, Shang, X, Wu, H, Gautam, SC, Al-Holou, S, Li, C, Kuo, J, Zhang, L, Chopp, M. (2009). Resveratrol downregulates PI3K/Akt/mTOR signaling pathways in human U251 glioma cells. *J Exp Ther Oncol*. Vol.8, NO.1, pp. 25-33.

Kim, D, Song, J, Jin, EJ. (2010). MicroRNA-221 regulates chondrogenic differentiation through promoting proteosomal degradation of slug by targeting mdm2. *J Biol Chem*.

Kita, D, Yonekawa, Y, Weller, M, Ohgaki, H. (2007). PIK3CA alterations in primary (de novo) and secondary glioblastomas. *Acta Neuropathol*. Vol.113, NO.3, pp. 295-302.

Kondyli, M, Gatzounis, G, Kyritsis, A, Varakis, J, Assimakopoulou, M. (2010). Immunohistochemical detection of phosphorylated JAK-2 and STAT-5 proteins and correlation with erythropoietin receptor (EpoR) expression status in human brain tumors. *J Neurooncol*.

Koul, N, Sharma, V, Dixit, D, Ghosh, S, Sen, E. (2010). Bicyclic triterpenoid Iripallidal induces apoptosis and inhibits Akt/mTOR pathway in glioma cells. *BMC Cancer*. Vol.10, NO.1, pp. 328.

Lavon, I, Zrihan, D, Granit, A, Einstein, O, Fainstein, N, Cohen, MA, Zelikovitch, B, Shoshan, Y, Spektor, S, Reubinoff, BE, Felig, Y, Gerlitz, O, Ben-Hur, T, Smith, Y, Siegal, T. (2010). Gliomas display a microRNA expression profile reminiscent of neural precursor cells. *Neuro Oncol*. Vol.12, NO.5, pp. 422-433.

Liang, QC, Xiong, H, Zhao, ZW, Jia, D, Li, WX, Qin, HZ, Deng, JP, Gao, L, Zhang, H, Gao, GD. (2009). Inhibition of transcription factor STAT5b suppresses proliferation, induces G1 cell cycle arrest and reduces tumor cell invasion in human glioblastoma multiforme cells. *Cancer Lett*. Vol.273, NO.1, pp. 164-171.

Liu, X, Wang, L, Zhao, S, Ji, X, Luo, Y, Ling, F. (2010). beta-Catenin overexpression in malignant glioma and its role in proliferation and apoptosis in glioblastma cells. *Med Oncol*.

MacDonald, BT, Tamai, K, He, X. (2009). Wnt/beta-catenin signaling: components, mechanisms, and diseases. *Dev Cell*. Vol.17, NO.1, pp. 9-26.

Opel, D, Westhoff, MA, Bender, A, Braun, V, Debatin, KM, Fulda, S. (2008). Phosphatidylinositol 3-kinase inhibition broadly sensitizes glioblastoma cells to death receptor- and drug-induced apoptosis. *Cancer Res*. Vol.68, NO.15, pp. 6271-6280.

Papagiannakopoulos, T, Shapiro, A, Kosik, KS. (2008). MicroRNA-21 targets a network of key tumor-suppressive pathways in glioblastoma cells. *Cancer Res*. Vol.68, NO.19, pp. 8164-8172.

Pesu, M, Laurence, A, Kishore, N, Zwillich, SH, Chan, G, O'Shea, JJ. (2008). Therapeutic targeting of Janus kinases. *Immunol Rev*. Vol.223, pp. 132-142.

Pogue, AI, Cui, JG, Li, YY, Zhao, Y, Culicchia, F, Lukiw, WJ. (2010). Micro RNA-125b (miRNA-125b) function in astrogliosis and glial cell proliferation. *Neurosci Lett*. Vol.476, NO.1, pp. 18-22.

Prevo, R, Deutsch, E, Sampson, O, Diplexcito, J, Cengel, K, Harper, J, O'Neill, P, McKenna, WG, Patel, S, Bernhard, EJ. (2008). Class I PI3 kinase inhibition by the pyridinylfuranopyrimidine inhibitor PI-103 enhances tumor radiosensitivity. *Cancer Res*. Vol.68, NO.14, pp. 5915-5923.

Pu, P, Kang, C, Li, J, Jiang, H. (2004). Antisense and dominant-negative AKT2 cDNA inhibits glioma cell invasion. *Tumour Biol*. Vol.25, NO.4, pp. 172-178.

Pu, P, Kang, C, Li, J, Jiang, H, Cheng, J. (2006). The effects of antisense AKT2 RNA on the inhibition of malignant glioma cell growth in vitro and in vivo. *J Neurooncol*. Vol.76, NO.1, pp. 1-11.

Pu, P, Zhang, Z, Kang, C, Jiang, R, Jia, Z, Wang, G, Jiang, H. (2009). Downregulation of Wnt2 and beta-catenin by siRNA suppresses malignant glioma cell growth. *Cancer Gene Ther*. Vol.16, NO.4, pp. 351-361.

Rane, SG, Reddy, EP. (2000). Janus kinases: components of multiple signaling pathways. *Oncogene*. Vol.19, NO.49, pp. 5662-5679.

Rindflesch, TC, Tanabe, L, Weinstein, JN, Hunter, L. (2000). EDGAR: extraction of drugs, genes and relations from the biomedical literature. *Pac Symp Biocomput*. pp. 517-528.

Ruano, Y, Mollejo, M, Camacho, FI, Rodriguez de Lope, A, Fiano, C, Ribalta, T, Martinez, P, Hernandez-Moneo, JL, Melendez, B. (2008). Identification of survival-related genes of the phosphatidylinositol 3'-kinase signaling pathway in glioblastoma multiforme. *Cancer*. Vol.112, NO.7, pp. 1575-1584.

Sareddy, GR, Panigrahi, M, Challa, S, Mahadevan, A, Babu, PP. (2009). Activation of Wnt/beta-catenin/Tcf signaling pathway in human astrocytomas. *Neurochem Int*. Vol.55, NO.5, pp. 307-317.

Sekimizu, T, Park, HS, Tsujii, J. (1998). Identifying the Interaction between Genes and Gene Products Based on Frequently Seen Verbs in Medline Abstracts. *Genome Inform Ser Workshop Genome Inform*. Vol.9, pp. 62-71.

Settles, B. (2005). ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. *Bioinformatics*. Vol.21, NO.14, pp. 3191-3192.

Shi, L, Cheng, Z, Zhang, J, Li, R, Zhao, P, Fu, Z, You, Y. (2008). hsa-mir-181a and hsa-mir-181b function as tumor suppressors in human glioma cells. *Brain Res*. Vol.1236, pp. 185-193.

Shi, L, Zhang, J, Pan, T, Zhou, J, Gong, W, Liu, N, Fu, Z, You, Y. (2010). MiR-125b is critical for the suppression of human U251 glioma stem cell proliferation. *Brain Res*. Vol.1312, pp. 120-126.

Stupp, R, Mason, WP, van den Bent, MJ, Weller, M, Fisher, B, Taphoorn, MJ, Belanger, K, Brandes, AA, Marosi, C, Bogdahn, U, Curschmann, J, Janzer, RC, Ludwin, SK, Gorlia, T, Allgeier, A, Lacombe, D, Cairncross, JG, Eisenhauer, E, Mirimanoff, RO. (2005). Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma. *N Engl J Med*. Vol.352, NO.10, pp. 987-996.

Sun, F, Fu, H, Liu, Q, Tie, Y, Zhu, J, Xing, R, Sun, Z, Zheng, X. (2008). Downregulation of CCND1 and CDK6 by miR-34a induces cell cycle arrest. *FEBS Lett*. Vol.582, NO.10, pp. 1564-1568.

Verhaak, RG, Hoadley, KA, Purdom, E, Wang, V, Qi, Y, Wilkerson, MD, Miller, CR, Ding, L, Golub, T, Mesirov, JP, Alexe, G, Lawrence, M, O'Kelly, M, Tamayo, P, Weir, BA, Gabriel, S, Winckler, W, Gupta, S, Jakkula, L, Feiler, HS, Hodgson, JG, James, CD, Sarkaria, JN, Brennan, C, Kahn, A, Spellman, PT, Wilson, RK, Speed, TP, Gray, JW, Meyerson, M, Getz, G, Perou, CM, Hayes, DN. (2010). Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. *Cancer Cell*. Vol.17, NO.1, pp. 98-110.

von Mering, C, Jensen, LJ, Snel, B, Hooper, SD, Krupp, M, Foglierini, M, Jouffre, N, Huynen, MA, Bork, P. (2005). STRING: known and predicted protein-protein associations, integrated and transferred across organisms. *Nucleic Acids Res*. Vol.33, NO.Database issue, pp. D433-437.

Wang, G, Kang, C, Pu, P. (2010). Increased expression of Akt2 and activity of PI3K and cell proliferation with the ascending of tumor grade of human gliomas. *Clin Neurol Neurosurg*. Vol.112, NO.4, pp. 324-327.

Wang, HL, Wang, J, Xiao, SY, Haydon, R, Stoiber, D, He, TC, Bissonnette, M, Hart, J. (2002). Elevated protein expression of cyclin D1 and Fra-1 but decreased expression of c-Myc in human colorectal adenocarcinomas overexpressing beta-catenin. *Int J Cancer*. Vol.101, NO.4, pp. 301-310.

Xia, H, Qi, Y, Ng, SS, Chen, X, Chen, S, Fang, M, Li, D, Zhao, Y, Ge, R, Li, G, Chen, Y, He, ML, Kung, HF, Lai, L, Lin, MC. (2009). MicroRNA-15b regulates cell cycle progression by targeting cyclins in glioma cells. *Biochem Biophys Res Commun*. Vol.380, NO.2, pp. 205-210.

Ying, Y, Tao, Q. (2009). Epigenetic disruption of the WNT/beta-catenin signaling pathway in human cancers. *Epigenetics*. Vol.4, NO.5, pp. 307-312.

Yochum, GS, Cleland, R, Goodman, RH. (2008). A genome-wide screen for beta-catenin binding sites identifies a downstream enhancer element that controls c-Myc gene expression. *Mol Cell Biol*. Vol.28, NO.24, pp. 7368-7379.

Zhang, B, Gu, F, She, C, Guo, H, Li, W, Niu, R, Fu, L, Zhang, N, Ma, Y. (2009a). Reduction of Akt2 inhibits migration and invasion of glioma cells. *Int J Cancer*. Vol.125, NO.3, pp. 585-595.

Zhang, C, Kang, C, You, Y, Pu, P, Yang, W, Zhao, P, Wang, G, Zhang, A, Jia, Z, Han, L, Jiang, H. (2009b). Co-suppression of miR-221/222 cluster suppresses human glioma cell growth by targeting p27kip1 in vitro and in vivo. *Int J Oncol*. Vol.34, NO.6, pp. 1653-1660.

Zhang, CZ, Han, L, Zhang, AL, Fu, YC, Yue, X, Wang, GX, Jia, ZF, Pu, PY, Zhang, QY, Kang, CS. (2010a). MicroRNA-221 and microRNA-222 regulate gastric carcinoma cell proliferation and radioresistance by targeting PTEN. *BMC Cancer*. Vol.10, NO.1, pp. 367.

Zhang, CZ, Zhang, JX, Zhang, AL, Shi, ZD, Han, L, Jia, ZF, Yang, WD, Wang, GX, Jiang, T, You, YP, Pu, PY, Cheng, JQ, Kang, CS. (2010b). MiR-221 and miR-222 target PUMA to induce cell survival in glioblastoma. *Mol Cancer*. Vol.9, NO.1, pp. 229.

Zhang, J, Han, L, Ge, Y, Zhou, X, Zhang, A, Zhang, C, Zhong, Y, You, Y, Pu, P, Kang, C. (2010c). miR-221/222 promote malignant progression of glioma through activation of the Akt pathway. *Int J Oncol*. Vol.36, NO.4, pp. 913-920.

Zhang, J, Han, L, Zhang, A, Wang, Y, Yue, X, You, Y, Pu, P, Kang, C. (2010d). AKT2 expression is associated with glioma malignant progression and required for cell survival and invasion. *Oncol Rep*. Vol.24, NO.1, pp. 65-72.

Zhou, X, Ren, Y, Moore, L, Mei, M, You, Y, Xu, P, Wang, B, Wang, G, Jia, Z, Pu, P, Zhang, W, Kang, C. (2010a). Downregulation of miR-21 inhibits EGFR pathway and suppresses the growth of human glioblastoma cells independent of PTEN status. *Lab Invest*. Vol.90, NO.2, pp. 144-155.

Zhou, X, Zhang, J, Jia, Q, Ren, Y, Wang, Y, Shi, L, Liu, N, Wang, G, Pu, P, You, Y, Kang, C. (2010b). Reduction of miR-21 induces glioma cell apoptosis via activating caspase 9 and 3. *Oncol Rep*. Vol.24, NO.1, pp. 195-201.

## **Part 4**

**Sequence Analysis and Evolution** 


### **Significance Score of Motifs in Biological Sequences**

Grégory Nuel

*Institute for Mathematical Sciences (INSMI), CNRS, Paris; Department of Applied Mathematics (MAP5), University of Paris Descartes, France*

#### **1. Introduction**

In Bioinformatics, it is common to search biological sequences (DNA, RNA, proteins) for functional motifs such as cross-over hotspot instigators (chi), restriction sites, regulation motifs, binding sites, active sites in proteins, etc. (Beaudoing et al., 2000; Brazma et al., 1998; El Karoui et al., 1999; Frith et al., 2002; Hampson et al., 2002; Karlin et al., 1992; Marino-Ramírez & Landsman, 2004; van Helden et al., 1998). Due to evolutionary pressure, functional motifs are likely to be more conserved than non-functional motifs. As a consequence, it is a natural strategy to search biological sequences for motifs that are statistically exceptional (e.g., over- or under-represented).

Given a motif $\mathcal{M}$ of interest (from simple strings to complex regular expressions), a recurrent question is: "how surprising is it to observe *n* occurrences of $\mathcal{M}$ in my dataset?". In statistical terms, this is equivalent to computing the *p*-value of observation *n* with respect to a relevant reference model. More precisely, if $X\_{1:\ell} = X\_1 \ldots X\_\ell$ is a random sequence of length $\ell$ generated by our reference model, and if *N* denotes the random number of occurrences of $\mathcal{M}$ in $X\_{1:\ell}$, then for any $n \geq 0$ our objective is to compute the significance score of observation *n*:

$$S(n) = \begin{cases} +\log\_{10} \mathbb{P}(N \le n) \text{ if } n \le \mathbb{E}[N] \\ -\log\_{10} \mathbb{P}(N \ge n) \text{ if } n > \mathbb{E}[N] \end{cases} \tag{1}$$

This score expresses the *p*-value on a decimal log scale: negative (resp. positive) values correspond to under- (resp. over-) representation events.
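As a minimal sketch of Eq. (1), the following Python snippet computes the score from an explicit distribution of *N*. The binomial distribution used here is only a stand-in chosen for illustration (an assumption, not the reference model of this chapter), and the function names are hypothetical.

```python
import math

def significance_score(n, pmf):
    """Score S(n) of Eq. (1), where pmf[k] = P(N = k) for k = 0..len(pmf)-1."""
    expectation = sum(k * p for k, p in enumerate(pmf))
    if n <= expectation:
        # Under-representation side: +log10 P(N <= n), a non-positive value.
        return math.log10(sum(pmf[: n + 1]))
    # Over-representation side: -log10 P(N >= n), a non-negative value.
    return -math.log10(sum(pmf[n:]))

def binomial_pmf(trials, prob):
    """Toy count distribution (assumption): N ~ Binomial(trials, prob)."""
    return [math.comb(trials, k) * prob**k * (1 - prob) ** (trials - k)
            for k in range(trials + 1)]

pmf = binomial_pmf(10, 0.5)                   # E[N] = 5
print(round(significance_score(10, pmf), 4))  # over-represented, about 3.0103
print(round(significance_score(0, pmf), 4))   # under-represented, about -3.0103
```

The sketch assumes that *n* lies in the support of the distribution; otherwise the tail probability is zero and the logarithm is undefined.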

In order to compute such a score for a given motif M and a given dataset, one needs two essential steps:

**1) Counting:** count the observed number *n* of occurrences of motif M in the dataset;

**2) Significance:** compute the p-value of observation *n* with respect to a reference model.

In this chapter, we give all the necessary details to perform these two steps using state-of-the-art approaches, including some unpublished results.

#### **2. Counting motifs**

#### **2.1 Biological motifs**

Fig. 1 shows various examples of the kinds of biological motifs we usually deal with in Bioinformatics. In most cases, these motifs are built from a set of active sequences (putative or confirmed by experiments) in the form of a multiple alignment or a frequency matrix, from which a consensus can be derived. This consensus can sometimes be a simple string (e.g., AGCGG, the chi site of *B. subtilis*), but in most cases it is a degenerate pattern (e.g., CAYNNNNNRTG, a restriction site in the IUPAC alphabet, or PROSITE signatures). In all cases, however, it is possible to consider our biological motif $\mathcal{M}$ as a (possibly large) set of strings.

Strings in the IUPAC alphabet (DNA): AGCGG, GSTGGTGG, CCWGG, CTGCAG, GAATTC, CAYNNNNNRTG

Multiple alignment (proteins):

CPRRRGRQTYTRFQTLELEKEFHF...........NHYLTRRRRIEIAHAL.........

CPRRRGRQTYTRFQTLELEKEFHF...........NHYLTRRRRIEIAHAL.........

VSVRKKRKPYSKFQTLELEKEFLF...........NAYVSKQKRWELARNL.........

-----------------LTKYFNK...........QPYPTRREIEKLAASL.........

-----------------LTKYFNK...........QPYPTRREIEKLAASL.........

-----------------LTKYFNK...........QPYPTRREIEKLAASL.........

SGKRRRRGNLPKESVQILRDWLYEhr........yNAYPSEQEKVLLSRQT.........

SKKRRHRTTFTSLQLEELEKVFQK...........THYPDVYVREQLALRT.........

SKKRRHRTTFTSLQLEELEKVFQK...........THYPDVYVREQLALRT.........

Consensus pattern (proteins): H-D-[LIVMFY]-x-H-x-[AG]-x(2)-[NQ]-x-[LIVMFY]

Frequency matrix (DNA):

| Base | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| A | 3 | 21 | 25 | 0 | 0 | 24 | 1 | 0 |
| C | 13 | 1 | 0 | 0 | 5 | 0 | 0 | 0 |
| G | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 2 |
| T | 5 | 3 | 0 | 25 | 20 | 0 | 24 | 23 |

Fig. 1. Various kinds of biological motifs. From top to bottom: strings in the IUPAC alphabet (Cornish-Bowden, 1985) (DNA), multiple alignment (proteins), sequence logo (proteins), consensus pattern (proteins), and frequency matrix (DNA). Sources include the ReBase (Roberts et al., 2010), PROSITE (Sigrist et al., 2010), and JASPAR (Bryne et al., 2008) databases.

Fig. 2. Glushkov's construction for A(C|T)∗|G. Panels: (a) NFA for A; (b) NFA for C; (c) NFA for G; (d) NFA for T; (e) NFA for (C|T); (f) NFA for (C|T)∗; (g) NFA for A(C|T)∗; (h) NFA for A(C|T)∗|G. (a), (b), (c), and (d) are singletons; (e) results from the union of (b) and (d); (f) results from the Kleene closure of (e); (g) results from the concatenation of (a) and (f); (h) results from the union of (g) and (c).

Formally, let $\mathcal{M}$ be a finite set of strings over a finite alphabet $\mathcal{A}$. For example, $\mathcal{A} = \{\mathtt{A}, \mathtt{C}, \mathtt{G}, \mathtt{T}\}$ for DNA sequences; this is the alphabet we use from now on in our examples. Let $X\_{1:\ell} = X\_1 \ldots X\_\ell$ be an observed sequence of length $\ell$ over $\mathcal{A}$. Then the number $N(\mathcal{M}; X\_{1:\ell})$ of matching positions of $\mathcal{M}$ in $X\_{1:\ell}$ is defined by

$$N(\mathcal{M}; X\_{1:\ell}) = \sum\_{i=1}^{\ell} \mathbf{1}\_{X\_{1:i} \in \mathcal{A}^\* \mathcal{M}} \tag{2}$$

where $\mathcal{A}^\*\mathcal{M}$ is the set of all finite sequences over $\mathcal{A}$ ending with one element of $\mathcal{M}$ (this notation will be explained in the next section), and where $\mathbf{1}\_A$ is the indicator function of event $A$.

In the particular case where $\mathcal{M}$ contains no string that is a substring of another (which is a common assumption), the number *N* of matching positions corresponds exactly to the number of occurrences. However, there is no need to put any restriction on $\mathcal{M}$ as long as we are interested in the number of matching positions, as we are here.

From now on, if the sequence $X\_{1:\ell}$ is observed, we denote the number of matching positions by *n*, and if the sequence $X\_{1:\ell}$ is random, we simply denote by *N* the random number of matching positions.
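The definition of matching positions in Eq. (2) can be sketched directly in Python for a motif given as an explicit finite set of strings. This is a naive illustration (the function name is hypothetical, and the cost grows with the sequence length times the motif size), not the automaton-based method developed later in the chapter.

```python
def matching_positions(motif, seq):
    """Return the 1-based positions i such that seq[:i] ends with a word of motif.

    motif: iterable of non-empty strings (the finite set M of Eq. (2)).
    seq:   the observed sequence X_1 ... X_l as a string.
    """
    words = set(motif)
    # str.endswith(w, 0, i) tests whether the prefix seq[0:i] ends with w.
    return [i for i in range(1, len(seq) + 1)
            if any(seq.endswith(w, 0, i) for w in words)]

# The motif G(G|C)G written as an explicit set of strings.
positions = matching_positions({"GGG", "GCG"}, "AGCGGTGGGCGA")
print(positions, len(positions))  # [4, 9, 11] 3

# When one word is contained in another, matching positions and occurrences differ:
# "GG" holds three occurrences of {"G", "GG"} but only two matching positions.
print(matching_positions({"G", "GG"}, "GG"))  # [1, 2]
```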

#### **2.2 Regular languages**

Let us denote by $\mathcal{A}^\*$ the set of all finite sequences over $\mathcal{A}$. Any subset $\mathcal{L} \subset \mathcal{A}^\*$ is then called a *language* over $\mathcal{A}$. We denote by $\mathcal{P}(\mathcal{A}^\*)$ the set of all possible languages over $\mathcal{A}$. We denote by $\varepsilon \in \mathcal{A}^\*$ the empty sequence, and for the sake of simplicity, the singletons of $\mathcal{P}(\mathcal{A}^\*)$ will be simply denoted by their element. Ex: A instead of {A}, TGC instead of {TGC}, *ε* instead of {*ε*}. We define on these languages three *regular operations*:

**Union (|):** for all $\mathcal{L}\_1, \mathcal{L}\_2 \in \mathcal{P}(\mathcal{A}^\*)$, $\mathcal{L}\_1 | \mathcal{L}\_2 = \mathcal{L}\_1 \cup \mathcal{L}\_2$. The neutral element of the binary operator | is ∅. Ex: {AT, GA}|{T, GA, TT} = {AT, T, GA, TT}.

in Biological Sequences 5

Significance Score of Motifs in Biological Sequences 177

Mohri, 2006). This construction provides a simple way to build the NFA directly from the regular expression of the language. The idea is to treat the regular expression as any algebraic expression with a stack of operands (NFAs) and a stack of operators (regular operations). Since a regular expression is by definition built from singleton elements of <sup>A</sup> and the three regular operations, we only need to give the construction of a NFA corresponding to singleton

**Singleton:** for any *<sup>X</sup>*1: ∈ A we build the NFA (A, <sup>Q</sup>, *<sup>σ</sup>*, <sup>F</sup>, *<sup>δ</sup>*) with <sup>Q</sup> <sup>=</sup> {0, 1, . . . , }, *<sup>σ</sup>* <sup>=</sup> 0,

**Union:** the union (A, Q, *σ*, F, *δ*) of two NFAs (A, Q1, *σ*1, F1, *δ*1) and (A, Q, *σ*2, F2, *δ*2) is given

**Concatenation:** the concatenation (A, Q, *σ*, F, *δ*) of two NFAs (A, Q1, *σ*1, F1, *δ*1) and

**Kleene's closure:** the Kleene's closure (A, Q, *σ*, F, *δ*) of NFA (A, Q1, *σ*1, F1, *δ*1) is given by:

*<sup>δ</sup>*(*q*, *<sup>a</sup>*) = � *<sup>δ</sup>*1(*q*, *<sup>a</sup>*) if *<sup>q</sup>* ∈ Q<sup>1</sup> \ F<sup>1</sup>

Using Glushkov's construction, it is then possible to build a NFA whose language correspond to the regular expression of our choice. However in general, this construction is not optimal in terms of number of states. Fortunately, the reduction algorithm (Algorithm 1) due to Hopcroft provides a (partial) solution to this problem. Note that finding a minimal NFA for a given regular expression is a difficult task in general, but that Hopcroft's reduction is a good heuristic (we will see later that in the case of DFA, Hopcroft's reduction is indeed a

NFAs provide with Algorithm 2 an extremely efficient way to look for matching positions of any motif M (in fact, any regular expression) in a sequence *X*1:. The algorithm directly

Let us illustrate this algorithm with a toy example: how to find all matching positions of M = G(G|C)G in *X*1:12 = AGCGGTGGGCGA ? We first use Glushkov's construction and Algorithm 1 to obtain on Fig. 3 a minimal NFA whose language is (A|C|G|T)<sup>G</sup>(G|C)G. Then we directly apply

*δ*1(*σ*1, *a*) if *q* ∈ F<sup>1</sup>

*δ*1(*σ*1, *a*) ∪ *δ*1(*σ*2, *a*) if *q* = *σ*<sup>1</sup>

*δ*1(*q*, *a*) if *q* ∈ Q<sup>1</sup> \ {*σ*1} *δ*2(*q*, *a*) if *q* ∈ Q<sup>2</sup> \ {*σ*2}

*δ*1(*q*, *a*) if *q* ∈ Q<sup>1</sup> \ F<sup>1</sup>

*δ*2(*q*, *a*) if *q* ∈ Q<sup>2</sup> \ {*σ*2}

*δ*2(*σ*2, *a*) if *q* ∈ F<sup>1</sup>

. (3)

. (4)

. (5)

elements, and the constructions corresponding to the regular operations.

⎧ ⎪⎪⎨

⎪⎪⎩

*δ*(*q*, *a*) =

(A, Q, *σ*2, F2, *δ*2) is given by: Q = Q<sup>1</sup> ∪ Q<sup>2</sup> \ {*σ*2}, *σ* = *σ*1, F = F<sup>2</sup> and

⎧ ⎪⎪⎨

⎪⎪⎩

F = {}, and *δ*(*i* − 1, *Xi*) = {*i*} for all 1 *i* .

by: Q = Q<sup>1</sup> ∪ Q<sup>2</sup> \ {*σ*2}, *σ* = *σ*1, F = F<sup>1</sup> ∪ F<sup>2</sup> and

*δ*(*q*, *a*) =

Q = Q1, *σ* = *σ*1, F = F<sup>1</sup> ∪ {*σ*1} and

results from the definition of the language of a NFA.

minimization).

**2.4 Counting with NFA**

Algorithm 2 starting with S = {0}: • *i* = 1, *X*<sup>1</sup> = A, S ← *δ*({0}, A) = {0}; • *i* = 2, *X*<sup>2</sup> = G, S ← *δ*({0}, G) = {0, 1};

**Require:** remove first all states that are not reachable from *σ* or that cannot reach any element of F 1: W ← {F, Q\F} and P ← {F, Q\F} 2: **while** W is not empty **do** 3: select and remove V from W 4: **for all** *a* ∈ A **do** 5: S = {*q* ∈ Q, *δ*(*q*, *a*) ∈ V} 6: **for all** R∈P such as R∩S �= ∅ and R S **do** 7: replace R in P by R<sup>1</sup> ←R∩S and R<sup>2</sup> ←R\R<sup>1</sup> 8: **if** R∈W **then** 9: replace R in P by R<sup>1</sup> and R<sup>2</sup> 10: **else** 11: **if** |R1| |R2| **then** add R<sup>1</sup> to W **else** add R<sup>2</sup> to W **end if** 12: **end if** 13: **end for** 14: **end for** 15: **end while**

Algorithm 1. Performs Hopcroft's reduction on NFA (A, Q, *σ*, F, *δ*). W (working set) and P (partition set) are two sets of set of NFA states. The resulting complexity is *O*(|Q| log |Q|).


The precedence rule of these operations is: <sup>|</sup> (lowest precedence), · (associative operator), (highest precedence). Ex: <sup>A</sup>|<sup>C</sup> · <sup>T</sup> = (A|(C(·<sup>T</sup>)), TT · <sup>A</sup>|<sup>C</sup> · <sup>G</sup> = ((TT · <sup>A</sup>)|((<sup>C</sup>) · <sup>G</sup>)). We call *regular expression* over <sup>A</sup> any algebric expression over <sup>P</sup>(<sup>A</sup>) defined from singleton

elements and a finite number of regular operations. The resulting language is called a *regular language*. Ex: any finite language is a regular language, <sup>A</sup> is a regular language, (A|C|G|T)GGATG is a regular language, {AG, AAGG, AAAGGG,...} is not a regular language.

#### **2.3 Non-deterministic finite automaton**

A *Non-deterministic Finite Automaton* (NFA) is defined as a 5-tuple (A, Q, *σ*, F, *δ*) where: A is a finite alphabet, Q is a finite *state space*, *σ* ∈ Q is the *starting state*, F⊂Q is the set of *final states*, and *<sup>δ</sup>* : Q×A→P(Q) is the *transition function*. An element *<sup>X</sup>*1: ∈ A is *accepted* by this NFA if and only if it exists a *path* from the starting state to one of the final state that sequentially use the letters *X*1: in the transitions. More formally, it means that it exists a sequence of states (ie: elements of Q) *<sup>q</sup>*<sup>0</sup> = *<sup>σ</sup>*, *<sup>q</sup>*1, *<sup>q</sup>*<sup>2</sup> ..., *<sup>q</sup>*<sup>−</sup>1, *<sup>q</sup>* ∈ F such as *qi* ∈ *<sup>δ</sup>*(*qi*−1, *Xi*) for all 1 *<sup>i</sup>* . The *language of a NFA* is the set of all elements of <sup>A</sup> it accepts.

**Theorem 1.** For any language L∈P(<sup>A</sup>): <sup>L</sup> regular ⇐⇒ it exists a NFA whose language is L.

We admit that the language of a NFA is always regular (see Hopcroft et al., 2001, for the formal proof) but we will prove the reciprocal with the Glushkov's construction (Allauzen & 4 Will-be-set-by-IN-TECH

**Require:** remove first all states that are not reachable from *σ* or that cannot reach any element

Algorithm 1. Performs Hopcroft's reduction on NFA (A, Q, *σ*, F, *δ*). W (working set) and P (partition set) are two sets of set of NFA states. The resulting complexity is *O*(|Q| log |Q|).

**Concatenation (**·**):** for all L<sub>1</sub>, L<sub>2</sub> ∈ P(A<sup>∗</sup>), L<sub>1</sub> · L<sub>2</sub> = {*xy*, *x* ∈ L<sub>1</sub>, *y* ∈ L<sub>2</sub>}. The neutral element of the binary operator · is *ε*. For all L ∈ P(A<sup>∗</sup>), L<sup>0</sup> = {*ε*} (convention), L<sup>1</sup> = L, L<sup>2</sup> = L · L, and the notation extends recursively to L<sup>*k*</sup> for any *k* ≥ 3. Ex: {G, GA} · {AT, T} = {GAT, GT, GAAT}; {G, GA}<sup>3</sup> = {GGG, GGGA, GGAG, GGAGA, GAGG, GAGGA, GAGAG, GAGAGA}. For the sake of simplicity, · is implicit when the operator is omitted. Ex: AL means A · L.

**Kleene's closure (**∗**):** for all L ∈ P(A<sup>∗</sup>), L<sup>∗</sup> = ∪<sub>*k*≥0</sub> L<sup>*k*</sup>. Ex: {AT}<sup>∗</sup> = {*ε*, AT, ATAT, ATATAT, ...}.

The precedence rule of these operations is: | (lowest precedence), · (associative operator), ∗ (highest precedence). Ex: A|C · T<sup>∗</sup> = (A|(C · (T<sup>∗</sup>))), TT · A|C<sup>∗</sup> · G = ((TT · A)|((C<sup>∗</sup>) · G)).

We call *regular expression* over A any algebraic expression over P(A<sup>∗</sup>) defined from singleton elements and a finite number of regular operations. The resulting language is called a *regular language*. Ex: any finite language is a regular language, A<sup>∗</sup> is a regular language, (A|C|G|T)GGATG is a regular language, {AG, AAGG, AAAGGG, ...} is not a regular language.

#### **2.3 Non-deterministic finite automaton**

A *Non-deterministic Finite Automaton* (NFA) is defined as a 5-tuple (A, Q, *σ*, F, *δ*) where: A is a finite alphabet, Q is a finite *state space*, *σ* ∈ Q is the *starting state*, F ⊂ Q is the set of *final states*, and *δ* : Q × A → P(Q) is the *transition function*. An element *X*<sub>1:ℓ</sub> ∈ A<sup>ℓ</sup> is *accepted* by this NFA if and only if there exists a *path* from the starting state to one of the final states that sequentially uses the letters of *X*<sub>1:ℓ</sub> in its transitions. More formally, there exists a sequence of states (i.e., elements of Q) *q*<sub>0</sub> = *σ*, *q*<sub>1</sub>, *q*<sub>2</sub>, ..., *q*<sub>ℓ−1</sub>, *q*<sub>ℓ</sub> ∈ F such that *q<sub>i</sub>* ∈ *δ*(*q*<sub>*i*−1</sub>, *X<sub>i</sub>*) for all 1 ≤ *i* ≤ ℓ. The *language of a NFA* is the set of all elements of A<sup>∗</sup> it accepts.

```
W ← {F, Q\F} and P ← {F, Q\F}
while W is not empty do
    select and remove V from W
    for all a ∈ A do
        S = {q ∈ Q, δ(q, a) ∈ V}
        for all R ∈ P such that R ∩ S ≠ ∅ and R ⊄ S do
            replace R in P by R1 ← R ∩ S and R2 ← R \ R1
            if R ∈ W then
                replace R in W by R1 and R2
            else
                if |R1| ≤ |R2| then add R1 to W else add R2 to W end if
            end if
        end for
    end for
end while
```
Algorithm 1. Hopcroft's reduction algorithm.

**Theorem 1.** For any language L ∈ P(A<sup>∗</sup>): L regular ⇐⇒ there exists a NFA whose language is L.

We admit that the language of a NFA is always regular (see Hopcroft et al., 2001, for the formal proof), but we will prove the converse with Glushkov's construction (Allauzen & Mohri, 2006). This construction provides a simple way to build the NFA directly from the regular expression of the language. The idea is to treat the regular expression as any algebraic expression with a stack of operands (NFAs) and a stack of operators (regular operations). Since a regular expression is by definition built from singleton elements of A and the three regular operations, we only need to give the construction of a NFA corresponding to singleton elements, and the constructions corresponding to the regular operations.


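**Union:** for the union (A, Q, *σ*, F, *δ*) of two NFAs (A, Q<sub>1</sub>, *σ*<sub>1</sub>, F<sub>1</sub>, *δ*<sub>1</sub>) and (A, Q<sub>2</sub>, *σ*<sub>2</sub>, F<sub>2</sub>, *δ*<sub>2</sub>), the transition function is given by:

$$\delta(q, a) = \begin{cases} \delta\_1(\sigma\_1, a) \cup \delta\_2(\sigma\_2, a) & \text{if } q = \sigma\_1 \\ \delta\_1(q, a) & \text{if } q \in \mathcal{Q}\_1 \backslash \{\sigma\_1\} \\ \delta\_2(q, a) & \text{if } q \in \mathcal{Q}\_2 \backslash \{\sigma\_2\} \end{cases} \tag{3}$$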

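**Concatenation:** the concatenation (A, Q, *σ*, F, *δ*) of two NFAs (A, Q<sub>1</sub>, *σ*<sub>1</sub>, F<sub>1</sub>, *δ*<sub>1</sub>) and (A, Q<sub>2</sub>, *σ*<sub>2</sub>, F<sub>2</sub>, *δ*<sub>2</sub>) is given by: Q = Q<sub>1</sub> ∪ Q<sub>2</sub> \ {*σ*<sub>2</sub>}, *σ* = *σ*<sub>1</sub>, F = F<sub>2</sub> and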

$$\delta(q, a) = \begin{cases} \delta\_1(q, a) & \text{if } q \in \mathcal{Q}\_1 \backslash \mathcal{F}\_1 \\\\ \delta\_2(\sigma\_2, a) & \text{if } q \in \mathcal{F}\_1 \\\\ \delta\_2(q, a) & \text{if } q \in \mathcal{Q}\_2 \backslash \{\sigma\_2\} \end{cases} \tag{4}$$

**Kleene's closure:** the Kleene's closure (A, Q, *σ*, F, *δ*) of a NFA (A, Q<sub>1</sub>, *σ*<sub>1</sub>, F<sub>1</sub>, *δ*<sub>1</sub>) is given by: Q = Q<sub>1</sub>, *σ* = *σ*<sub>1</sub>, F = F<sub>1</sub> ∪ {*σ*<sub>1</sub>} and

$$\delta(q, a) = \begin{cases} \delta\_1(q, a) & \text{if } q \in \mathcal{Q}\_1 \backslash \mathcal{F}\_1 \\ \delta\_1(\sigma\_1, a) & \text{if } q \in \mathcal{F}\_1 \end{cases} \tag{5}$$

Using Glushkov's construction, it is then possible to build a NFA whose language corresponds to the regular expression of our choice. In general, however, this construction is not optimal in terms of number of states. Fortunately, the reduction algorithm (Algorithm 1) due to Hopcroft provides a (partial) solution to this problem. Note that finding a minimal NFA for a given regular expression is a difficult task in general, but Hopcroft's reduction is a good heuristic (we will see later that in the case of DFA, Hopcroft's reduction is indeed a minimization).

#### **2.4 Counting with NFA**

Together with Algorithm 2, NFAs provide an extremely efficient way to look for the matching positions of any motif M (in fact, any regular expression) in a sequence *X*<sub>1:ℓ</sub>. The algorithm directly results from the definition of the language of a NFA.
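The matching procedure can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the 4-state NFA for (A|C|G|T)∗G(G|C)G is an assumption (state 0 loops on every letter, 0 →G→ 1, 1 →G or C→ 2, 2 →G→ 3, state 3 final). The second function additionally memoizes each (state set, letter) transition, so that a transition already computed once is looked up rather than recomputed.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Delta = Dict[Tuple[int, str], Set[int]]


def toy_nfa() -> Tuple[Delta, int, Set[int]]:
    """Assumed 4-state NFA for (A|C|G|T)*G(G|C)G; state 3 is final."""
    delta: Delta = {(0, a): {0} for a in "ACGT"}  # state 0 loops on every letter
    delta[(0, "G")] = {0, 1}
    delta[(1, "G")] = {2}
    delta[(1, "C")] = {2}
    delta[(2, "G")] = {3}
    return delta, 0, {3}


def step(delta: Delta, states: FrozenSet[int], letter: str) -> FrozenSet[int]:
    """Union of delta(q, letter) over all current states q."""
    nxt: Set[int] = set()
    for q in states:
        nxt |= delta.get((q, letter), set())
    return frozenset(nxt)


def nfa_match(delta: Delta, start: int, final: Set[int], seq: str) -> List[int]:
    """Return the 1-based positions at which an occurrence of the motif ends."""
    states = frozenset({start})
    hits = []
    for i, letter in enumerate(seq, start=1):
        states = step(delta, states, letter)
        if states & final:
            hits.append(i)
    return hits


def nfa_match_cached(delta: Delta, start: int, final: Set[int], seq: str):
    """Same matching, but each (state set, letter) transition is computed once."""
    cache: Dict[Tuple[FrozenSet[int], str], FrozenSet[int]] = {}
    states = frozenset({start})
    hits = []
    for i, letter in enumerate(seq, start=1):
        key = (states, letter)
        if key not in cache:
            cache[key] = step(delta, states, letter)
        states = cache[key]
        if states & final:
            hits.append(i)
    return hits, len(cache)


delta, start, final = toy_nfa()
print(nfa_match(delta, start, final, "AGCGGTGGGCGA"))  # [4, 9, 11]
```

The unbounded dictionary used as a cache is a simplification; a fixed-size buffer that is flushed when full would bound the memory usage.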

```
Require: (A, Q, σ, F, δ) a (minimal) NFA whose language is A∗M
S ← {σ}
for i = 1 ... ℓ do
    S ← ∪_{q ∈ S} δ(q, X_i)
    if S ∩ F ≠ ∅ then
        report i as a matching position
    end if
end for
```
Algorithm 2. NFA pattern matching. Returns all matching positions of motif M in *X*<sub>1:ℓ</sub>. Complexity is *O*(|Q| × ℓ).

Fig. 3. Minimal NFA whose language is (A|C|G|T)<sup>∗</sup>G(G|C)G.

Let us illustrate this algorithm with a toy example: how to find all matching positions of M = G(G|C)G in *X*<sub>1:12</sub> = AGCGGTGGGCGA? We first use Glushkov's construction and Algorithm 1 to obtain on Fig. 3 a minimal NFA whose language is (A|C|G|T)<sup>∗</sup>G(G|C)G. Then we directly apply Algorithm 2 starting with S = {0}:

- *i* = 1, *X*<sub>1</sub> = A, S ← *δ*({0}, A) = {0};
- *i* = 2, *X*<sub>2</sub> = G, S ← *δ*({0}, G) = {0, 1};
- *i* = 3, *X*<sub>3</sub> = C, S ← *δ*({0, 1}, C) = {0, 2};
- *i* = 4, *X*<sub>4</sub> = G, S ← *δ*({0, 2}, G) = {0, 1, 3}, matching position;
- *i* = 5, *X*<sub>5</sub> = G, S ← *δ*({0, 1, 3}, G) = {0, 1, 2};
- *i* = 6, *X*<sub>6</sub> = T, S ← *δ*({0, 1, 2}, T) = {0};
- *i* = 7, *X*<sub>7</sub> = G, S ← *δ*({0}, G) = {0, 1};
- *i* = 8, *X*<sub>8</sub> = G, S ← *δ*({0, 1}, G) = {0, 1, 2};
- *i* = 9, *X*<sub>9</sub> = G, S ← *δ*({0, 1, 2}, G) = {0, 1, 2, 3}, matching position;
- *i* = 10, *X*<sub>10</sub> = C, S ← *δ*({0, 1, 2, 3}, C) = {0, 2};
- *i* = 11, *X*<sub>11</sub> = G, S ← *δ*({0, 2}, G) = {0, 1, 3}, matching position;
- *i* = 12, *X*<sub>12</sub> = A, S ← *δ*({0, 1, 3}, A) = {0}.

We hence return three matching positions: 4, 9 and 11.

One should note in this example that on two occasions we need to recompute a previously computed transition (*i* = 7 and *i* = 11). Obviously, this kind of event is likely to appear very often when working with longer sequences. It is hence a natural idea to store previously computed transitions in memory. This approach, known as *lazy determinization* (Green et al., 2004), considerably speeds up pattern matching (reducing the complexity from *O*(|Q| × ℓ) to *O*(ℓ)) at the expense of a higher memory usage. We will see later that the amount of memory needed can increase exponentially with the NFA size |Q|; this problem is usually addressed by allocating a fixed amount of memory to a buffer of computed transitions which is flushed when full.

#### **3. Significance**

Since we now have efficient algorithms to count the number of occurrences of a motif M in a sequence *X*<sub>1:ℓ</sub>, let us deal with the significance of an observation *n*.

#### **3.1 Reference model**

The choice of a reference model is obviously a key point. Since biological sequences like DNA or proteins are known to have unbalanced letter compositions, our model should at least take into account this source of bias. A natural parametric approach<sup>1</sup> is hence to model *X*<sub>1:ℓ</sub> as an i.i.d. sequence with **P**(*X<sub>i</sub>* = *a*) = *π*(*a*) for all *a* ∈ A, with all *π*(*a*) ∈ [0, 1] and ∑<sub>*a*∈A</sub> *π*(*a*) = 1. This model is called model M0 with parameter *π*.

For example, in the complete genome of HIV1 (Genbank AF033819) we observe the following counts: 3272 A, 1642 C, 2225 G, and 2042 T. The maximum likelihood estimates of a M0 model based on this observation are then: *π̂*(A) = 3272/9181 ≈ 35.64%, *π̂*(C) = 1642/9181 ≈ 17.88%, *π̂*(G) = 2225/9181 ≈ 24.23%, and *π̂*(T) = 2042/9181 ≈ 22.24%.

But if we now look at the frequencies of di-nucleotides in the same HIV1 genome, we observe considerable bias as well:

$$\begin{array}{llll} \text{AA } 1087 & \text{AC } 524 & \text{AG } 971 & \text{AT } 690 \\ \text{CA } 754 & \text{CC } 378 & \text{CG } 82 & \text{CT } 427 \\ \text{GA } 769 & \text{GC } 425 & \text{GG } 625 & \text{GT } 406 \\ \text{TA } 662 & \text{TC } 315 & \text{TG } 546 & \text{TT } 519 \end{array}$$
For example, we observe 971/3272 = 29.68% of G after an A, but a G occurs after a C only 82/1641 ≈ 5.00% of the time. This phenomenon is directly explained by the fact that the di-nucleotide CG tends to be easily methylated (see CpG island, Fatemi et al., 2005). It is hence tempting to take into account the frequencies of di-nucleotides in our reference model, or tri-nucleotides, or more, which naturally leads to Markov models.

For any *d* ≥ 0, we denote by M*d* the (homogeneous) Markov model of order *d* defined for any *i* ≥ *d* + 1, *a* ∈ A<sup>*d*</sup>, and *b* ∈ A by:

$$\mathbb{P}(X\_i = b | X\_{i-d:i-1} = a) = \pi(a, b) \tag{6}$$

where *π* denotes the *transition matrix* of M*d*. This model is clearly defined conditionally on *X*<sub>1:*d*</sub>.

The maximum likelihood estimator *π̂* is then given for all *a* ∈ A<sup>*d*</sup> and *b* ∈ A by:

$$\hat{\pi}(a,b) = \frac{n\_{ab}}{\sum\_{b' \in \mathcal{A}} n\_{ab'}} \tag{7}$$

where *n<sub>ab</sub>* is the observed count of the word *ab* in the training dataset.
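As an illustration of these estimators, the following Python sketch computes the M0 frequencies and the order 1 transition matrix of Equation (7) from the HIV1 letter and di-nucleotide counts quoted above. The function names `estimate_m0` and `estimate_m1` are hypothetical and only serve this example.

```python
from typing import Dict


def estimate_m0(letter_counts: Dict[str, int]) -> Dict[str, float]:
    """Maximum likelihood estimate of pi(a) under model M0."""
    total = sum(letter_counts.values())
    return {a: n / total for a, n in letter_counts.items()}


def estimate_m1(pair_counts: Dict[str, int]) -> Dict[str, Dict[str, float]]:
    """Equation (7) with d = 1: pi(a, b) = n_ab / sum over b' of n_ab'."""
    row_totals: Dict[str, int] = {}
    for word, n in pair_counts.items():
        row_totals[word[0]] = row_totals.get(word[0], 0) + n
    pi: Dict[str, Dict[str, float]] = {}
    for word, n in pair_counts.items():
        a, b = word[0], word[1]
        pi.setdefault(a, {})[b] = n / row_totals[a]
    return pi


hiv1_letters = {"A": 3272, "C": 1642, "G": 2225, "T": 2042}
hiv1_pairs = {
    "AA": 1087, "AC": 524, "AG": 971, "AT": 690,
    "CA": 754, "CC": 378, "CG": 82, "CT": 427,
    "GA": 769, "GC": 425, "GG": 625, "GT": 406,
    "TA": 662, "TC": 315, "TG": 546, "TT": 519,
}

pi0 = estimate_m0(hiv1_letters)
pi1 = estimate_m1(hiv1_pairs)
print(f"pi(A)   = {pi0['A']:.4f}")
print(f"pi(A,G) = {pi1['A']['G']:.4f}")  # G after A
print(f"pi(C,G) = {pi1['C']['G']:.4f}")  # G after C
```

Each row of the estimated transition matrix sums to one by construction, which gives a quick sanity check on the counts.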

When working with Markov models and biological sequences, a recurrent question is: what order *d* should I choose for my reference model? This is a classical model selection problem which can be addressed using penalized likelihood criteria like BIC or AIC (Liddle, 2007). For example, using the BIC criterion, one would select *d* = 1 for the complete genome of HIV1 (≈ 10 kb), and *d* = 5 for the complete genome of *E. coli* (≈ 4.6 Mb). However, since our objective is the significance of motif counts rather than the modelling of the biological sequence itself, we suggest a different approach.

First, it is critical to realize that working with a model M*d* as reference model allows one to take into account the sequence composition bias in (*d* + 1)-mers. Hence, with *d* = 1 one takes into account the composition bias in di-nucleotides, and with *d* = 5, one takes into account the composition bias in hexa-nucleotides. The decision could then be based on the information one wishes to include in the reference model; working on coding sequences, one might wish to take into account at least the codon bias, hence resulting in the choice of *d* ≥ 2. On the other hand, it would obviously be pointless to use a reference model of order *d* = 7 to study a motif of length 8 or less.

<sup>1</sup> An alternative non-parametric approach, the *shuffling*, consists in performing uniformly a random permutation of the original sequence; this approach is not treated here.

```
Require: (A, Q1, σ, F1, δ1) a NFA
q0 ← {σ}, L ← 1, Q2 ← {q0}, F2 ← ∅
for i = 0 ... L − 1 do
    for all a ∈ A do
        S ← ∪_{q ∈ qi} δ1(q, a)
        if ∃j, qj = S then
            δ2(qi, a) = qj
        else
            qL ← S, δ2(qi, a) = qL, Q2 ← Q2 ∪ {qL}
            if S ∩ F1 ≠ ∅ then F2 ← F2 ∪ {qL} end if
            L ← L + 1
        end if
    end for
end for
Output: return (A, Q2, q0, F2, δ2)
```
Algorithm 3. Determinization. Build a DFA which recognizes the same language as the original NFA.
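The determinization of Algorithm 3 can be sketched in Python as a subset construction. This is a minimal sketch under stated assumptions: the function name `determinize` is hypothetical, DFA states are represented by frozensets of NFA states, and the toy NFA for (A|C|G|T)∗G(G|C)G (state 0 loops on every letter, 0 →G→ 1, 1 →G or C→ 2, 2 →G→ 3, state 3 final) is an assumed encoding.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

NfaDelta = Dict[Tuple[int, str], Set[int]]


def determinize(alphabet: str, delta: NfaDelta, start: int, final: Set[int]):
    """Subset construction: explore the reachable sets of NFA states."""
    q0 = frozenset({start})
    index: Dict[FrozenSet[int], int] = {q0: 0}  # subset -> DFA state number
    order: List[FrozenSet[int]] = [q0]          # DFA state number -> subset
    dfa_delta: Dict[Tuple[int, str], int] = {}
    i = 0
    while i < len(order):  # 'order' grows while new subsets are discovered
        for a in alphabet:
            target = frozenset(
                s for q in order[i] for s in delta.get((q, a), ())
            )
            if target not in index:
                index[target] = len(order)
                order.append(target)
            dfa_delta[(i, a)] = index[target]
        i += 1
    # A DFA state is final when its subset contains at least one final NFA state.
    dfa_final = {j for j, subset in enumerate(order) if subset & final}
    return order, dfa_delta, dfa_final


# Assumed toy NFA for (A|C|G|T)*G(G|C)G.
nfa_delta: NfaDelta = {(0, a): {0} for a in "ACGT"}
nfa_delta[(0, "G")] = {0, 1}
nfa_delta[(1, "G")] = {2}
nfa_delta[(1, "C")] = {2}
nfa_delta[(2, "G")] = {3}

subsets, dfa_delta, dfa_final = determinize("ACGT", nfa_delta, 0, {3})
print(len(subsets), sorted(dfa_final))
```

Only the subsets actually reachable from the starting state are created, which is why the exponential worst case is rarely met on small motifs.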

Another critical point to keep in mind is that motif significance is by nature very sensitive to the parameters of the reference model. To be convinced, let us consider the following simple example with M = GGATG, a reference model M0 of parameter *π*, and ℓ = 1,000,000. If *π*(A) = *π*(T) = 0.10 and *π*(C) = *π*(G) = 0.40 we get **E**[*N*] = ℓ × 0.40<sup>3</sup> × 0.10<sup>2</sup> ≈ 640.0. Now if *π*(A) = *π*(T) = 0.08 and *π*(C) = *π*(G) = 0.42 then **E**[*N*] = ℓ × 0.42<sup>3</sup> × 0.08<sup>2</sup> ≈ 474.2. If we admit that the standard deviation of *N* is roughly equal to *σ* = 25 (we will see later on how to perform such computation), an observation of *n* = 550 could be interpreted as a significant under-representation with the first parameters, and a significant over-representation with the second parameters (observation *n* deviates from the expectation by more than three standard deviations in both cases). The reason behind this is that parameter values are typically involved in complex products when evaluating the significance of an observation, and that such operations usually amplify small variations rather than averaging them out (as sums do). This problem has been investigated in Nuel (2006c) where it is shown that unwise choices of *d* might lead to many false positive results.
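The arithmetic of this example can be checked with a short Python sketch. The helper `expected_count` is a hypothetical name; as in the text, it uses the approximation **E**[*N*] ≈ ℓ × (product of the letter probabilities), and the value *σ* = 25 is the rough standard deviation admitted above.

```python
from typing import Dict


def expected_count(motif: str, pi: Dict[str, float], length: int) -> float:
    """Approximate expected count of a word under M0: length * prod pi(letter)."""
    prob = 1.0
    for letter in motif:
        prob *= pi[letter]
    return length * prob


first = {"A": 0.10, "T": 0.10, "C": 0.40, "G": 0.40}
second = {"A": 0.08, "T": 0.08, "C": 0.42, "G": 0.42}
sigma = 25.0  # rough standard deviation admitted in the text
n = 550       # observed count

for name, pi in (("first", first), ("second", second)):
    mean = expected_count("GGATG", pi, 1_000_000)
    z = (n - mean) / sigma
    print(f"{name}: E[N] = {mean:.1f}, (n - E[N]) / sigma = {z:+.2f}")
```

The sign of the standardized deviation flips between the two parameter sets, which is exactly the sensitivity discussed above.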

#### **3.2 Monte-Carlo simulations**

Since the theoretical distribution of *N* is not easy to obtain, it is tempting to study it from the empirical point of view by performing simple simulations. The approach is quite straightforward:

1) generate a random dataset *i* according to the reference model;
2) count the number of occurrences *n<sub>i</sub>* of M in the dataset;
3) repeat 1) and 2) until we have a sample *n*<sub>1</sub>, *n*<sub>2</sub>, ..., *n<sub>r</sub>*.

Once a reference sample has been obtained, we can derive the empirical *p*-value of the observation *n* using:

$$\hat{\mathbb{P}}(N \leqslant n) = \frac{\sum\_{i=1}^{r} \mathbf{1}\_{n\_i \leqslant n}}{r} \quad \text{or} \quad \hat{\mathbb{P}}(N \geqslant n) = \frac{\sum\_{i=1}^{r} \mathbf{1}\_{n\_i \geqslant n}}{r} \tag{8}$$

Fig. 4. Minimal DFA whose language is (A|C|G|T)<sup>∗</sup>G(G|C)G.

or, alternatively, one might use this sample to derive empirical expectation, variance, and z-score:

$$
\widehat{Z}(n) = \frac{n - \widehat{\mu}}{\widehat{\sigma}} \quad \text{with} \quad \widehat{\mu} = \frac{1}{r} \sum\_{i=1}^{r} n\_i \quad \text{and} \quad \widehat{\sigma}^2 = \frac{1}{r} \sum\_{i=1}^{r} (n\_i - \widehat{\mu})^2. \tag{9}
$$
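The Monte-Carlo procedure and the estimators of Equations (8) and (9) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the reference model is an M0 model, and for simplicity the motif is a plain word counted with overlaps rather than a general regular expression.

```python
import random
from typing import Dict, List, Tuple


def count_word(seq: str, word: str) -> int:
    """Number of (possibly overlapping) occurrences of word in seq."""
    return sum(
        1 for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i)
    )


def simulate_counts(word: str, pi: Dict[str, float], length: int,
                    r: int, seed: int = 0) -> List[int]:
    """Steps 1) to 3): draw r i.i.d. sequences under M0 and count the word."""
    rng = random.Random(seed)
    letters = list(pi)
    weights = [pi[a] for a in letters]
    sample = []
    for _ in range(r):
        seq = "".join(rng.choices(letters, weights=weights, k=length))
        sample.append(count_word(seq, word))
    return sample


def empirical_pvalues(sample: List[int], n: int) -> Tuple[float, float]:
    """Equation (8): estimates of P(N <= n) and P(N >= n)."""
    r = len(sample)
    lower = sum(1 for x in sample if x <= n) / r
    upper = sum(1 for x in sample if x >= n) / r
    return lower, upper


def empirical_zscore(sample: List[int], n: int) -> float:
    """Equation (9); assumes the sample is not constant (non-zero variance)."""
    r = len(sample)
    mu = sum(sample) / r
    var = sum((x - mu) ** 2 for x in sample) / r
    return (n - mu) / var ** 0.5


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
sample = simulate_counts("GCG", uniform, length=1000, r=200)
print(empirical_pvalues(sample, 25), empirical_zscore(sample, 25))
```

With *r* = 200 the smallest non-zero empirical *p*-value is 1/200, so such a sample cannot resolve very small *p*-values.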

Although this approach is quite simple, it suffers from several drawbacks: 1) it is slow; 2) the sample size must be large to obtain accurate results. Indeed, if the true *p*-value is *p*, then *r* × *p̂* ∼ B(*r*, *p*) where *r* is the sample size. The following table gives a 90% upper confidence bound for *p̂* for several values of *r* in the case where *p* = 10<sup>−5</sup>:

$$\begin{array}{c|cccccc} r & 10^3 & 10^4 & 10^5 & 10^6 & 10^7 & 10^8\\ \hline \text{bound} & 1.00 \times 10^{-3} & 1.00 \times 10^{-4} & 3.00 \times 10^{-5} & 1.50 \times 10^{-5} & 1.14 \times 10^{-5} & 1.04 \times 10^{-5} \end{array}$$

We clearly see that it requires at least *r* = 10<sup>6</sup> samples to obtain the first accurate digit of *p̂*, and a prohibitive *r* = 10<sup>8</sup> samples for the second digit. Considering that very small *p*-values are easily encountered in motif significance (e.g., 10<sup>−20</sup>, 10<sup>−50</sup>, 10<sup>−100</sup>), it is clear that empirical *p*-values are of limited interest in this context.

The empirical z-score does not suffer from the same drawback but makes the implicit assumption that *N* has a Gaussian distribution, which is highly questionable as we will see later on.

For completeness, let us point out that *importance sampling* techniques might solve the estimation problem by sampling *N* using a tailored dataset distribution (Chan et al., 2010). However, these sophisticated numerical techniques are slow and require good skills to be implemented.

#### **3.3 Markov chain embedding**

The key to performing any motif significance computation is first to embed the original problem into an order 1 Markov chain taking into account all the combinatorial complexity. This technique, called *Markov chain embedding*, has been used by many authors in the context of motif significance (Antzoulakos, 2001; Boeva et al., 2005; Chang, 2005; Fu, 1996; Nuel, 2006a), but it is only recently that its connection to NFA and Deterministic Finite Automata (DFA) has been pointed out (Crochemore & Stefanov, 2003; Lladser, 2007; Nicodème et al., 2002; Nuel, 2008a; Nuel & Prum, 2007; Ribeca & Raineri, 2008).

in Biological Sequences 11

Significance Score of Motifs in Biological Sequences 183

 

 

Fig. 5. Minimal order 1 DFA whose language is (A|C|G|T)<sup>G</sup>(G|C)G. The order 1 past of each state is indicated in the state itself. Diamond-shaped states correspond to the elements of

of Figure 5, *d* = 1, and with *X*<sup>1</sup> = A, we see that the Markov chain *Zd*: is defined on

*π*(A, A) *π*(A, C) *π*(A, G) *π*(A, T) 0000 *π*(C, A) *π*(C, C) *π*(C, G) *π*(C, T) 0000 *π*(G, A) 0 0 *π*(G, T) *π*(G, C) *π*(G, G) 0 0 *π*(T, A) *π*(T, C) *π*(T, G) *π*(T, T) 0000 *π*(C, A) *π*(C, C) 0 *π*(C, T) 0 0 *π*(C, G) 0 *π*(G, A) 0 0 *π*(G, T) *π*(G, C) 0 0 *π*(G, G) *π*(G, A) 0 0 *π*(G, T) *π*(G, C) *π*(G, G) 0 0 *π*(G, A) 0 0 *π*(G, T) *π*(G, C) 0 0 *π*(G, G)

From now on, we assume that our motif problem with M*d* reference model is embedded into the Markov chain *Zd*: whose transition matrix is decomposed into **T** = **P** + **Q** where matrices

We present here the main results that are then used to derive exact computations and various approximations of *S*(*n*). In all this section, we assume that *N* is the random number of occurrences of M in *X*1:, a sequence generated by a M*d* model (*X*1:*<sup>d</sup>* being fixed) with *d* 0. we denote by **T** = **P** + **Q** be the transition (*L* × *L*) matrix of the Markov chain embedding of the corresponding problem. We also introduce two vectors: **u** a 1 × *L* vector filled with '0' and

**Proposition 4** (probability generating function)**.** If we denote by *G*(*y*) = **E**[*yN*] the probability

**P**(*N* = *n*)*y<sup>n</sup>* = **u**(**P** + *y***Q**)<sup>−</sup>*d***v**. (12)

**<sup>P</sup>** and **<sup>Q</sup>** are defined for all *<sup>p</sup>*, *<sup>q</sup>* by: **<sup>P</sup>**(*p*, *<sup>q</sup>*) = **<sup>T</sup>**(*p*, *<sup>q</sup>*)**1***q*∈F/ , and **<sup>Q</sup>**(*p*, *<sup>q</sup>*) = **<sup>T</sup>**(*p*, *<sup>q</sup>*)**1***q*∈F .

having a '1' in the position corresponding to *X*1:*d*, and **v** a *L* × 1 vector of '1'.

 

 

 

 

 

⎞

⎟⎟⎟⎟⎟⎟⎟⎟⎟⎠ .

 

 

 


*<sup>δ</sup>*(0, <sup>A</sup>1).

**T** =

**3.4 Main results**

⎛

⎜⎜⎜⎜⎜⎜⎜⎜⎜⎝

generating function (pgf) of *N*, then we have:

*G*(*y*) = ∑

*n*0

 

{1, 2, 3, 4, 5, 6, 7, 8} by *Z*<sup>1</sup> = 1 and the following transition matrix:

 

 

We start with a NFA whose language is <sup>A</sup>�<sup>M</sup> from which we build a DFA (A, <sup>Q</sup>, *<sup>q</sup>*0, <sup>F</sup>, *<sup>δ</sup>*) using the determinization algorithm (Algorithm 3). A DFA differs from an NFA only by the definition of its transition function: *δ* : Q×A → P(Q) for a NFA, and *δ* : Q×A → Q for a DFA. For example, we can see on Figure 4, a (minimal) DFA whose language is (A|C|G|T)�G(G|C)G. This DFA has more states (6) than the corresponding NFA (4). In fact, since the state space Q<sup>2</sup> of a DFA corresponds to a subset of the parts of the original NFA state space Q1, we have |Q2| 2|Q1<sup>|</sup> . Fortunately, this upper bound is seldom reached in practice.

**Theorem 2** (Markov chain embedding for Model M0)**.** Let (A, Q, *σ*, F, *δ*) be a (minimal) DFA whose language is <sup>A</sup>�M. Let *<sup>X</sup>*1:� be a random sequence generated by the M0 model of parameter *π*. We consider the sequence *Z*0:� recursively defined by *Z*<sup>0</sup> = *σ*, and *Zi* = *<sup>δ</sup>*(*Zi*−1, *Xi*) for all 1 *<sup>i</sup>* �. Then *<sup>Z</sup>*0:� is an order 1 Markov chain whose transition matrix **<sup>T</sup>** is defined for all *p*, *q* ∈ Q by:

$$\mathbf{T}(p,q) = \sum\_{a \in \mathcal{A}\mathcal{A}(p,a) = q} \pi(a) \tag{10}$$

and having the following property for all 1 *<sup>i</sup>* �: *<sup>X</sup>*1:*<sup>i</sup>* ∈ A�M ⇐⇒ *Zi* ∈ F.

For example, if we consider the DNA motif G(G|C)G and the corresponding DFA of Figure 4, we get the following transition matrix:

$$\mathbf{T} = \begin{pmatrix} \pi(\mathbf{A}) + \pi(\mathbf{C}) + \pi(\mathbf{T}) & \pi(\mathbf{G}) & 0 & 0 & 0 & 0 \\ \pi(\mathbf{A}) + \pi(\mathbf{T}) & 0 & \pi(\mathbf{C}) & \pi(\mathbf{G}) & 0 & 0 \\ \pi(\mathbf{A}) + \pi(\mathbf{C}) + \pi(\mathbf{T}) & 0 & 0 & 0 & \pi(\mathbf{G}) & 0 \\ \pi(\mathbf{A}) + \pi(\mathbf{T}) & 0 & \pi(\mathbf{C}) & 0 & 0 & \pi(\mathbf{G}) \\ \pi(\mathbf{A}) + \pi(\mathbf{T}) & 0 & \pi(\mathbf{C}) & \pi(\mathbf{G}) & 0 & 0 \\ \pi(\mathbf{A}) + \pi(\mathbf{T}) & 0 & \pi(\mathbf{C}) & 0 & 0 & \pi(\mathbf{G}) \end{pmatrix}.$$

In order to extend Theorem 2 to order M*d* with *d* > 0 it is necessary to build DFA (A, <sup>Q</sup>, *<sup>σ</sup>*, <sup>F</sup>, *<sup>δ</sup>*) be a (minimal) DFA whose language is <sup>A</sup>�<sup>M</sup> and with the property that for all *<sup>q</sup>* ∈ Q, past(*q*) = {*<sup>a</sup>* ∈ A*d*, <sup>∃</sup>*<sup>p</sup>* ∈ Q, *<sup>δ</sup>*(*p*, *<sup>a</sup>*) = *<sup>q</sup>*} is either empty or a singleton. A DFA having this property is called a order *d* DFA by Lladser (2007), and is called non *d*-ambiguous by Nuel (2008a). The construction of such a (minimal) DFA is not very complicated but is a bit technical. A possible approach suggested by Nuel (2008a) consists in starting from a DFA without this property and duplicating any "ambiguous" state. Another more straightforward approach consists in adding the elements of <sup>A</sup>�A*<sup>d</sup>* to the original language with a specific label for the final states corresponding to each elements of <sup>A</sup>*d*, and to keep these labels during minimization and determinization algorithms.

**Theorem 3** (Markov chain embedding for Model M*d*)**.** Let (A, Q, *σ*, F, *δ*) be a (minimal) order *<sup>d</sup>* DFA whose language is <sup>A</sup>�M. Let *<sup>X</sup>*1:� be a random sequence generated by the M*<sup>d</sup>* model of parameter *π*. We consider the sequence *Zd*:� recursively defined by *Zd* = *δ*(*σ*, *X*1:*d*), and *Zi* = *<sup>δ</sup>*(*Zi*−1, *Xi*) for all 1 *<sup>i</sup>* �. Then *Zd*:� is an order 1 Markov chain whose transition matrix **T** is defined for all *p*, *q* ∈ Q by:

$$\mathbf{T}(p,q) = \sum\_{a \in \mathcal{A}, \delta(p,a) = q} \pi(\mathbf{past}(p), a) \tag{11}$$

and having the following property for all 1 ≤ *i* ≤ ℓ: *X*<sub>1:*i*</sub> ∈ A<sup>∗</sup>M ⇐⇒ *Z*<sub>*i*</sub> ∈ F.

Fig. 5. Minimal order 1 DFA whose language is (A|C|G|T)<sup>∗</sup>G(G|C)G. The order 1 past of each state is indicated in the state itself. Diamond-shaped states correspond to the elements of *δ*(0, A<sup>1</sup>).

One should note that *Z*<sub>*d*:ℓ</sub> is defined on *δ*(*σ*, A<sup>*d*</sup>A<sup>∗</sup>), which could be slightly smaller than Q. This subset corresponds to the states of Q having an order *d* past. If we consider the DFA of Figure 5, *d* = 1, and with *X*<sub>1</sub> = A, we see that the Markov chain *Z*<sub>*d*:ℓ</sub> is defined on {1, 2, 3, 4, 5, 6, 7, 8} by *Z*<sub>1</sub> = 1 and the following transition matrix:


From now on, we assume that our motif problem with M*d* reference model is embedded into the Markov chain *Z*<sub>*d*:ℓ</sub> whose transition matrix is decomposed into **T** = **P** + **Q**, where the matrices **P** and **Q** are defined for all *p*, *q* by **P**(*p*, *q*) = **T**(*p*, *q*)**1**<sub>*q*∉F</sub> and **Q**(*p*, *q*) = **T**(*p*, *q*)**1**<sub>*q*∈F</sub>.

#### **3.4 Main results**

We present here the main results that are then used to derive exact computations and various approximations of *S*(*n*). Throughout this section, we assume that *N* is the random number of occurrences of M in *X*<sub>1:ℓ</sub>, a sequence generated by an M*d* model (*X*<sub>1:*d*</sub> being fixed) with *d* ≥ 0. We denote by **T** = **P** + **Q** the (*L* × *L*) transition matrix of the Markov chain embedding of the corresponding problem. We also introduce two vectors: **u**, a 1 × *L* vector filled with '0' and having a '1' in the position corresponding to *X*<sub>1:*d*</sub>, and **v**, an *L* × 1 vector of '1'.

**Proposition 4** (probability generating function)**.** If we denote by *G*(*y*) = **E**[*y*<sup>*N*</sup>] the probability generating function (pgf) of *N*, then we have:

$$G(\boldsymbol{y}) = \sum\_{n\geq 0} \mathbb{P}(N=n)\boldsymbol{y}^{\boldsymbol{n}} = \mathbf{u}(\mathbf{P} + \boldsymbol{y}\mathbf{Q})^{\ell-d}\mathbf{v}.\tag{12}$$


*Proof.* The first equality derives directly from the definition of *G*(*y*). For the second equality, it is clear that **u**(**P** + **Q**)<sup>ℓ−*d*</sup> gives the marginal distribution of *Z*<sub>ℓ</sub>. We then connect this distribution to *N* by counting the number of times we use the transitions of **Q** with the dummy variable *y*, so that **u**(**P** + *y***Q**)<sup>ℓ−*d*</sup> gives the joint distribution of (*Z*<sub>ℓ</sub>, *N*). Finally, we sum up the contributions of all states using the product with **v**.

For example, let us consider M = G(G|C)G and *X*<sub>1:12</sub> generated by an M0 model with parameters *π*(A) = *π*(T) = 0.10 and *π*(C) = *π*(G) = 0.40. Proposition 4 hence gives:

$$G(y) = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \times \begin{pmatrix} 0.6 & 0.4 & 0 & 0 & 0 & 0\\ 0.2 & 0 & 0.4 & 0.4 & 0 & 0\\ 0.6 & 0 & 0 & 0 & 0.4y & 0\\ 0.2 & 0 & 0.4 & 0 & 0 & 0.4y\\ 0.2 & 0 & 0.4 & 0.4 & 0 & 0\\ 0.2 & 0 & 0.4 & 0 & 0 & 0.4y \end{pmatrix}^{12} \times \begin{pmatrix} 1\\ 1\\ 1\\ 1\\ 1\\ 1 \end{pmatrix} \tag{13}$$

$$= 0.33369 + 0.31148y + 0.19357y^2 + 0.09681y^3 + 0.04140y^4 + 0.01569y^5$$

$$+ 0.00528y^6 + 0.00157y^7 + 0.00042y^8 + 0.00008y^9 + 0.00002y^{10}.\tag{14}$$

From this result, we have the whole distribution of *N*: support is {0, 1, . . . , 10}, **P**(*N* = 0) = 0.33369, **P**(*N* = 1) = 0.31148, ..., **P**(*N* = 10) = 0.00002. We can also easily derive moments of *N* from this distribution: **E**[*N*] = 1.28, *σ*[*N*] = 1.29.
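The coefficients of (14) can be recomputed with a few lines of code. The sketch below evaluates **u**(**P** + *y***Q**)<sup>12</sup>**v** by applying the matrix twelve times to the row vector **u** rather than by forming the matrix power, and keeps one polynomial in *y* per state. The encoding `DELTA`, the set `FINAL` and the function name are assumptions made for the illustration (states renumbered 0 to 5, final states 4 and 5).

```python
PI = {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10}

# Hypothetical encoding of the DFA for G(G|C)G; states 4 and 5 are final.
DELTA = [
    {"A": 0, "C": 0, "G": 1, "T": 0},
    {"A": 0, "C": 2, "G": 3, "T": 0},
    {"A": 0, "C": 0, "G": 4, "T": 0},
    {"A": 0, "C": 2, "G": 5, "T": 0},
    {"A": 0, "C": 2, "G": 3, "T": 0},
    {"A": 0, "C": 2, "G": 5, "T": 0},
]
FINAL = {4, 5}


def pgf_coefficients(steps, start=0):
    """Coefficients of u (P + yQ)^steps v, i.e. P(N = k) for k = 0, 1, ..."""
    size = len(DELTA)
    # weights[q][k] = probability of being in state q with k occurrences so far.
    weights = [[1.0] if q == start else [0.0] for q in range(size)]
    for _ in range(steps):
        degree = len(weights[0])
        new = [[0.0] * (degree + 1) for _ in range(size)]
        for p in range(size):
            for letter, q in DELTA[p].items():
                shift = 1 if q in FINAL else 0  # a transition of Q carries one factor y
                for k, coeff in enumerate(weights[p]):
                    new[q][k + shift] += PI[letter] * coeff
        weights = new
    # Multiplying by v sums the polynomials of all states.
    return [sum(weights[q][k] for q in range(size)) for k in range(len(weights[0]))]


COEFFS = pgf_coefficients(12)
MEAN = sum(k * c for k, c in enumerate(COEFFS))
print([round(c, 5) for c in COEFFS[:11]], round(MEAN, 6))
```

The coefficient of *y*<sup>10</sup> corresponds to the all-G sequence and therefore equals 0.4<sup>12</sup>, which gives a quick sanity check of the encoding.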

**Lemma 5** (derivatives of the pgf)**.** For any *k* ≥ 0, the order *k* derivative of the pgf *G* is given by:

$$G^{(k)}(y) = k! [z^k] \mathbf{u}(\mathbf{P} + y\mathbf{Q} + z\mathbf{Q})^{\ell - d} \mathbf{v} \tag{15}$$

where the [*z*<sup>*k*</sup>] operator denotes the extraction of the coefficient of *z*<sup>*k*</sup> in the expression.

*Proof.* The formal proof can be found in Nuel (2010) in a slightly less general case. Here we prove it only for the first two derivatives in the particular case where ℓ − *d* = 3. Starting from *G*(*y*) = **u**(**P** + *y***Q**)<sup>3</sup>**v** we get:

$$G'(y) = \mathbf{u} \left( \mathbf{Q} (\mathbf{P} + y\mathbf{Q})^2 + (\mathbf{P} + y\mathbf{Q})\mathbf{Q} (\mathbf{P} + y\mathbf{Q}) + (\mathbf{P} + y\mathbf{Q})^2 \mathbf{Q} \right) \mathbf{v} \tag{16}$$

and

$$G''(y) = 2\mathbf{u}\left(\mathbf{Q}^2(\mathbf{P} + y\mathbf{Q}) + \mathbf{Q}(\mathbf{P} + y\mathbf{Q})\mathbf{Q} + (\mathbf{P} + y\mathbf{Q})\mathbf{Q}^2\right)\mathbf{v} \tag{17}$$

which are easily connected to the coefficients of *z*<sup>1</sup> and *z*<sup>2</sup> in **u**(**P** + *y***Q** + *z***Q**)<sup>ℓ−*d*</sup>**v**.

If we denote, for all *k* ≥ 0, the *k*-th *factorial moment* of *N* by *F*<sub>*k*</sub> = **E**[*N*!/(*N* − *k*)!], then, by the definition of the pgf, it is clear that *F*<sub>*k*</sub> = *G*<sup>(*k*)</sup>(1), and setting *y* = 1 in Lemma 5 (so that **P** + *y***Q** = **T**) we get:

$$F\_k = k! [z^k] \mathbf{u}(\mathbf{T} + z\mathbf{Q})^{\ell - d} \mathbf{v}. \tag{18}$$
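Eq. (18) lends itself to the same kind of computation as the pgf. In the sketch below, every transition into a final state carries the factor 1 + *z*, which is how **T** + *z***Q** acts, and the factorial moments are read off the coefficients of the resulting polynomial. The DFA encoding, the parameter values (those of the running example) and the function name are assumptions made for the illustration.

```python
import math

PI = {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10}

# Hypothetical encoding of the DFA for G(G|C)G; states 4 and 5 are final.
DELTA = [
    {"A": 0, "C": 0, "G": 1, "T": 0},
    {"A": 0, "C": 2, "G": 3, "T": 0},
    {"A": 0, "C": 0, "G": 4, "T": 0},
    {"A": 0, "C": 2, "G": 5, "T": 0},
    {"A": 0, "C": 2, "G": 3, "T": 0},
    {"A": 0, "C": 2, "G": 5, "T": 0},
]
FINAL = {4, 5}


def factorial_moments(steps, start=0):
    """Return [F_0, F_1, ...] with F_k = k! [z^k] u (T + zQ)^steps v."""
    size = len(DELTA)
    weights = [[1.0] if q == start else [0.0] for q in range(size)]
    for _ in range(steps):
        degree = len(weights[0])
        new = [[0.0] * (degree + 1) for _ in range(size)]
        for p in range(size):
            for letter, q in DELTA[p].items():
                for k, coeff in enumerate(weights[p]):
                    new[q][k] += PI[letter] * coeff          # contribution of T
                    if q in FINAL:
                        new[q][k + 1] += PI[letter] * coeff  # contribution of zQ
        weights = new
    coeffs = [sum(weights[q][k] for q in range(size)) for k in range(len(weights[0]))]
    return [math.factorial(k) * c for k, c in enumerate(coeffs)]


F = factorial_moments(12)
print([round(x, 7) for x in F[:5]])
```

Here `F[0]` is 1 because the rows of **T** sum to one, and `F[1]` is the expectation of *N*.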

And if we now denote the moment generating function (mgf) of *N* by *M*(*t*) = **E**[*e*<sup>*tN*</sup>] = *G*(*e*<sup>*t*</sup>), and the cumulant generating function (cgf) of *N* by Λ(*t*) = log **E**[*e*<sup>*tN*</sup>] = log *M*(*t*) = log *G*(*e*<sup>*t*</sup>), we directly get the *k*-th moment of *N*: **E**[*N*<sup>*k*</sup>] = *M*<sup>(*k*)</sup>(0); and the *k*-th cumulant of *N*: *κ*<sub>*k*</sub> = Λ<sup>(*k*)</sup>(0).

**Corollary 6** (characteristic moments)**.** If we denote by *μ* = *κ*<sub>1</sub> the *expectation* of *N*, by *σ* = √*κ*<sub>2</sub> the *standard deviation* of *N*, by *γ*<sub>1</sub> = *κ*<sub>3</sub>/*σ*<sup>3</sup> the *skewness* of *N*, and by *γ*<sub>2</sub> = *κ*<sub>4</sub>/*σ*<sup>4</sup> the *excess kurtosis* of *N*, then we get: *μ* = *F*<sub>1</sub>, *σ*<sup>2</sup> = *F*<sub>2</sub> + *F*<sub>1</sub> − *F*<sub>1</sub><sup>2</sup>,

$$\gamma\_1 = \frac{3F\_2 - 3F\_1^2 + F\_3 - 3F\_1 F\_2 + 2F\_1^3 + F\_1}{\sigma^3},\tag{19}$$

and


$$\gamma\_2 = \frac{7F\_2 - 7F\_1^2 + 6F\_3 - 18F\_1F\_2 + 12F\_1^3 + F\_4 - 4F\_1F\_3 - 3F\_2^2 + 12F\_1^2F\_2 - 6F\_1^4 + F\_1}{\sigma^4}.\tag{20}$$

*Proof.* One just needs to compute the derivatives Λ<sup>(1)</sup>(0), Λ<sup>(2)</sup>(0), Λ<sup>(3)</sup>(0), and Λ<sup>(4)</sup>(0).
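As an illustration of this computation, the first two cumulants can be worked out explicitly. The derivation below only uses *M*(*t*) = *G*(*e*<sup>*t*</sup>), *M*(0) = 1 and the fact that the *k*-th derivative of *G* at *y* = 1 is the factorial moment *F*<sub>*k*</sub>:

$$M'(t) = e^{t} G'(e^{t}), \qquad M''(t) = e^{t} G'(e^{t}) + e^{2t} G''(e^{t}), \qquad \text{hence} \quad M'(0) = F\_1, \quad M''(0) = F\_1 + F\_2,$$

$$\kappa\_1 = \Lambda'(0) = \frac{M'(0)}{M(0)} = F\_1, \qquad \kappa\_2 = \Lambda''(0) = \frac{M''(0)}{M(0)} - \left(\frac{M'(0)}{M(0)}\right)^2 = F\_2 + F\_1 - F\_1^2.$$

The third and fourth cumulants follow in the same way, at the price of longer expressions.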

If we consider again M = G(G|C)G and *X*<sub>1:12</sub> generated by an M0 model with parameters *π*(A) = *π*(T) = 0.10 and *π*(C) = *π*(G) = 0.40, Eq. (18) gives, with the dummy variable written *y* instead of *z*:

$$\sum\_{k \geqslant 0} \frac{F\_k}{k!} y^k = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \times \begin{pmatrix} 0.6 & 0.4 & 0 & 0 & 0 & 0\\ 0.2 & 0 & 0.4 & 0.4 & 0 & 0\\ 0.6 & 0 & 0 & 0 & 0.4 + 0.4y & 0\\ 0.2 & 0 & 0.4 & 0 & 0 & 0.4 + 0.4y\\ 0.2 & 0 & 0.4 & 0.4 & 0 & 0\\ 0.2 & 0 & 0.4 & 0 & 0 & 0.4 + 0.4y \end{pmatrix}^{12} \times \begin{pmatrix} 1\\ 1\\ 1\\ 1\\ 1\\ 1 \end{pmatrix} \tag{21}$$

$$= 1 + 1.28y + 1.01683y^2 + 0.61211y^3 + 0.29709y^4 + 0.11835y^5$$

$$+ 0.03845y^6 + 0.00992y^7 + 0.00193y^8 + 0.00025y^9 + 0.00002y^{10}. \tag{22}$$

From this result, we can get all factorial moments of *N*: **E**(1) = *F*<sub>0</sub> = 1, **E**(*N*) = *F*<sub>1</sub> = 1.28, **E**(*N*(*N* − 1)) = *F*<sub>2</sub> = 2.033664, **E**(*N*(*N* − 1)(*N* − 2)) = *F*<sub>3</sub> = 3.6726374, **E**(*N*(*N* − 1)(*N* − 2)(*N* − 3)) = *F*<sub>4</sub> = 7.1302266, ..., **E**(*N*!/(*N* − 10)!) = *F*<sub>10</sub> = 60.881161. Thanks to Corollary 6 we get the following characteristic moments: *μ* = 1.28, *σ* = 1.294320, *γ*<sub>1</sub> = 1.163783, *γ*<sub>2</sub> = 1.492661.
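The formulas of Corollary 6 are easy to apply numerically. The sketch below implements them as written in (19) and (20) and feeds them the four factorial moments quoted above; the function name is illustrative and not taken from the original text.

```python
import math


def characteristic_moments(f1, f2, f3, f4):
    """Expectation, standard deviation, skewness and excess kurtosis from F_1..F_4."""
    variance = f2 + f1 - f1 ** 2
    sigma = math.sqrt(variance)
    # Numerator of Eq. (19): third cumulant.
    kappa3 = 3 * f2 - 3 * f1 ** 2 + f3 - 3 * f1 * f2 + 2 * f1 ** 3 + f1
    # Numerator of Eq. (20): fourth cumulant.
    kappa4 = (7 * f2 - 7 * f1 ** 2 + 6 * f3 - 18 * f1 * f2 + 12 * f1 ** 3 + f4
              - 4 * f1 * f3 - 3 * f2 ** 2 + 12 * f1 ** 2 * f2 - 6 * f1 ** 4 + f1)
    return f1, sigma, kappa3 / sigma ** 3, kappa4 / sigma ** 4


MU, SIGMA, GAMMA1, GAMMA2 = characteristic_moments(1.28, 2.033664, 3.6726374, 7.1302266)
print(f"{MU:.6f} {SIGMA:.6f} {GAMMA1:.6f} {GAMMA2:.6f}")
```

Because the inputs are rounded to seven decimals, the last printed digits may differ slightly from the values quoted in the text.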

#### **3.5 Exact computations**

As we have seen above, Proposition 4 provides a way to obtain the whole distribution of *N* by computing *G*(*y*) = **u**(**P** + *y***Q**)<sup>ℓ−*d*</sup>**v**, from which we can easily derive *S*(*n*) for any *n* ≥ 0:

$$S(n) = \begin{cases} +\log\_{10}\left(\sum\_{k=0}^{n} [y^k] G(y)\right) & \text{if } n \leqslant \mathbb{E}[N], \\ -\log\_{10}\left(\sum\_{k=n}^{+\infty} [y^k] G(y)\right) & \text{if } n > \mathbb{E}[N]. \end{cases}$$
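Once the coefficients of *G*(*y*) are known, the score is a direct transcription of the definition above. The sketch below uses the coefficients of Eq. (14) as input; the function name and the list `PMF` are illustrative, and *n* is assumed to lie within the support so that the tail sums are non-zero.

```python
import math

# Coefficients [y^k] G(y) of Eq. (14), i.e. P(N = k) for k = 0..10.
PMF = [0.33369, 0.31148, 0.19357, 0.09681, 0.04140, 0.01569,
       0.00528, 0.00157, 0.00042, 0.00008, 0.00002]


def significance_score(n, pmf):
    """S(n): log10 of the lower tail if n <= E[N], minus log10 of the upper tail otherwise."""
    mean = sum(k * p for k, p in enumerate(pmf))
    if n <= mean:
        return math.log10(sum(pmf[: n + 1]))  # under-representation: P(N <= n)
    return -math.log10(sum(pmf[n:]))          # over-representation: P(N >= n)


for n in (0, 1, 5, 10):
    print(n, round(significance_score(n, PMF), 3))
```

With this sign convention the score is negative for under-represented counts and positive for over-represented ones.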

From the algorithmic point of view, there are basically two approaches to compute *S*(*n*) using Expression (12). The first one, called *power*, consists in computing (**P** + *y***Q**)�−*<sup>d</sup>* using the power method and a binary decomposition of � − *d*. Ex: if � − *d* = 1097 then � − *d* = 210 <sup>+</sup> <sup>2</sup><sup>6</sup> <sup>+</sup> 23 <sup>+</sup> 20. We then just have to recursively compute **<sup>D</sup>***k*(*y*)=(**<sup>P</sup>** <sup>+</sup> *<sup>y</sup>***Q**)2*<sup>k</sup>* using the relation **D***k*+1(*y*) = **D***k*(*y*) × **D***k*(*y*) for all *k* 0. Since in the computation of *S*(*n*) we are only interested in terms of degree *n* or less (or *n* or more), we can easily truncate <sup>2</sup> all polynomials at degree *n* thus dramatically reducing the computational costs of polynomial products. We end

<sup>2</sup> In the case of over-representation, all contributions of degree *n* or more are summed into the term of degree *n*.

in Biological Sequences 15

Significance Score of Motifs in Biological Sequences 187

Table 1. Characteristic moments the number *N* of occurrences of motif M = G(G|C)G in a sequence *X*1: generated by a M0 model with parameters *π*(A) = *π*(T) = 0.10 and

Since the random count *N* is basically defined by Eq. (2) as large sum of Bernouilli variables, the idea of approximating the distribution of *N* using Gaussian approximation sounds appealing. Indeed, Gaussian approximations are historically the first ones to have been suggested for this problem (Cowan, 1991; Kleffe & Borodovski, 1997; Pevzner et al., 1989; Prum et al., 1995). From the theoretical point of view, Central Limit Theorems (CLT) for weakly dependent variables ensure that *N* is asymptotically normal distributed. On Table 1, we can see the characteristic moments of *N* for motif M = G(G|C)G and various value of the sequence lengths . According to theory, we observe that the skewness and excess kurtosis both decease toward 0 when grows (a normal distribution has null skewness and excess kurtosis). But it is also clear that *N* is not normally distributed for small values of . As a consequence, the quality of a Gaussian approximation for *S*(*n*) is expected to be questionable

In order to overcome this issue, Nuel (2010) suggested to consider near Gaussian approximations instead of simple Gaussian approximations for this problem. The idea is simply to perform a higher order asymptotic development that exploits more than the two first moments of *N*. This technique is known as the Edgeworth's expansion. Blinnikov & Moessner (1998) gives a general (and rather complicated) formula for this expansion. For

probability distribution function (pdf) of a standard Gaussian, then for all *n* 0 we have

<sup>6</sup> *<sup>H</sup>*3(*z*) *<sup>C</sup>*2(*z*) = *<sup>S</sup>*<sup>4</sup>

where *<sup>μ</sup>* <sup>=</sup> **<sup>E</sup>**[*N*], *<sup>σ</sup>* <sup>=</sup> **<sup>V</sup>**[*N*], *<sup>z</sup>* = (*<sup>n</sup>* <sup>−</sup> *<sup>μ</sup>*)/*σ*, *Sk* <sup>=</sup> *<sup>κ</sup>k*/*σ*2*k*−<sup>2</sup> for all *<sup>k</sup>* 1, and where *Hk*(*z*) are the Hermite polynomials defined recursively by *<sup>H</sup>*0(*z*) = 1 and *Hk*(*z*) = *zHk*−1(*z*) −

*C*0(*z*) + *σC*1(*z*) + *σ*2*C*2(*z*) + *σ*3*C*3(*z*)

<sup>144</sup> *<sup>H</sup>*7(*z*) + *<sup>S</sup>*<sup>3</sup>

<sup>24</sup> *<sup>H</sup>*4(*z*) + *<sup>S</sup>*<sup>2</sup>

3

3

<sup>√</sup>2*<sup>π</sup>* the

(24)

<sup>1296</sup> *<sup>H</sup>*9(*z*) (26)

<sup>72</sup> *<sup>H</sup>*6(*z*) (25)

*π*(C) = *π*(G) = 0.40. Computation performed using the power approach.

practical purpose, we present the result only up to order 3 expansions.

*σ* 

**<sup>P</sup>**(*<sup>N</sup>* <sup>=</sup> *<sup>n</sup>*) � *<sup>ϕ</sup>*(*z*)

*<sup>C</sup>*0(*z*) = <sup>1</sup> *<sup>C</sup>*1(*z*) = *<sup>S</sup>*<sup>3</sup>

*<sup>C</sup>*3(*z*) = *<sup>S</sup>*<sup>5</sup>

**Proposition 7** (Edgeworth's expansion)**.** If we denote by *<sup>ϕ</sup>*(*z*) = exp(−*z*2/2)/

<sup>120</sup> *<sup>H</sup>*5(*z*) + *<sup>S</sup>*3*S*<sup>4</sup>

**3.6 Near-Gaussian approximations**

at finite distance.

with

*H*�

the following approximation:

*<sup>k</sup>*−1(*z*) for all *<sup>k</sup>* 1.

 expectation std. dev. skewness e. kurtosis time (s) 12 1.280000 1.294320 1.163783 1.492661 0.01 120 15.104000 4.585724 0.361328 0.149974 0.02 1200 153.344000 14.648033 0.113920 0.014936 0.03 12000 1535.744000 46.367282 0.036014 0.001492 0.04 120000 15359.744000 146.640798 0.011394 −0.000410 0.05

up with a *<sup>O</sup>*(log2 <sup>×</sup> *<sup>n</sup>*<sup>2</sup> <sup>×</sup> *<sup>L</sup>*3) complexity in time where *<sup>L</sup>* is the order of the transition matrix **<sup>T</sup>** <sup>=</sup> **<sup>P</sup>** <sup>+</sup> **<sup>Q</sup>**. The corresponding memory complexity is *<sup>O</sup>*(log2 <sup>×</sup> *<sup>n</sup>* <sup>×</sup> *<sup>L</sup>*2). Since the length of the dataset appears in a logarithmic scale in these complexity, the power approach is obviously suitable for large datasets (ex: = 10<sup>6</sup> or = 109). Unfortunately, the cubic complexity with *L* (quadratic in memory) prevents the approach to deal with complex motifs with high *L*. One should also note that the quadratic complexity in *n* could really be a problem when dealing with frequent motifs and/or large datasets. In order to overcome this problem, Ribeca & Raineri (2008) suggested to use fast Fourier transforms (FFT) to perform all polynomial product hence replacing *<sup>n</sup>*<sup>2</sup> by *<sup>n</sup>* log2 *<sup>n</sup>* in the time complexity. However appealing at first glance, this approach is not recommended in practice since the FFT products in floating-point arithmetics induce numerical instabilities that make totally unreliable the smallest coefficients of the polynomials. And unfortunately, these coefficients are precisely the one needed to study the tail distribution of *N*.

Another interesting approach called *full recursion*, consists in computing **v***<sup>i</sup>* = (**P** + *y***Q**)*<sup>i</sup>* **v** for all 0 *i* − *d* recursively using the relation **v***i*+<sup>1</sup> = (**P** + *y***Q**)**v***i*. There are two main interests for this approach: 1) we have only products between polynomials of degree 1 and polynomials of degree *n* (by dropping terms of degree greater than *n* like in the power approach); 2) we can take full advantage of the sparse structure (only *L* × |A| non-zero terms in the worst case) of the transition matrix **T** = **P** + **Q**. The resulting complexity is *O*( × *L* × |A| × *n*) in time, and *O*(*L* × *n*) in memory. Since these complexities are linear with *L*, this approach is able to handle very complex motifs. The drawback is that the approach can be very slow when dealing with large and *n*. It exists a sophisticated version of this recursion called *partial recursion* (see Nuel & Dumas, 2010) which allows to replace <sup>×</sup> *<sup>n</sup>* by log <sup>×</sup> *<sup>n</sup>*<sup>2</sup> in the time complexity. However, the quadratic complexity in *n* and numerical instabilities in floating-point arithmetic restrains its use to small *n* (ex: *n* 10).

For completeness, let us point out another approach to the problem. The idea is that we can derive from Expression (12) the following expression:

$$G(y, z) = \sum\_{n \ge 0} \sum\_{\ell \ge d} \mathbb{P}(N\_{\ell} = n) y^n z^{\ell} = \mathbf{u} z^d (\mathbf{I} - \mathbf{P}z + yz \mathbf{Q})^{-1} \mathbf{v} \tag{23}$$

where **I** is the identity matrix and *N* the number of matching position in *X*1:. It is then possible to obtain **P**(*N* = *n*) for any and *n* using (fast) Taylor expansions of *G*(*y*, *z*). For the mathematician, this approach is so "natural" that it is often referred as the "golden" approach to the problem of motif significance (Nicodème et al., 2002). However, this approach suffers several severe drawbacks that dramatically limits its practical interest: 1) the approach needs sophisticated computer algebra systems to be implemented (rather than simple floating point arithmetic for the previous approaches); 2) the explicit computation of (**<sup>I</sup>** <sup>−</sup> **<sup>P</sup>***<sup>z</sup>* <sup>+</sup> *yz***Q**)−<sup>1</sup> could be very time (and memory) consuming; 3) even if the explicit computation of the inverse matrix is avoided (which is highly advisable), the coefficient extraction using state of the art techniques (like high-order lifting for example) is often slower than the much simpler alternative developed above (see Nuel & Dumas, 2010, for details).

Considering either the power or the recursion approaches we obtain easy to implement algorithms allowing to compute the exact value of *S*(*n*) in all cases except when dealing with high complexity motifs (large *L*) and/or frequent motifs (large *n*). But even if we stick to more tractable cases, exact computations could be slow. The question hence is: is it possible to compute fast and reliable approximations of *S*(*n*) ?

14 Will-be-set-by-IN-TECH

up with a *<sup>O</sup>*(log2 <sup>×</sup> *<sup>n</sup>*<sup>2</sup> <sup>×</sup> *<sup>L</sup>*3) complexity in time where *<sup>L</sup>* is the order of the transition matrix **<sup>T</sup>** <sup>=</sup> **<sup>P</sup>** <sup>+</sup> **<sup>Q</sup>**. The corresponding memory complexity is *<sup>O</sup>*(log2 <sup>×</sup> *<sup>n</sup>* <sup>×</sup> *<sup>L</sup>*2). Since the length of the dataset appears in a logarithmic scale in these complexity, the power approach is obviously suitable for large datasets (ex: = 10<sup>6</sup> or = 109). Unfortunately, the cubic complexity with *L* (quadratic in memory) prevents the approach to deal with complex motifs with high *L*. One should also note that the quadratic complexity in *n* could really be a problem when dealing with frequent motifs and/or large datasets. In order to overcome this problem, Ribeca & Raineri (2008) suggested to use fast Fourier transforms (FFT) to perform all polynomial product hence replacing *<sup>n</sup>*<sup>2</sup> by *<sup>n</sup>* log2 *<sup>n</sup>* in the time complexity. However appealing at first glance, this approach is not recommended in practice since the FFT products in floating-point arithmetics induce numerical instabilities that make totally unreliable the smallest coefficients of the polynomials. And unfortunately, these coefficients are precisely the one needed to study

Another interesting approach called *full recursion*, consists in computing **v***<sup>i</sup>* = (**P** + *y***Q**)*<sup>i</sup>*

the tail distribution of *N*.

**v** for all 0 ≤ *i* ≤ *ℓ* − *d* recursively using the relation **v**<sub>*i*+1</sub> = (**P** + *y***Q**)**v**<sub>*i*</sub>. This approach has two main advantages: 1) it only requires products between polynomials of degree 1 and polynomials of degree *n* (terms of degree greater than *n* are dropped, as in the power approach); 2) it takes full advantage of the sparse structure of the transition matrix **T** = **P** + **Q** (only *L* × |A| non-zero terms in the worst case). The resulting complexity is *O*(*ℓ* × *L* × |A| × *n*) in time and *O*(*L* × *n*) in memory. Since these complexities are linear in *L*, this approach can handle very complex motifs. The drawback is that it can be very slow for large *ℓ* and *n*. A more sophisticated version of this recursion, called *partial recursion* (see Nuel & Dumas, 2010), replaces *ℓ* × *n* by log *ℓ* × *n*<sup>2</sup> in the time complexity. However, the quadratic complexity in *n* and numerical instabilities in floating-point arithmetic restrict its use to small *n* (e.g. *n* ≤ 10).

For completeness, let us point out another approach to the problem. The idea is that we can derive from Expression (12) the following expression:

$$G(y, z) = \sum\_{n \geqslant 0} \sum\_{\ell \geqslant d} \mathbb{P}(N = n)\, y^n z^\ell = \mathbf{u} z^d (\mathbf{I} - z\mathbf{P} - yz\mathbf{Q})^{-1} \mathbf{v} \tag{23}$$

where **I** is the identity matrix and *N* the number of matching positions in *X*<sub>1:ℓ</sub>. It is then possible to obtain **P**(*N* = *n*) for any *ℓ* and *n* using (fast) Taylor expansions of *G*(*y*, *z*). For the mathematician, this approach is so "natural" that it is often referred to as the "golden" approach to the problem of motif significance (Nicodème et al., 2002). However, it suffers from several severe drawbacks that dramatically limit its practical interest: 1) it requires a sophisticated computer algebra system (rather than the simple floating-point arithmetic of the previous approaches); 2) the explicit computation of (**I** − *z***P** − *yz***Q**)<sup>−1</sup> can be very time (and memory) consuming; 3) even if the explicit computation of the inverse matrix is avoided (which is highly advisable), coefficient extraction using state-of-the-art techniques (such as high-order lifting) is often slower than the much simpler alternative developed above (see Nuel & Dumas, 2010, for details).

With either the power or the recursion approach, we obtain easy-to-implement algorithms that compute the exact value of *S*(*n*) in all cases except for high-complexity motifs (large *L*) and/or frequent motifs (large *n*). But even in more tractable cases, exact computations can be slow. The question hence is: is it possible to compute fast and reliable approximations of *S*(*n*)?
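As an illustration of the recursion approach described above, the following sketch propagates a vector of polynomials in *y*, truncated at degree *n*. It assumes that **u** is a row vector, **v** a column vector, and that the coefficient of *y<sup>k</sup>* in **u**(**P** + *y***Q**)<sup>*ℓ*−*d*</sup>**v** is **P**(*N* = *k*). The function name and the dense NumPy arrays are illustrative choices only; a sparse matrix type would be needed to reach the complexity quoted in the text.

```python
import numpy as np
from math import comb

def count_distribution(P, Q, u, v, steps, n_max):
    """Return [P(N = 0), ..., P(N = n_max)] from u (P + yQ)^steps v.

    poly[s, k] stores the coefficient of y^k attached to state s.
    """
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    poly = np.zeros((P.shape[0], n_max + 1))
    poly[:, 0] = v                       # v_0 = v (degree-0 polynomials)
    for _ in range(steps):               # v_{i+1} = (P + yQ) v_i
        new = P @ poly                   # P keeps the degree unchanged
        new[:, 1:] += Q @ poly[:, :-1]   # yQ raises the degree by one; degrees > n_max are dropped
        poly = new
    return np.asarray(u, float) @ poly

# Toy check (hypothetical one-state chain): each step counts an occurrence with probability p,
# so that N follows a Binomial(steps, p) distribution.
p, steps = 0.3, 5
dist = count_distribution([[1 - p]], [[p]], [1.0], [1.0], steps, n_max=5)
exact = [comb(steps, k) * p**k * (1 - p)**(steps - k) for k in range(steps + 1)]
```

In this toy case `dist` should agree with the binomial probabilities in `exact`; choosing a smaller `n_max` simply returns the first coefficients.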


Table 1. Characteristic moments of the number *N* of occurrences of motif M = G(G|C)G in a sequence *X*<sub>1:ℓ</sub> generated by a M0 model with parameters *π*(A) = *π*(T) = 0.10 and *π*(C) = *π*(G) = 0.40. Computation performed using the power approach.

#### **3.6 Near-Gaussian approximations**

Since the random count *N* is basically defined by Eq. (2) as a large sum of Bernoulli variables, approximating the distribution of *N* by a Gaussian distribution sounds appealing. Indeed, Gaussian approximations were historically the first to be suggested for this problem (Cowan, 1991; Kleffe & Borodovski, 1997; Pevzner et al., 1989; Prum et al., 1995). From the theoretical point of view, Central Limit Theorems (CLT) for weakly dependent variables ensure that *N* is asymptotically normally distributed. Table 1 gives the characteristic moments of *N* for motif M = G(G|C)G and various values of the sequence length *ℓ*. In accordance with theory, we observe that the skewness and excess kurtosis both decrease toward 0 when *ℓ* grows (a normal distribution has null skewness and excess kurtosis). But it is also clear that *N* is not normally distributed for small values of *ℓ*. As a consequence, the quality of a Gaussian approximation for *S*(*n*) is expected to be questionable for finite sequence lengths.

In order to overcome this issue, Nuel (2010) suggested using near-Gaussian approximations instead of simple Gaussian approximations for this problem. The idea is simply to perform a higher-order asymptotic expansion that exploits more than the first two moments of *N*. This technique is known as the Edgeworth expansion. Blinnikov & Moessner (1998) give a general (and rather complicated) formula for this expansion. For practical purposes, we present the result only up to order 3.

**Proposition 7** (Edgeworth's expansion)**.** If we denote by *ϕ*(*z*) = exp(−*z*<sup>2</sup>/2)/√(2*π*) the probability density function (pdf) of a standard Gaussian, then for all *n* ≥ 0 we have the following approximation:

$$\mathbb{P}(N=n) \simeq \frac{\varphi(z)}{\sigma} \left( C\_0(z) + \sigma C\_1(z) + \sigma^2 C\_2(z) + \sigma^3 C\_3(z) \right) \tag{24}$$

with

$$C\_{0}(z) = 1 \quad C\_{1}(z) = \frac{S\_{3}}{6} H\_{3}(z) \quad C\_{2}(z) = \frac{S\_{4}}{24} H\_{4}(z) + \frac{S\_{3}^{2}}{72} H\_{6}(z) \tag{25}$$

$$C\_{3}(z) = \frac{S\_{5}}{120}H\_{5}(z) + \frac{S\_{3}S\_{4}}{144}H\_{7}(z) + \frac{S\_{3}^{3}}{1296}H\_{9}(z) \tag{26}$$

where *μ* = **E**[*N*], *σ*<sup>2</sup> = **V**[*N*], *z* = (*n* − *μ*)/*σ*, *S<sub>k</sub>* = *κ<sub>k</sub>*/*σ*<sup>2*k*−2</sup> for all *k* ≥ 1 (*κ<sub>k</sub>* being the cumulant of order *k* of *N*), and where *H<sub>k</sub>*(*z*) are the Hermite polynomials defined recursively by *H*<sub>0</sub>(*z*) = 1 and *H<sub>k</sub>*(*z*) = *zH*<sub>*k*−1</sub>(*z*) − *H*′<sub>*k*−1</sub>(*z*) for all *k* ≥ 1.
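A minimal sketch of Proposition 7, assuming the cumulant ratios *S*<sub>3</sub>, *S*<sub>4</sub>, *S*<sub>5</sub> are already available. The recursion given for *H<sub>k</sub>* is the one satisfied by the probabilists' Hermite polynomials, which NumPy exposes through `numpy.polynomial.hermite_e`. The function names and the Poisson toy example are illustrative assumptions, not part of the original method.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def hermite(k, z):
    """H_k(z) with H_0 = 1 and H_k = z H_{k-1} - H'_{k-1} (probabilists' Hermite polynomials)."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0
    return hermeval(z, coeffs)

def edgeworth_pmf(n, mu, sigma, S, order=3):
    """Approximate P(N = n) with Eq. (24); S[k] = kappa_k / sigma^(2k-2) for k = 3, 4, 5."""
    z = (n - mu) / sigma
    phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    C = [1.0,
         S[3] / 6 * hermite(3, z),
         S[4] / 24 * hermite(4, z) + S[3] ** 2 / 72 * hermite(6, z),
         S[5] / 120 * hermite(5, z) + S[3] * S[4] / 144 * hermite(7, z)
         + S[3] ** 3 / 1296 * hermite(9, z)]
    return phi / sigma * sum(sigma ** j * C[j] for j in range(order + 1))

# Toy check on a Poisson(50) variable, whose cumulants are all equal to 50.
lam = 50.0
sigma = math.sqrt(lam)
S = {k: lam / sigma ** (2 * k - 2) for k in (3, 4, 5)}
exact = math.exp(-lam + 60 * math.log(lam) - math.lgamma(61))   # exact P(N = 60)
err0 = abs(edgeworth_pmf(60, lam, sigma, S, order=0) - exact)   # plain Gaussian
err1 = abs(edgeworth_pmf(60, lam, sigma, S, order=1) - exact)   # with skewness correction
```

In this toy case the order 1 correction already reduces the error of the plain Gaussian approximation at *n* = 60.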


Fig. 6. Reliability of NG approximations (orders 0 to 3) for M = G(G|C)G on a random sequence *X*<sub>1:ℓ</sub> generated by a M0 model with parameters *π*(A) = *π*(T) = 0.10 and *π*(C) = *π*(G) = 0.40, and with *ℓ* = 1200. The error NG<sub>*h*</sub>(*n*) − *S*(*n*) is given in panel (a), and the relative error (log-scale) −log<sub>10</sub> |NG<sub>*h*</sub>(*n*) − *S*(*n*)|/|*S*(*n*)| in panel (b). The horizontal rule indicates the null error in panel (a), and the threshold corresponding to two correct digits in panel (b).

For *h* ∈ {0, 1, 2, 3} we define the Near Gaussian (NG) approximation of order *h* of *S*(*n*) by:

$$\mathrm{NG}\_{h}(n) = \begin{cases} +\log\_{10}\left(\sum\_{k=0}^{n}\frac{1}{\sigma}\varphi\left(\frac{k-\mu}{\sigma}\right)\sum\_{j=0}^{h}\sigma^{j}C\_{j}\left(\frac{k-\mu}{\sigma}\right)\right) & \text{if } n \leqslant \mathbb{E}[N] \\\ -\log\_{10}\left(\sum\_{k=n}^{+\infty}\frac{1}{\sigma}\varphi\left(\frac{k-\mu}{\sigma}\right)\sum\_{j=0}^{h}\sigma^{j}C\_{j}\left(\frac{k-\mu}{\sigma}\right)\right) & \text{if } n > \mathbb{E}[N] \end{cases} . \tag{27}$$
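Equation (27) can be sketched as follows for any approximate probability mass function. The truncation of the infinite sum at `k_max` is an assumption of this sketch, as are the function name and the value of `sigma` in the example, which is made up for illustration. Higher-order expansions can produce negative values, which this sketch does not guard against.

```python
import math

def ng_score(n, mu, pmf, k_max):
    """NG_h(n) of Eq. (27) for any approximate pmf; the infinite sum is cut at k_max."""
    if n <= mu:
        return math.log10(sum(pmf(k) for k in range(0, n + 1)))
    return -math.log10(sum(pmf(k) for k in range(n, k_max + 1)))

# Order 0 example: plain Gaussian approximation with the mean quoted in the text (mu = 153.3).
# sigma = 11.0 is a made-up value used only for illustration.
mu, sigma = 153.3, 11.0
gauss = lambda k: math.exp(-((k - mu) / sigma) ** 2 / 2) / (sigma * math.sqrt(2 * math.pi))
low = ng_score(120, mu, gauss, k_max=1200)    # under-represented count: negative score
high = ng_score(190, mu, gauss, k_max=1200)   # over-represented count: positive score
```

The sign convention follows the definition: counts below the expectation receive a negative score and counts above it a positive one.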

Figure 6 shows the reliability of the NG approximations. In solid black, the order 0 approximation corresponds to the classical Gaussian approximation. Unsurprisingly, this central limit approximation is accurate in the center of the distribution (*n* close to the expectation *μ* = 153.3), but its reliability quickly decreases when |*n* − *μ*| increases.

Central limit theorems (CLT) established long ago that *N* is asymptotically Gaussian distributed. The problem with CLTs, however, is that the quality of the resulting approximation decreases dramatically for finite sequence lengths when considering tail distribution events. Here we try to overcome this issue by considering Near-Gaussian approximations that exploit higher moments of *N* to improve the quality of the approximations. To do so, a critical first step is to obtain the first *k* moments of *N*. Of course, we can access these moments by computing the full distribution of *N*, but if that were feasible there would be no need for approximations. We hence need a method to compute the moments of *N* whose complexity is significantly smaller than that of the complete exact computations. With higher-order approximations, we see a dramatic improvement in the reliability of the results, with a noticeable increase of the region where at least two digits are correct (up to *n* ∈ [80; 240] for NG<sub>3</sub>).


Fig. 7. Reliability of CB and BR approximations for M = G(G|C)G on a random sequence *X*<sub>1:ℓ</sub> generated by a M0 model with parameters *π*(A) = *π*(T) = 0.10 and *π*(C) = *π*(G) = 0.40, and with *ℓ* = 1200. The error CB(*n*) − *S*(*n*) or BR(*n*) − *S*(*n*) is given in panel (a), and the relative error (log-scale) −log<sub>10</sub> |CB(*n*) − *S*(*n*)|/|*S*(*n*)| or −log<sub>10</sub> |BR(*n*) − *S*(*n*)|/|*S*(*n*)| in panel (b). The horizontal rule indicates the null error in panel (a), and the threshold corresponding to two correct digits in panel (b).

From the computational point of view, the order *h* approximation requires the cumulants of *N* up to order *h* + 2. Using the power approach, the resulting complexity is hence *O*(log<sub>2</sub> *ℓ* × *h*<sup>2</sup> × *L*<sup>3</sup>) in time and *O*(log<sub>2</sub> *ℓ* × (*h* + 2) × *L*<sup>2</sup>) in memory. Using the recursion, the resulting complexity is *O*(*ℓ* × *L* × |A| × *h*) in time and *O*(*L* × *h*) in memory. In both cases, the computational time drops significantly compared with the exact computations.

Thanks to NG approximations, we hence have a fast and reliable way to compute an approximation of *S*(*n*) when *n* falls in the center of the distribution (e.g. |*S*(*n*)| ≤ 3.0). Unfortunately, NG approximations remain totally unreliable for tail distribution events (e.g. |*S*(*n*)| > 3.0), which are often precisely the events of interest. Fortunately, we have a solution to this problem.

#### **3.7 Bahadur-Rao**

We want here to study specifically the tail distribution of *N*, with events of the form **P**(*N* ≥ *n*) with large *n* (or **P**(*N* ≤ *n*) with small *n*). For all *t* > 0, let us first notice that we can use the Markov inequality to write: **P**(*N* ≥ *n*) = **P**(*e<sup>tN</sup>* ≥ *e<sup>tn</sup>*) ≤ **E**[*e<sup>tN</sup>*]/*e<sup>tn</sup>* = exp(Λ(*t*) − *tn*), where Λ(*t*) = log **E**[*e<sup>tN</sup>*]. By taking the smallest of these bounds for *t* > 0 we hence get: log **P**(*N* ≥ *n*) ≤ Λ(*τ*) − *τn* with *τ* defined by Λ′(*τ*) = *n*. This upper bound, known as the Chernoff bound (CB), is often surprisingly sharp for events located in the tail distribution. By dealing symmetrically with **P**(*N* ≤ *n*) and *t* < 0, we obtain the following approximation for *S*(*n*):

$$\text{CB}(n) = \delta\_n \frac{\tau n - \Lambda(\tau)}{\log(10)} \tag{28}$$

where *δ<sub>n</sub>* = −1 if *n* ≤ **E**[*N*], and *δ<sub>n</sub>* = +1 if *n* > **E**[*N*].
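To make Eq. (28) concrete, here is a small sketch on a toy model where everything is available in closed form. The binomial variable used below is an assumption chosen for illustration (it is not the motif count of the text), and the function name is hypothetical.

```python
import math

def chernoff_score(n, mean, Lambda, tau):
    """CB(n) of Eq. (28); tau must solve Lambda'(tau) = n."""
    delta = 1.0 if n > mean else -1.0
    return delta * (tau * n - Lambda(tau)) / math.log(10)

# Toy model (assumption): N ~ Binomial(m, p), for which
# Lambda(t) = m log(1 - p + p e^t) and tau has a closed form.
m, p, n = 100, 0.1, 25
Lambda = lambda t: m * math.log(1 - p + p * math.exp(t))
tau = math.log(n * (1 - p) / (p * (m - n)))
cb = chernoff_score(n, m * p, Lambda, tau)

# Exact score -log10 P(N >= n) for comparison
tail = sum(math.comb(m, k) * p**k * (1 - p)**(m - k) for k in range(n, m + 1))
exact = -math.log10(tail)
```

Since the Chernoff bound is an upper bound on the tail probability, `cb` can never exceed `exact` for an over-represented count.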


From the computational point of view, the solution *τ* of Λ′(*τ*) = *n* can easily be determined numerically using, for example, the Newton-Raphson iteration (Press et al., 1992). Starting from a first guess *t*<sub>0</sub> (e.g. *t*<sub>0</sub> = 0), one performs *t*<sub>*i*+1</sub> = *t<sub>i</sub>* + (*n* − Λ′(*t<sub>i</sub>*))/Λ″(*t<sub>i</sub>*) for *i* ≥ 0 until convergence to *τ*. The computation of Λ, Λ′, and Λ″ is possible thanks to Lemma 5 and the following formulas:

$$
\Lambda(t) = \log G(e^t) \quad \Lambda'(t) = \frac{e^t G'(e^t)}{G(e^t)} \quad \Lambda''(t) = \frac{e^{2t} G''(e^t)}{G(e^t)} - \frac{e^{2t} G'(e^t)^2}{G(e^t)^2} + \frac{e^t G'(e^t)}{G(e^t)} \tag{29}
$$

with *G*(*e<sup>t</sup>*) = [*z*<sup>0</sup>]**u**(**P** + *e<sup>t</sup>***Q** + *z***Q**)<sup>*ℓ*−*d*</sup>**v** = **u**(**P** + *e<sup>t</sup>***Q**)<sup>*ℓ*−*d*</sup>**v**, *G*′(*e<sup>t</sup>*) = [*z*<sup>1</sup>]**u**(**P** + *e<sup>t</sup>***Q** + *z***Q**)<sup>*ℓ*−*d*</sup>**v**, and *G*″(*e<sup>t</sup>*) = 2[*z*<sup>2</sup>]**u**(**P** + *e<sup>t</sup>***Q** + *z***Q**)<sup>*ℓ*−*d*</sup>**v**, where [*z<sup>k</sup>*] denotes the coefficient of *z<sup>k</sup>*.
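The two steps above (evaluating Λ, Λ′, Λ″ through the coefficients of *z*<sup>0</sup>, *z*<sup>1</sup>, *z*<sup>2</sup>, then iterating Newton-Raphson) can be sketched as follows. This is only an illustration: the function names are hypothetical, the coefficients are obtained by a simple recursion truncated at degree 2 rather than by the power approach, and no safeguard against divergence of the Newton iteration is included.

```python
import math
import numpy as np

def cgf_derivatives(P, Q, u, v, steps, t):
    """Return Lambda(t), Lambda'(t), Lambda''(t) via Eq. (29), with steps = l - d."""
    y = math.exp(t)
    A = P + y * Q
    c0, c1, c2 = v.copy(), np.zeros_like(v), np.zeros_like(v)
    for _ in range(steps):
        # coefficients of z^0, z^1, z^2 in (A + zQ)(c0 + z c1 + z^2 c2)
        c0, c1, c2 = A @ c0, A @ c1 + Q @ c0, A @ c2 + Q @ c1
    G, G1, G2 = u @ c0, u @ c1, 2.0 * (u @ c2)
    r = y * G1 / G
    return math.log(G), r, y * y * G2 / G - r * r + r

def solve_tau(n, derivs, t0=0.0, tol=1e-12, max_iter=100):
    """Newton-Raphson iteration t <- t + (n - Lambda'(t)) / Lambda''(t)."""
    t = t0
    for _ in range(max_iter):
        _, d1, d2 = derivs(t)
        step = (n - d1) / d2
        t += step
        if abs(step) < tol:
            break
    return t

# Toy one-state chain (hypothetical): N ~ Binomial(20, 0.3)
P, Q = np.array([[0.7]]), np.array([[0.3]])
u = v = np.array([1.0])
tau = solve_tau(12, lambda t: cgf_derivatives(P, Q, u, v, 20, t))
```

For this toy chain the solution is known in closed form, *e<sup>τ</sup>* = 3.5, which gives a convenient check of both functions.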

Moreover, this bound can be further refined using the Bahadur-Rao Theorem (Bahadur & Rao, 1960) and gives the following approximation for *S*(*n*):

$$\text{BR}(n) = \text{CB}(n) + \delta\_n \log\_{10} \left( (1 - e^{-|\tau|}) \sqrt{2\pi \Lambda''(\tau)} \right). \tag{30}$$
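The refinement of Eq. (30) can be checked on the same kind of toy model. The binomial variable below is again an assumption made for illustration, with Λ and Λ″ written in closed form; the function name is hypothetical.

```python
import math

def bahadur_rao_score(cb, n, mean, tau, lambda2):
    """BR(n) of Eq. (30) from CB(n), tau and lambda2 = Lambda''(tau)."""
    delta = 1.0 if n > mean else -1.0
    correction = (1.0 - math.exp(-abs(tau))) * math.sqrt(2.0 * math.pi * lambda2)
    return cb + delta * math.log10(correction)

# Toy model (assumption): N ~ Binomial(m, p)
m, p, n = 100, 0.1, 25
tau = math.log(n * (1 - p) / (p * (m - n)))         # solves Lambda'(tau) = n
w = 1 - p + p * math.exp(tau)
cb = (tau * n - m * math.log(w)) / math.log(10)     # Eq. (28) with delta_n = +1 since n > m p
lambda2 = m * p * (1 - p) * math.exp(tau) / w ** 2  # Lambda''(tau)
br = bahadur_rao_score(cb, n, m * p, tau, lambda2)

# Exact score -log10 P(N >= n) for comparison
exact = -math.log10(sum(math.comb(m, k) * p**k * (1 - p)**(m - k) for k in range(n, m + 1)))
```

On this example the corrected score `br` lies much closer to `exact` than the plain Chernoff score `cb`.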

From the computational point of view, CB(*n*) and BR(*n*) can be computed either with the power approach, with complexities *O*(log<sub>2</sub> *ℓ* × *L*<sup>3</sup>) in time and *O*(log<sub>2</sub> *ℓ* × *L*<sup>3</sup>) in memory, or with the recursion approach, with complexities *O*(*ℓ* × *L* × |A|) in time and *O*(*L* × |A|) in memory.

Figure 7 shows the reliability of the approximations CB(*n*) and BR(*n*). Unsurprisingly, the farther from the center of the distribution, the better both approximations are. We also observe that BR(*n*) is a dramatic improvement over CB(*n*), since it obtains at least two correct digits of *S*(*n*) for all *n* except on [120, 200]. At the end of the previous section, we saw that the order 3 NG approximation achieves the same precision on the region [80; 240]. Hence, by combining NG<sub>3</sub>(*n*) (for the center of the distribution) and BR(*n*) (for the tails), one can achieve at least two correct digits of *S*(*n*) over the whole range of the distribution for a modest computational cost.

#### **4. Discussion**

Obtaining the distribution of motif counts in random sequences is a very challenging problem that has attracted considerable attention from mathematicians and computer scientists over the last fifty years. Recently, however, a significant advance has been made by connecting the well-known theory of pattern matching and automata to the Markov chain embedding technique (Lladser, 2007; Nuel, 2008a; Nuel & Prum, 2007). Thanks to this finding, it is now possible to deal with simple motifs (runs of 1 in binary sequences, single words, etc.) or complex motifs (PROSITE signatures, gapped motifs, etc.) using the same general framework.

Using exact approaches, it is possible to obtain efficiently the first moments of any motif count *N*, and even the complete distribution of *N*. As a consequence, the computation of *S*(*n*) is now tractable for a wide range of motif problems including large datasets or complex motifs. However, the case of complex frequent motifs in large datasets remains an open problem (Nuel & Dumas, 2010).

As an alternative to exact computations, a wide range of approximations have been developed (see Lothaire, 2005; Nuel, 2006b; Reignier, 2000, for a review). We can basically classify these approximations in three categories: 1) Gaussian approximations (Cowan, 1991; Kleffe & Borodovski, 1997; Nuel, 2010; Pevzner et al., 1989; Prum et al., 1995); 2) Poisson approximations (Erhardsson, 2000; Geske et al., 1995; Godbole, 1991; Reinert & Schbath, 1999; Roquain & Schbath, 2007); 3) large deviations approximations (Denise et al., 2001; Nuel, 2004).


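Fig. 8. Relative error in log-scale for various approximations of *S*(*n*) (*n* = 0, . . . , 200) in a sequence *X*<sub>1:ℓ</sub> generated by a M0 model with parameters *π*(A) = *π*(T) = 0.10 and *π*(C) = *π*(G) = 0.40. Panel (a): G(G|C)G, frequent motif; panel (b): A(A|T)A, rare motif. Both panels plot −log<sub>10</sub>(relative error) against *S*(*n*) for the Gaussian, Near-Gaussian, Poisson, Cramer and Bahadur-Rao approximations.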

In this chapter we deliberately left aside the Poisson-based approximations and considered only two of these categories: the (Near-)Gaussian approximations with NG<sub>*h*</sub>(*n*), and the large deviations based approximations with CB(*n*) and BR(*n*). The reason is basically practical: Poisson-based approximations cannot be directly derived from the formalism of this chapter and require the introduction of many tedious notions such as clumps, overlapping words and so on. However, we compare here the performance of all these approximations (including compound Poisson approximations) in the case where *X*<sub>1:ℓ</sub> is an i.i.d. DNA sequence generated by a M0 model with parameters *π*(A) = *π*(T) = 0.10 and *π*(C) = *π*(G) = 0.40, for two motifs: the frequent G(G|C)G and the rare A(A|T)A.

Figure 8 shows the relative error (in log-scale) for all approximations. For Gaussian approximations, performance is good only in the very center of the distribution (for *n* very close to **E**[*N*]) for the frequent motif G(G|C)G, and poor almost everywhere for the rare motif T(A|T)T. This observation is consistent with the well-known claim that Gaussian approximations are more suitable for frequent motifs (Lothaire, 2005). It has to be pointed out, however, that even in the most favorable case (a highly frequent motif), Gaussian approximations totally fail to capture the tail distribution of *N* and are hence not suitable for the highly significant observations we usually encounter in biological sequences (Nuel, 2006b). If we now consider the near-Gaussian approximation, taking into account more moments of *N* dramatically improves the results for both motifs, but the failure to deal with extreme tail events remains.

Compound Poisson approximations are known to be extremely sensitive to the relative abundance of the motif of interest in the sequence, being more accurate for rare motifs (Lothaire, 2005; Roquain & Schbath, 2007). It is hence not a surprise to see that Poisson approximations are totally unreliable for the frequent motif G(G|C)G. For the rare motif T(A|T)T we naturally obtain much better results but like for Gaussian approximations, and even in this favorable case, reliability decreases in the tail distribution. Considering that Poisson

in Biological Sequences 21

Significance Score of Motifs in Biological Sequences 193

Cornish-Bowden (1985). IUPAC-IUB symbols for nucleotide nomenclature, *Nucl. Acids Res.*

Cowan (1991). Expected frequencies of dna patterns using whittle's formula, *J. Appl. Prob.*

Crochemore, M. & Stefanov, V. (2003). Waiting time and complexity for matching patterns

Denise, A., RA©gnier, M. & Vandenbogaert, M. (2001). Assessing the statistical significance of overrepresented oligonucleotides, *Lecture Notes in Computer Science* 2149: 85–97. El Karoui, M., Biaudet, V., Schbath, S. & Gruss, A. (1999). Characteristics of chi distribution

Erhardsson, T. (2000). Compound Poisson approximation for counts of rare patterns in

Fatemi, M., Pao, M., Jeong, S., Gal-Yam, E., Egger, G., Weisenberger, D. & Jones, P. (2005).

Fu, J. C. (1996). Distribution theory of runs and patterns associated with a sequence of

Geske, M. X., Godbole, A. P., Schaffner, A. A., Skrolnick, A. M. & Wallstrom, G. L. (1995).

Godbole, A. P. (1991). Poissons approximations for runs and patterns of rare events, *Adv. Appl.*

Green, T. J., Gupta, A., Miklau, G., Onizuka, M. & Suciu, D. (2004). Processing xml

Hampson, S., Kibler, D. & Baldi, P. (2002). Distribution patterns of over-represented k-mers in

Hopcroft, J. E., Motwani, R. & Ullman, J. D. (2001). *Introduction the automata theory, languages,*

Karlin, S., Burge, C. & Campbell, A. (1992). Statistical analyses of counts and distributions of restriction sites in DNA sequences, *Nucl. Acids. Res.* 20(6): 1363–1370. Kleffe, J. & Borodovski, M. (1997). First and second moment of counts of words in random

Leonardo Marino-Ramírez, John L. Spouge, G. C. K. & Landsman, D. (2004). Statistical

Liddle, A. R. (2007). Information criteria for astrophysical model selection, *Monthly Notices of*

Lladser, M. E. (2007). Mininal markov chain embeddings of pattern problems, *Information*

Lothaire, M. (ed.) (2005). *Applied Combinatorics on Words*, Cambridge University Press,

Nicodème, P., Salvy, B. & Flajolet, P. (2002). Motif statistics, *Theoretical Com. Sci.* 287(2): 593–617. Nuel, G. (2004). Ld-spatt: Large deviations statistics for patterns on markov chains, *J. Comp.*

analysis of over-represented words in human promoter sequences, *Nuc. Acids Res.*

Markov chains and extreme sojourns in birth-death chains, *Ann. Appl. Probab.*

Footprinting of mammalian promoters: use of a cpg dna methyltransferase revealing nucleosome positions at a single molecule level, *Nucleic Acids Res* 33(20): 176. Frith, M. C., Spouge, J. L., Hansen, U. & Weng, Z. (2002). Statistical significance of clusters

of motifs represented by position specific scoring matrices in nucleotide sequences,

Compound poisson approximations for word patterns under markovian hypotheses,

streams with deterministic automata and stream indexes, *ACM Trans. Database Syst.*

with automata, *Info. Proc. Letters* 87(3): 119–125.

on different bacterial genomes, *Res. Microbiol.* 150: 579–587.

13: 3021–3030.

10(2): 573–591.

*Prob.* 23.

29: 752–788.

32(3): 949–958.

Cambridge.

*Biol.* 11(6): 1023–1033.

*Nucl. Acids. Res.* 30(14): 3214–3224.

*J. Appl. Probab.* 32: 877–892.

multi-state trials, *Statistica Sinica* 6(4): 957–974.

non-coding yeast DNA, *Bioinformatics* 18(4): 513–528.

texts generated by markov chains, *Bioinformatics* 8(5): 433–441.

*and computation, 2d ed.*, ACM Press, New York.

*the Royal Astronomical Society: Letters* 377: 74–78.

*Theory and Applications Workshop*, pp. 251–255.

28: 886–892.

approximations are not easily generalizable to motifs defined by regular expressions, that their computations could be complicated and time consuming, and that their reliability is highly questionable in some configurations, it seems advisable to avoid their use is most cases.

With large deviations based approximations, we unsurprisingly get a low reliability in the center of the distribution, but a high reliability in the tail distribution. With Bahadur-Rao precise approximations, the improvement over the classical Chernoff's bound is quite impressive, and the complementarity with Near-Gaussian approximations clearly shows that a combination of both approaches could be a very efficient way to obtain reliable approximations of *S*(*n*) for all *n*.

In this chapter we gave all the necessary ingredients to assess the significance score of a motif in a biological sequence using state-of-the-art results, including several unpublished ones: Lemma 5, which is an extension of the results of Nuel (2010), and the complete "Bahadur-Rao" Section, which provides interesting improvements over previous large deviations work (Denise et al., 2001; Nuel, 2004).

Let us finally point out that, for the sake of compactness, we have left aside some interesting questions and extensions such as approximate matching (Hopcroft et al., 2001), renewal occurrences (Nuel, 2006b; Roquain & Schbath, 2007), joint distributions (Nuel, 2008b; Stefanov & Szpankowski, 2007), datasets with many sequences (Nuel et al., 2010), and sensitivity to parameter estimation (Nuel, 2006c). Even if some results are already available for these problems, many questions still have to be answered in the exciting and challenging field of the distribution of motifs in random sequences.

#### **5. References**



Allauzen, C. & Mohri, M. (2006). A unified construction of the Glushkov, follow, and Antimirov automata, *in* R. Královic & P. Urzyczyn (eds), *Mathematical Foundations of Computer Science 2006*, Vol. 4162 of *Lecture Notes in Computer Science*, Springer Berlin / Heidelberg, pp. 110–121.

Antzoulakos, D. L. (2001). Waiting times for patterns in a sequence of multistate trials, *J. Appl. Prob.* 38: 508–518.

Bahadur, R. R. & Rao, R. R. (1960). On deviations of the sample mean, *The Annals of Math. Statistics.* 31(4): 1015–1027.

Beaudoing, E., Freier, S., Wyatt, J., Claverie, J.-M. & Gautheret, D. (2000). Patterns of variant polyadenylation signal usage in human genes, *Genome Res.* 10(7): 1001–1010.

Blinnikov, S. & Moessner, R. (1998). Expansions for nearly Gaussian distributions, *Astron. Astrophys. Suppl. Ser.* 130: 193–205.

Boeva, V., Clément, J., Régnier, M. & Vandenbogaert, M. (2005). Assessing the significance of sets of words, *Combinatorial Pattern Matching 05, Lecture Notes in Computer Science, vol. 3537*, Springer-Verlag.

Brazma, A., Jonassen, I., Vilo, J. & Ukkonen, E. (1998). Predicting gene regulatory elements in silico on a genomic scale, *Genome Res.* 8(11): 1202–1215.

Bryne, J., Valen, E., Tang, M., Marstrand, T., Winther, O., da Piedade, I., Krogh, A., Lenhard, B. & Sandelin, A. (2008). JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update, *Nucleic Acids Res.* 36: 102–106.

Chan, H. P., Zhang, N. R. & Chen, L. H. Y. (2010). Importance sampling of word patterns in DNA and protein sequences, *J. of Comput. Biol.* 17(12): 1697–1709.

Chang, Y.-M. (2005). Distribution of waiting time until the rth occurrence of a compound pattern, *Statistics and Probability Letters* 75(1): 29–38.


Nuel, G. (2006a). Effective p-value computations using Finite Markov Chain Imbedding (FMCI): application to local score and to pattern statistics, *Algorithms for Molecular Biology* 1(1): 5.

Nuel, G. (2006b). Numerical solutions for patterns statistics on Markov chains, *Stat. App. in Genet. and Mol. Biol.* 5(1): 26.

Nuel, G. (2006c). Pattern statistics on Markov chains and sensitivity to parameter estimation, *Algorithms for Molecular Biology* 1(1): 17.

Nuel, G. (2008a). Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata, *J. of Applied Prob.* 45(1): 226–243.

Nuel, G. (2008b). Waiting time distribution for pattern occurrence in a constrained sequence: an embedding Markov chain approach, *Discrete Mathematics and Theoretical Computer Science* 10: 3.

Nuel, G. (2010). On the first k moments of the random count of a pattern in a multi-states sequence generated by a Markov source, *Journal of Applied Probability* 47: 1–19.

Nuel, G. & Dumas, J.-G. (2010). Sparse approaches for the exact distribution of patterns in long multi-states sequences generated by a Markov source, *submitted to J. Applied. Prob.* arXiv:1006.3246v1.

Nuel, G. & Prum, B. (2007). Analyse statistique des séquences biologiques: modélisation markovienne, alignements et motifs, *Hermes editions, Paris*.

Nuel, G., Regad, L., Martin, J. & Camproux, A.-C. (2010). Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data, *Algorithms for Molecular Biology* 5: 15.

Pevzner, P., Borodovski, M. & Mironov, A. (1989). Linguistic of nucleotide sequences: The significance of deviation from mean statistical characteristics and prediction of frequencies of occurrence of words, *J. Biomol. Struct. Dyn.* 6: 1013–1026.

Press, W. H., Teukolsky, S. A., Vetterling, W. T. & Flannery, B. P. (1992). *Numerical Recipes in C*, Cambridge University Press.

Prum, B., Rodolphe, F. & de Turckheim, E. (1995). Finding words with unexpected frequencies in DNA sequences, *J. R. Statist. Soc. B* 11: 190–192.

Reignier, M. (2000). A unified approach to word occurrences probabilities, *Discrete Applied Mathematics* 104(1): 259–280.

Reinert, G. & Schbath, S. (1999). Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains, *J. of Comp. Biol.* 5: 223–254.

Ribeca, P. & Raineri, E. (2008). Faster exact Markovian probability functions for motif occurrences: a DFA-only approach, *Bioinformatics* 24(24): 2839–2848.

Roberts, R., Vincze, T., Posfai, J. & Macelis, D. (2010). REBASE – a database for DNA restriction and modification: enzymes, genes and genomes, *Nucl. Acids Res.* 38: 234–236.

Roquain, E. & Schbath, S. (2007). Improved compound Poisson approximation for the number of occurrences of any rare word family in a stationary Markov chain, *Adv. in Appl. Probab.* 39(1): 128–140.

Sigrist, C., Cerutti, L., de Castro, E., Langendijk-Genevaux, P., Bulliard, V., Bairoch, A. & Hulo, N. (2010). PROSITE, a protein domain database for functional characterization and annotation, *Nucleic Acids Res.* 38.

Stefanov, V. T. & Szpankowski, W. (2007). Waiting Time Distributions for Pattern Occurrence in a Constrained Sequence, *Discrete Mathematics and Theoretical Computer Science* 9(1): 305–320.

van Helden, J., André, B. & Collado-Vides, J. (1998). Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies, *J. Mol. Biol.* 281(5): 827–842.

**10**

### **A Systematic and Thorough Search for Domains of the Scavenger Receptor Cysteine-Rich Group-B Family in the Human Genome**

Alexandre M. Carmo1 and Vattipally B. Sreenu2

*1IBMC – Instituto de Biologia Molecular e Celular and ICBAS – Instituto de Ciências Biomédicas de Abel Salazar, Universidade do Porto, Portugal*

*2The Weatherall Institute of Molecular Medicine, University of Oxford, UK*

#### **1. Introduction**

The biological function of proteins is largely determined by their individual component domains, which are self-contained, spatially arranged segments within the protein sequence. These can be catalytic or structural, and define a number of different features of proteins, such as their enzymatic activity, their interactions with other proteins, sugars or lipids, and the cellular localization of the proteins that contain them. A number of intracellular three-dimensionally-arranged domains, such as Src-homology (SH) or Pleckstrin-homology (PH) domains, define the nature of protein interactions with other components of the cell, and enable proteins to interact with their substrates or binding partners. The specificity of interaction conferred by a domain is unique to its protein. Similarly, the extracellular part of most membrane-bound or secreted proteins of eukaryotic cells is also organized in semi-autonomously-arranged blocks that potentially confer multiple diverse functions to a particular protein. These domains have been classified and grouped into protein superfamilies depending on the similarity they have with domains of prototypical proteins, for example immunoglobulin, fibronectin or C-type lectin domains. Members of these groups are believed to be homologous and to have arisen by divergent evolution from a common ancestor. Many membrane-bound or extracellular proteins are composed of several domains of the same type, but it is not uncommon to find mosaic proteins containing domains from different superfamilies.

The scavenger receptor cysteine-rich (SRCR) superfamily comprises a group of proteins that contain one or multiple domains structurally similar to the membrane distal domain of the type I scavenger receptor expressed by human macrophages (Freeman et al., 1990). Proteins classified as belonging to this superfamily may contain other types of domains in addition to the dominant SRCR modules, such as EGF, CUB, LCCL or other domains. In mammals, SRCR proteins are typically expressed in cells of the immune system (Resnick et al., 1994), although some members can also be expressed in non-immune cells and organs, including liver, kidney, placenta, stomach, brain and heart (Sarrias et al., 2004). Group A domain-containing SRCR proteins are present in phyla from the most primitive metazoan to vertebrates, whereas group B domain-containing SRCR proteins are only found in vertebrates. Intriguingly, although SRCR proteins can include other domains, no proteins have been reported to contain group A and B domains simultaneously.

In mammalian species, SRCR group B orthologs are usually very well conserved, and for some of the proteins a high level of conservation extends to birds and fish. However, in some cases a human SRCR protein apparently has no corresponding ortholog in some mammals and, conversely, there are examples of SRCR group B proteins that are well characterized in a few mammalian species but have not been described in humans. By analyzing the human genome, we can now identify all the remaining, still undescribed genes encoding SRCR group B domains, which will allow us to perform phylogenetic analysis of the complete set of group B domains. By comprehensive and systematic whole-genome analysis we have found two new putative transcriptional units containing clusters of potential SRCR domains, and additionally a further putative gene that contains a single domain. After our thorough search, we are now confident that all proteins containing group B SRCR domains in the human genome have been identified.

#### **2. The scavenger receptor cysteine-rich group B family**

#### **2.1 Biological function of SRCR group B proteins**

The cell surface antigens CD5 and CD6, which function in T lymphocytes, are probably the best characterized members of the family, each containing three extracellular SRCR domains (Aruffo et al., 1991; Jones et al., 1986). CD5 and CD6 co-associate with each other at the surface of T cells (Castro et al., 2003; Gimferrer et al., 2003), and are involved in the regulation of T cell receptor-mediated activation. The extensive characterization of the interaction of CD6 with its ligand CD166, expressed by antigen presenting cells (Aruffo et al., 1997), and the identification of different binding partners for CD5 (Biancone et al., 1996; Calvo et al., 1999; Pospisil et al., 2000; Van de Velde et al., 1991), had initially suggested that SRCR group B domains participate in intercellular contacts *via* protein-protein interactions. In addition, the soluble protein Sp (Gebe et al., 1997), which contains three SRCR domains, has been reported to bind to cells of myeloid and lymphoid origin. Also known as AIM (apoptosis inhibitor expressed by macrophages), API6 (apoptosis inhibitor 6) or CD5L (CD5-like molecule), Sp is best known for promoting macrophage survival. Therefore, this sub-group of small SRCR-containing proteins may be described as having a role in cellular communication, differentiation and activation. However, for most of the remaining members of the family no such clear function has been established. In particular, the lack of cellular ligands for most of these proteins raises the possibility that a totally different function for SRCR domains may exist, if indeed SRCR domain proteins share any common function.

In addition to CD5, CD6 and Sp, the group B SRCR family presently contains five other proteins, of which two, CD163 and M160, are membrane bound and expressed by macrophages. CD163 (Law et al., 1993) and M160 (CD163L1) (Gronlund et al., 2000), which were both identified in human monocytes, are considered a subgroup of the SRCR group B molecules. No definitive function has been established for these molecules, although CD163 has been described as binding to, and internalizing, tumor necrosis factor-like weak inducer of apoptosis (TWEAK), thus having a potential role in atherosclerosis (Moreno et al., 2009). Additionally, CD163 has a detoxifying role in iron metabolism: by binding to hemoglobin-complexed haptoglobin, it is able to remove hemoglobin from the plasma (Graversen et al., 2002; Kristiansen et al., 2001). The remaining three members of the group B SRCR family are secreted glycoproteins of different sizes and structural complexity. DMBT1, which was identified on the basis of its deletion in a medulloblastoma cell line, is the largest member of the family, comprising 14 SRCR domains separated by SRCR-interspacing domains (Mollenhauer et al., 1997). Apart from being secreted, DMBT1 is also found in association with the plasma membrane of macrophages, although it is not clear whether there is a specific receptor or whether the poorly characterized DMBT1 gene may encode a transmembrane sequence. Once in the membrane, DMBT1 is a ligand for Surfactant protein D (SP-D), a C-type lectin that binds to exposed carbohydrates (Holmskov et al., 1999). The SRCR soluble proteins S4D-SRCRB and SSc5D have four and five group B domains, respectively, and little is known of their functional or binding properties (Gonçalves et al., 2009; Padilla et al., 2002).

However, it has been recently suggested that Sp (Sarrias et al., 2005), DMBT1 (Bikker et al., 2002), CD163 (Fabriek et al., 2009), CD5 (Vera et al., 2009) and CD6 (Sarrias et al., 2007) are capable of detecting microbe-associated molecular patterns, and could bind and clear bacteria or fungi, reaffirming a scavenger-like role for this group of molecules. These developments notwithstanding, SRCR superfamily proteins may prove to have very diverse functions, to the extent that the structural properties of the highly conserved SRCR domains may be the only unifying feature of the family.

#### **2.2 Structure and organization of SRCR domains**

Typically, the 100-110 amino acid-long SRCR domains possess a characteristic pattern of cysteine residues that establish intra-domain disulfide bridges and contribute to the overall architecture of the compact domain. The number of cysteine residues and their distribution, together with the organization of the genomic sequence encoding each domain, divide the SRCR family into two groups, A and B. Group A domains are encoded by split exons, and typically have six cysteine residues establishing three disulfide bonds. Group B domains, on the other hand, are encoded by a single exon and have eight cysteine residues, whose distribution is remarkably conserved in nearly all known domains (Fig. 1).
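The cysteine-count criterion described above can be illustrated with a short sketch. This is a simplified heuristic written for illustration, not the procedure used in this chapter: the function names and the toy sequences are hypothetical, and a real classification would also have to consider the spacing of the cysteines and the exon organization of the encoding gene, since the text only says that domains "typically" have six or eight cysteines.

```python
def count_cysteines(domain_seq: str) -> int:
    """Count cysteine (C) residues in a one-letter amino acid sequence."""
    return domain_seq.upper().count("C")


def guess_srcr_group(domain_seq: str) -> str:
    """Heuristic guess of the SRCR group from the cysteine count alone.

    Assumption for illustration: 6 cysteines suggests group A,
    8 cysteines suggests group B, anything else is left unclassified.
    """
    n_cys = count_cysteines(domain_seq)
    if n_cys == 6:
        return "A"
    if n_cys == 8:
        return "B"
    return "unclassified"


if __name__ == "__main__":
    # Synthetic toy sequences (NOT real SRCR domains): a cysteine-free
    # filler followed by one cysteine, repeated 6 or 8 times.
    filler = "GSTAVLKDER"
    toy_six_cys = (filler + "C") * 6
    toy_eight_cys = (filler + "C") * 8
    for name, seq in [("toy_six_cys", toy_six_cys),
                      ("toy_eight_cys", toy_eight_cys)]:
        print(name, count_cysteines(seq), guess_srcr_group(seq))
```

Counting residues alone cannot distinguish a genuine group B domain from an unrelated cysteine-rich sequence, so a check of this kind is only useful as a sanity filter on candidates that were already found by sequence similarity.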

So far, eight human SRCR group B proteins have been described (Sp, CD5, CD6, S4D, SSc5D, CD163, M160 and DMBT1), which contain between three and fourteen SRCR domains. Their encoding genes are dispersed throughout the genome; however, a few highly similar pairs such as CD5-CD6 and CD163-M160 are located on the same chromosome. The identity between individual domains of different SRCR group B proteins varies from 20 to 80%, and phylogenetic analysis suggests that they have evolved by sequential intragene duplication, although there are examples that suggest they may have evolved in some cases by interprotein domain shuffling. Only four SRCR domains have been characterized by X-ray crystallography, and of these, three are group A SRCR domains: those of hepsin, a cell surface serine protease involved in cell growth and maintenance of cellular morphology (Somoza et al., 2003), M2bp, a tumor associated antigen and matrix protein (Hohenester et al., 1999), and MARCO, a trimeric SRCR group A protein expressed by macrophages and dendritic cells that recognizes polyanionic particles and pathogens (Ojala et al., 2007). The crystal structure of the membrane proximal domain of CD5 (Rodamilans et al., 2007), together with an NMR solution structure of domain 1 of CD5 (Garza-Garcia et al., 2008), constitute the only sources of structural information on SRCR group B domains. Comparing the structures, it is however apparent that the 3D assembly of the different domains in the two groups is overall conserved, all displaying a very similar fold.

Noticeably, bovine WC1 (Wijngaard et al., 1992) does not have a human counterpart, nor do the mouse SCART molecules (Kisielow et al., 2008). Similarly, the human macrophage specific receptor M160 is not found in all mammalian species, while the closely related molecule CD163, also specific to the monocytic/macrophage lineage, is clearly present in all genomes that we have examined. We have compared the similarity between individual domains of known and characterized members of the SRCR B group, and the corresponding domains in the bovine proteins (Table 2). While proteins such as S4D, SSc5D and CD163 show high levels of identity between human and bovine sequences, others like CD5 and Sp are more distantly related. M160 does not have a straightforward ortholog in cattle, so the bovine sequence used was of the related molecule M160-like, that is related in turn to the SCART 1 and 2 molecules present in the mouse.

Fig. 1. Sequence alignment of domains from group B SRCR superfamily members.

SRCR domains are typically sequences of 100-110 amino acids in length compacted into a heart-shaped fold, where a six/seven-stranded β-sheet cradles an α-helix. Strands β1, β3 and β4, together with β7, form a curved sheet that wraps around the core α1 helix. From β4 onwards, the structures start to diverge. It is the sequence of amino acids between the beginning of the domain and the β4 strand that is best conserved between group A and group B domains, and that roughly corresponds in the group A proteins to the first of two exons that encode a full SRCR A domain, and in group B proteins to the first 50 amino acids of the domain.

#### **2.3 Homology between SRCR domains**

The level of amino acid identity among human SRCR group B domains from different molecules varies from 20% to 80%, but within the same molecule this level can be higher, and some domains are even identical (e.g. domains 3 and 7, and 10 and 11 of DMBT1). Similarly, some molecules are remarkably conserved between species, especially among mammals, although it appears that some level of conservation can be extended to birds, fish and amphibians in a few specific cases. There are good indications of orthologs of CD6 in the genomes of *T. guttata* and *D. rerio*, among other examples. Nevertheless, the structure of SRCR group B-containing molecules is best preserved in mammalian species. The strong homology of SSc5D domains dates back to the divergence of egg-laying and non-egg-laying mammals, while CD163 has clearly conserved orthologs in all mammals, including non-placental species (Table 1).
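The identity percentages quoted in this section depend on how identity is counted. As a minimal sketch, assuming two domain sequences that have already been aligned to equal length with `-` marking gaps, and assuming the denominator is the number of columns in which both sequences carry a residue (other conventions exist, and the text does not state which one was used), the calculation could look as follows. The function name and the two sequences are invented for illustration and are not real SRCR domains.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == "-" or res_b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical aligned fragments (assumption: invented, for illustration only).
dom_x = "VRLVGGSC-RGRVEV"
dom_y = "VRLVDGPCHRGRLEV"
print(round(percent_identity(dom_x, dom_y), 1))
```

Choosing the shorter sequence length or the full alignment length as the denominator would give lower values for gapped alignments, so the convention should be fixed before comparing numbers across tables.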


Table 1. Homology between human and other mammalian CD163 domains. Numbers represent percentage of identity between each domain, compared to the human sequence.

The significantly conserved homology of some SRCR orthologs is suggestive of profound functional constraints acting on these proteins. On the other hand, it appears that not all human SRCR B group molecules have described orthologs in all mammalian species, and conversely, that there are some SRCR proteins described in different animals that have not been reported in man.

Notably, bovine WC1 (Wijngaard et al., 1992) does not have a human counterpart, nor do the mouse SCART molecules (Kisielow et al., 2008). Similarly, the human macrophage-specific receptor M160 is not found in all mammalian species, while the closely related molecule CD163, also specific to the monocytic/macrophage lineage, is clearly present in all genomes that we have examined. We have compared the similarity between individual domains of known and characterized members of the SRCR B group and the corresponding domains in the bovine proteins (Table 2). While proteins such as S4D, SSc5D and CD163 show high levels of identity between human and bovine sequences, others like CD5 and Sp are more distantly related. M160 does not have a straightforward ortholog in cattle, so the bovine sequence used was that of the related molecule M160-like, which is related in turn to the SCART 1 and 2 molecules present in the mouse.


Table 2. Similarity between human and bovine corresponding SRCR domains. Percentage identity between each domain is indicated.

Fig. 2. Relationships between human and bovine SRCR molecules.

There are three groups of genes in the CD163 family: CD163 itself (CD163a), present in all mammals; M160 (CD163b), so far only found in the genomes of primates and in horses; and SCART (CD163c), of which there are two genes found in the mouse, and the bovine gene M160-like (Herzig et al., 2010). These sets of genes are related to WC1 genes expressed in cattle, sheep and swine. To obtain a better idea of the relationship between these families of genes, we aligned the full sequences of known human and bovine proteins containing SRCR group B domains using ClustalW and drew the corresponding phylogram (Fig. 2). As can be seen, there are no direct links between human M160, bovine M160L and bovine WC1, raising the possibility that either some genes were lost during mammalian evolution, or that the complete characterization and annotation of the genomes has still not been fully achieved. Clearly, the two hypotheses are not mutually exclusive.

#### **3. A systematic and thorough search for SRCR domains in the genome**

Our hope is that the evolution and function of SRCR domains will become clearer once all members of this protein family have been identified. The availability of the human genome sequence has allowed us to screen, using bioinformatics-based approaches, for new SRCR proteins not yet described or characterized. We decided to focus on group B molecules, given that proteins of this type are more conserved and restricted in number, and that their specialized function, in this case immune-related, seems better defined. We searched for new members of the SRCR-SF in the completed human genome sequence by interrogating the genome using TBLASTN 2.2.20+ (Altschul et al., 1997; http://www.ncbi.nlm.nih.gov/BLAST). Initially, we screened for new sequences exhibiting similarity with any or all of the SRCR domains comprising the then known SRCR superfamily proteins (Gonçalves et al., 2009). We expected that, for a given TBLASTN run, *bona fide* new SRCR domains would have smaller E values than the best matches of the search sequence with Group A SRCR domains. According to this criterion, the search identified the sequences encoding domains within already known and characterized proteins, *i.e.* CD5, CD6, Sp, S4D, CD163, M160 and DMBT1. Additionally, we identified a cluster of five new SRCR domains, which we investigated further and which later led to the cloning and characterization of SSc5D, a molecule that is secreted by macrophages and comprises five SRCR group B domains (Gonçalves et al., 2009).
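The selection criterion described above can be stated compactly: within one TBLASTN run, a hit is retained as a candidate group B domain only if its E value is smaller than the best (smallest) E value obtained against any group A domain. The sketch below assumes the hits have already been parsed into (label, E value) pairs and that the labels of group A matches are known. The function name, labels and E values are all hypothetical, and no BLAST output format is being modelled.

```python
from typing import Iterable, List, Set, Tuple

Hit = Tuple[str, float]  # (subject label, E value)


def candidate_group_b_hits(hits: Iterable[Hit], group_a_labels: Set[str]) -> List[Hit]:
    """Keep non-group-A hits whose E value beats the best group A match."""
    hits = list(hits)
    group_a_evalues = [e for label, e in hits if label in group_a_labels]
    # If the run produced no group A match, no cut-off applies (assumption).
    threshold = min(group_a_evalues, default=float("inf"))
    kept = [
        (label, e)
        for label, e in hits
        if label not in group_a_labels and e < threshold
    ]
    return sorted(kept, key=lambda hit: hit[1])


# Invented example run: two group A reference matches and three other loci.
run_hits = [
    ("locus_1", 2e-30),
    ("groupA_ref_1", 1e-8),
    ("locus_2", 5e-12),
    ("locus_3", 3e-4),
    ("groupA_ref_2", 4e-6),
]
group_a = {"groupA_ref_1", "groupA_ref_2"}
print(candidate_group_b_hits(run_hits, group_a))
```

Because the cut-off is relative to each run rather than a fixed E value, a divergent query can still yield candidates as long as they score better than the group A background for that query.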

A *caveat* in our methodology was that not all group B domains were identified using this strategy. The most divergent domains, namely those of CD5, were not retrieved in all searches, and in particular CD5-d1 was rarely identified as having a clear homology with any other group B domain. Sequence alignment of all group B domains (Fig. 1) highlights the striking differences of the CD5 sequences, and to some extent of the CD6 domains, when compared with the other sequences, which are remarkably similar to each other. In order to perform a more rigorous search for all putative SRCR group B domains, we conducted a comprehensive systematic search using PSI-BLAST (Altschul et al., 1997) to find distant homologs. All known SRCR domains were used as queries to search iteratively against the human non-redundant database with the target sequence length set to 250. The BLOSUM62 amino acid substitution matrix was used with a gap-opening penalty of 11 and a gap-extension penalty of 1. Sequence masking was disabled and the PSI-BLAST threshold was set to 0.005. Each PSI-BLAST query was iterated, including new hits from the previous search, until it converged, i.e. until no new hits were found in subsequent searches. After each search iteration, results were checked for new SRCR proteins. This method retrieved all known SRCR domain-containing proteins along with novel proteins (Table 3).
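The convergence rule used for the PSI-BLAST searches (repeat until an iteration adds no new hits) is a fixed-point loop. The sketch below illustrates only that stopping condition and does not run PSI-BLAST: `search_once` is a hypothetical stand-in for one search round, faked here with a small dictionary of invented identifiers. In PSI-BLAST itself, the hits of each round update a position-specific scoring matrix rather than being re-queried individually.

```python
from typing import Callable, Set


def iterate_until_converged(
    query_ids: Set[str],
    search_once: Callable[[Set[str]], Set[str]],
    max_rounds: int = 50,
) -> Set[str]:
    """Repeat the search, feeding hits back in, until no new hits appear."""
    included = set(query_ids)
    for _ in range(max_rounds):
        new_hits = search_once(included) - included
        if not new_hits:
            return included  # converged: this round added nothing
        included |= new_hits
    raise RuntimeError("search did not converge within max_rounds")


# Toy neighbour map standing in for a database search (invented identifiers).
TOY_NEIGHBOURS = {
    "q1": {"a", "b"},
    "a": {"c"},
    "b": set(),
    "c": {"a"},
}


def toy_search(included: Set[str]) -> Set[str]:
    """Return everything reachable in one step from the included sequences."""
    found: Set[str] = set()
    for seq_id in included:
        found |= TOY_NEIGHBOURS.get(seq_id, set())
    return found


print(sorted(iterate_until_converged({"q1"}, toy_search)))
```

The `max_rounds` guard is an addition of this sketch; it simply prevents an endless loop if a search never stabilizes.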



From this genome-wide search we obtained a total of 76 SRCR group B domains distributed in 11 genes, each putatively encoding a varying number of SRCR domains. The eleven genes are spread across the genome on seven different chromosomes: chromosome 10 contains three SRCR group B-encoding genes, chromosomes 11 and 12 contain two each, and chromosomes 1, 7, 14 and 19 each contain just one SRCR-encoding gene. In addition to known domains from characterized genes, our search uncovered 23 new putative group B domains, three of which represent previously unreported domains localized within the DMBT1 gene. The DMBT1 gene thus putatively encodes a maximum of 17 SRCR domains. There has been some controversy over the number of SRCR domains within the DMBT1 molecule. Like other SRCR-containing proteins (Castro et al., 2007; Padilla et al., 2002), DMBT1 can be expressed as different isoforms arising by alternative splicing, which include or exclude individual SRCR domains (Mollenhauer et al., 1999). It is possible that the new DMBT1 domains have not been previously reported because they are not expressed in the tissues or cells investigated; however, it is also plausible that the exons coding for these domains have been silenced during evolution and are now non-functional.
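The totals quoted above can be checked against Table 3 with a few lines of arithmetic. The script below transcribes the domain counts and chromosome assignments and recomputes the number of domains, genes and chromosomes. Two points are assumptions of this sketch: Sp is placed on chromosome 1, following the text, and the per-gene counts of new domains (3 in DMBT1, 8 in 8D, 11 in D11, 1 in HHIPL1) are read from the surrounding paragraphs rather than from a table column.

```python
from collections import Counter

# (protein, number of SRCR domains, chromosome), transcribed from Table 3.
TABLE_3 = [
    ("Sp", 3, "1"),
    ("S4D", 4, "7"),
    ("DMBT1", 17, "10"),
    ("CD5", 3, "11"),
    ("CD6", 3, "11"),
    ("CD163", 9, "12"),
    ("M160", 12, "12"),
    ("SSc5D", 5, "19"),
    ("8D", 8, "10"),
    ("D11", 11, "10"),
    ("HHIPL1", 1, "14"),
]

# Newly identified domains per gene, as stated in the text (assumption).
NEW_DOMAINS = {"DMBT1": 3, "8D": 8, "D11": 11, "HHIPL1": 1}

total_domains = sum(n_domains for _, n_domains, _ in TABLE_3)
genes_per_chromosome = Counter(chrom for _, _, chrom in TABLE_3)
total_new = sum(NEW_DOMAINS.values())

print("domains:", total_domains)
print("genes:", len(TABLE_3))
print("chromosomes:", len(genes_per_chromosome))
print("new domains:", total_new)
```

With these inputs the sums agree with the figures in the text: 76 domains, 11 genes, seven chromosomes and 23 new domains.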

The remaining new domains belong to three new putative genes: a gene encoding eight domains (8D); a gene annotated as a DMBT1-like pseudogene, which encodes 11 fragments of SRCR domains of variable lengths (D11); and a gene encoding a putative Hedgehog interacting protein-like 1 molecule (HHIP-like 1), which contains a single SRCR domain.

In order to analyze the sequence conservation and diversity of SRCR domains, we aligned all 76 individual SRCR group B domains using ClustalW2 (Thompson et al., 1994) with the default substitution matrix (Gonnet series) and gap opening and extension penalties of 10 and 0.2, respectively. Owing to the sequence diversity of SRCR domains, the multiple sequence alignment contains several insertions and deletions (Fig. 3).

To locate sequence patterns as well as conserved amino acids, the multiple sequence alignment was used to create a WebLogo (Crooks et al., 2004; http://weblogo.berkeley.edu).

| Protein | Number of SRCR domains | Chromosome |
|---|---|---|
| Sp | 3 | 1 |
| S4D | 4 | 7 |
| DMBT1 | 17\* | 10 |
| CD5 | 3 | 11 |
| CD6 | 3 | 11 |
| CD163 | 9 | 12 |
| M160 | 12 | 12 |
| SSc5D | 5 | 19 |
| 8D | 8 | 10 |
| D11# | 11 | 10 |
| HHIPL1 | 1 | 14 |

Table 3. List of SRCR-containing proteins. \* - DMBT1 has been described as containing 14 SRCR domains; # - annotated as a pseudogene; in red denotes new SRCR domains from uncharacterized proteins.

Fig. 3. Multiple sequence alignment of all SRCR domains.

The overall height of the stack indicates the sequence conservation at that position, while the height of symbols within the stack indicates the relative frequency of each amino acid at that position. It is apparent from the WebLogo that, although sequences vary substantially between SRCR domains, all cysteine residues (colored in red) are conserved across the family (Fig. 4).
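The stack height in a sequence logo is the information content of the alignment column, measured in bits. A minimal sketch for protein alignments follows, assuming the uncorrected form R = log2(20) − H, where H is the Shannon entropy of the observed residue frequencies. WebLogo additionally applies a small-sample correction, which is omitted here, and gap characters are simply ignored, which is a simplification of this sketch. The function names and toy columns are invented.

```python
import math
from collections import Counter
from typing import Dict


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content (bits) of one alignment column, gaps ignored."""
    residues = [r for r in column.upper() if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(residues).values()
    )
    return math.log2(alphabet_size) - entropy


def symbol_heights(column: str, alphabet_size: int = 20) -> Dict[str, float]:
    """Height of each residue symbol: its frequency times the stack height."""
    residues = [r for r in column.upper() if r != "-"]
    info = column_information(column, alphabet_size)
    total = len(residues)
    return {res: info * n / total for res, n in Counter(residues).items()}


# Toy columns (invented): a fully conserved cysteine and a variable position.
conserved = "CCCCCCCC"
variable = "ASTGASTG"
print(round(column_information(conserved), 3))
print(round(column_information(variable), 3))
print(symbol_heights(variable))
```

A fully conserved column reaches the maximum of log2(20) bits, which is why invariant cysteines appear as the tallest stacks in a logo of this kind.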

Fig. 4. Multiple sequence alignment of SRCR domains in WebLogo format.

Fig. 5. Maximum likelihood phylogenetic estimation of human SRCR domains. Internal branch reliability was assessed using bootstrapping method (100 bootstrap replicates). Branches observed in more than 75 of 100 bootstrapped re-sampling are shown in red.

In order to estimate evolutionary relationships among SRCR domains, we performed a phylogenetic analysis using the maximum likelihood reconstruction method (proml) available in the PHYLIP phylogenetic package (Felsenstein, 1989). Gaps were trimmed from the SRCR multiple sequence alignment prior to tree building, and the Jones-Taylor-Thornton probability model was employed with a constant rate of change among sites. The reliability of internal branches was subsequently evaluated using 100 bootstrap samplings. SRCR domains exhibit very complex evolutionary relationships (Fig. 5). In the reconstructed phylogenetic tree, both intra-protein and inter-protein domain clusters were observed. Intra-protein domain clustering, as in the case of DMBT1, strongly suggests the evolution of these domains via sequential intragenic duplication. The inter-protein domain similarities are more difficult to interpret. CD5, CD6, Spα, M160 and CD163 exhibit more diverse relationships, and their SRCR domains show greater inter-protein than intra-protein similarities. Given their low sequence similarities, it is uncertain whether these domains evolved through gene duplication, with accumulated mutations subsequently reducing sequence similarity, or through convergent evolution subsequent to domain shuffling. The similarities of the domain pairs M160\_d4-CD163\_d1, M160\_d7-CD163\_d4, M160\_d8-CD163\_d5, M160\_d9-CD163\_d6, M160\_d10-CD163\_d7, M160\_d11-CD163\_d8, M160\_d12-CD163\_d9, CD5\_d1-CD6\_d3 and CD5\_d2-S4D\_d4 are strongly suggestive of inter-protein domain shuffling.
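Two of the preparatory steps described above, removing gapped columns and drawing bootstrap replicates, can be sketched independently of PHYLIP. The example assumes that trimming gaps means discarding every alignment column that contains a gap in any sequence (the text does not specify the rule), and it resamples columns with replacement, as is usual for bootstrapping alignments. Tree building with proml is not reproduced, and the names and sequences are invented.

```python
import random
from typing import Dict, List

Alignment = Dict[str, str]  # sequence name -> aligned sequence


def trim_gapped_columns(aln: Alignment) -> Alignment:
    """Drop every column in which at least one sequence has a gap."""
    names = list(aln)
    length = len(aln[names[0]])
    if any(len(seq) != length for seq in aln.values()):
        raise ValueError("all aligned sequences must have the same length")
    keep = [i for i in range(length) if all(aln[n][i] != "-" for n in names)]
    return {n: "".join(aln[n][i] for i in keep) for n in names}


def bootstrap_replicates(
    aln: Alignment, n_replicates: int = 100, seed: int = 0
) -> List[Alignment]:
    """Resample alignment columns with replacement, once per replicate."""
    rng = random.Random(seed)
    names = list(aln)
    length = len(aln[names[0]])
    replicates = []
    for _ in range(n_replicates):
        cols = [rng.randrange(length) for _ in range(length)]
        replicates.append({n: "".join(aln[n][i] for i in cols) for n in names})
    return replicates


# Toy alignment (invented, not real SRCR domains).
toy = {"dom_1": "AC-DEFG", "dom_2": "ACQDE-G", "dom_3": "SCQDEFG"}
trimmed = trim_gapped_columns(toy)
reps = bootstrap_replicates(trimmed, n_replicates=100)
print(trimmed)
print(len(reps), len(reps[0]["dom_1"]))
```

Each replicate would then be passed to the tree-building program, and the support for a branch is the number of replicate trees in which it appears; Fig. 5 highlights branches found in more than 75 of the 100 replicates.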

#### **4. Concluding remarks - the completion of the SRCR group B family**

In contrast to the complexity and variety of large protein families such as the G protein-coupled receptor (GPCR) superfamily, which has nearly 800 genes in the human genome, corresponding to roughly 4% of the full protein-encoding genome, group B of the scavenger receptor cysteine-rich superfamily appears to be much more limited. So far it includes only 8 members in the entire human genome, although 3 additional proteins have been described in other mammals: SCART1 and SCART2, initially found in mice, and the 11 SRCR domain-containing protein WC1, expressed in cattle, sheep and swine. Also, the function of mammalian SRCR proteins seems to be restricted to the immune system, although the exact nature or biological role of the family is still to be fully determined.

In this study we set out to identify the remaining members of the SRCR group B family in order to obtain a clear understanding of the biological significance of this important group of proteins and to clarify some as yet unresolved questions regarding their evolution in mammalian species. We searched the human genome for the presence of SRCR-encoding genes using as probes the amino acid sequences of all reported human SRCR domains. Interestingly, one of the new members we have identified, HHIP-like 1, contains a single SRCR domain, which is unprecedented in the family. Moreover, the amino acid sequence corresponding to the SRCR domain constitutes only a small fraction (13%) of the putative protein. This is in contrast with most other members, in which SRCR domains account for a much larger share of the protein, varying between 32% (SSc5D) and 92% (Sp). HHIP-like 1 is related to Hedgehog interacting protein, a regulatory component of the Hedgehog signaling pathway (Chuang and McMahon, 1999). However, unlike HHIP-like 1, HHIP does not contain an SRCR domain.

The second cluster of SRCR domains we have uncovered is located on chromosome 10 and includes 8 such domains; we have therefore provisionally termed it 8D. Using the predicted exon-derived protein sequence, we BLAST-searched other mammalian genomes, and the retrieved proteins most similar to human 8D were mouse SCART1, bovine M160L and mouse SCART2, whose ClustalW alignment scores were 70, 66 and 57, respectively (Fig. 6). Human 8D and mouse SCART1 have 64% identity over the entire sequence, while some individual SRCR domains share identities close to, and even above, 80%. We thus believe that 8D is the human ortholog of mouse SCART1. It remains to be seen whether human 8D can be expressed and produce a mature and functional protein, although we have detected several 8D transcripts of different sizes (C Gonçalves and A Carmo, unpublished).

Fig. 6. Sequence alignment of mouse SCART1, human 8D and bovine M160-like protein.

The last new set of domains we have identified is located in a gene also on chromosome 10, but has been annotated as a non-coding pseudogene. Analysis of its putative sequence derived from the exon-like sequences in fact reveals that some stretches of several of the SRCR domains are missing, in addition to a number of frameshifts and premature stop codons. Curiously, 11 SRCR-like domains can be identified, exactly the same number as in the typical bovine WC1 protein. Comparison between the two sequences has, however, failed to determine definitively whether these two genes have the same evolutionary origin, as individually identifiable SRCR or SRCR-type domains seem to have already drifted apart significantly.

With the recognition of the three new genes, albeit none of them proven to be functional as yet, together with the detection of three new putative SRCR-encoding sequences present in the DMBT1 gene, we are confident that we have completed the identification of the full set of scavenger receptor cysteine-rich group B domains in the human genome.

#### **5. Acknowledgements**

We thank Dr. Simon Lee for reviewing this manuscript. The work in Alexandre Carmo's laboratory is funded by FEDER through the Programa Operacional Factores de Competitividade – COMPETE, and by FCT – Fundação para a Ciência e a Tecnologia. Vattipally Sreenu is funded by the University of Oxford.

#### **6. References**

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res *25*, 3389-3402.

Aruffo, A., Bowen, M.A., Patel, D.D., Haynes, B.F., Starling, G.C., Gebe, J.A., and Bajorath, J. (1997). CD6-ligand interactions: a paradigm for SRCR domain function? Immunol Today *18*, 498-504.

Aruffo, A., Melnick, M.B., Linsley, P.S., and Seed, B. (1991). The lymphocyte glycoprotein CD6 contains a repeated domain structure characteristic of a new family of cell surface and secreted proteins. J Exp Med *174*, 949-952.

Biancone, L., Bowen, M.A., Lim, A., Aruffo, A., Andres, G., and Stamenkovic, I. (1996). Identification of a novel inducible cell-surface ligand of CD5 on activated lymphocytes. J Exp Med *184*, 811-819.

Bikker, F.J., Ligtenberg, A.J., Nazmi, K., Veerman, E.C., van't Hof, W., Bolscher, J.G., Poustka, A., Nieuw Amerongen, A.V., and Mollenhauer, J. (2002). Identification of the bacteria-binding peptide domain on salivary agglutinin (gp-340/DMBT1), a member of the scavenger receptor cysteine-rich superfamily. J Biol Chem *277*, 32109-32115.

Calvo, J., Places, L., Padilla, O., Vilà, J.M., Vives, J., Bowen, M.A., and Lozano, F. (1999). Interaction of recombinant and natural soluble CD5 forms with an alternative cell surface ligand. Eur J Immunol *29*, 2119-2129.

Castro, M.A.A., Nunes, R.J., Oliveira, M.I., Tavares, P.A., Simões, C., Parnes, J.R., Moreira, A., and Carmo, A.M. (2003). OX52 is the rat homologue of CD6: evidence for an effector function in the regulation of CD5 phosphorylation. Journal of Leukocyte Biology *73*, 183-190.

Castro, M.A.A., Oliveira, M.I., Nunes, R.J., Fabre, S., Barbosa, R., Peixoto, A., Brown, M.H., Parnes, J.R., Bismuth, G., Moreira, A., *et al.* (2007). Extracellular Isoforms of CD6 generated by alternative splicing regulate targeting of CD6 to the immunological synapse. Journal of Immunology *178*, 4351-4361.

Chuang, P.T., and McMahon, A.P. (1999). Vertebrate Hedgehog signalling modulated by induction of a Hedgehog-binding protein. Nature *397*, 617-621.

Crooks, G.E., Hon, G., Chandonia, J.M., and Brenner, S.E. (2004). WebLogo: a sequence logo generator. Genome Res *14*, 1188-1190.

Fabriek, B.O., van Bruggen, R., Deng, D.M., Ligtenberg, A.J., Nazmi, K., Schornagel, K., Vloet, R.P., Dijkstra, C.D., and van den Berg, T.K. (2009). The macrophage scavenger receptor CD163 functions as an innate immune sensor for bacteria. Blood *113*, 887-892.

Felsenstein, J. (1989). Mathematics vs. Evolution: Mathematical Evolutionary Theory. Science *246*, 941-942.


A Systematic and Thorough Search for Domains

Nat Genet *17*, 32-39.

Biol Chem *282*, 16654-16666.

Lymphoma *36*, 353-365.

USA *104*, 11724-11729.

37.

domain. J Biol Chem *282*, 12669-12677.

recognition receptor. J Biol Chem *280*, 35391-35398.

atherosclerosis. Atherosclerosis *207*, 103-110.

6240.

of the Scavenger Receptor Cysteine-Rich Group-B Family in the Human Genome 209

Mollenhauer, J., Holmskov, U., Wiemann, S., Krebs, I., Herbertz, S., Madsen, J., Kioschis, P.,

Mollenhauer, J., Wiemann, S., Scheurlen, W., Korn, B., Hayashi, Y., Wilgenbus, K.K., von

Moreno, J.A., Muñoz-García, B., Martín-Ventura, J.L., Madrigal-Matute, J., Orbe, J., Páramo,

Ojala, J.R., Pikkarainen, T., Tuuttila, A., Sandalova, T., and Tryggvason, K. (2007). Crystal

Padilla, O., Pujana, M.A., López-de la Iglesia, A., Gimferrer, I., Arman, M., Vilà, J.M., Places,

Resnick, D., Pearson, A., and Krieger, M. (1994). The SRCR superfamily: a family

Rodamilans, B., Muñoz, I.G., Bragado-Nilsson, E., Sarrias, M.R., Padilla, O., Blanco, F.J.,

Sarrias, M.R., Roselló, S., Sánchez-Barbero, F., Sierra, J.M., Vila, J., Yélamos, J., Vives, J.,

Sarrias, M.R., Farnós, M., Mota, R., Sánchez-Barbero, F., Ibáñez, A., Gimferrer, I., Vera, J.,

Sarrias, M.R., Grønlund, J., Padilla, O., Madsen, J., Holmskov, U., and Lozano, F. (2004). The

Somoza, J.R., Ho, J.D., Luong, C., Ghate, M., Sprengeler, P.A., Mortara, K., Shrader, W.D.,

scavenger receptor cysteine-rich (SRCR) domain. Structure *11*, 1123-1131.

reminiscent of the Ig superfamily. Trends Biochem Sci *19*, 5-8.

mapping to human chromosome 7q11.23. Immunogenetics *54*, 621-634. Pospisil, R., Silverman, G.J., Marti, G.E., Aruffo, A., Bowen, M.A., and Mage, R.G. (2000).

Coy, J.F., and Poustka, A. (1999). The genomic structure of the DMBT1 gene: evidence for a region with susceptibility to genomic instability. Oncogene *18*, 6233-

Deimling, A., and Poustka, A. (1997). DMBT1, a new member of the SRCR superfamily, on chromosome 10q25.3-26.1 is deleted in malignant brain tumours.

J.A., Ortega, L., Egido, J., and Blanco-Colio, L.M. (2009). The CD163-expressing macrophages recognize and internalize TWEAK: potential consequences in

structure of the cysteine-rich domain of scavenger receptor MARCO reveals the presence of a basic and an acidic cluster that both contribute to ligand recognition. J

L., Vives, J., Estivill, X., and Lozano, F. (2002). Cloning of S4D-SRCRB, a new soluble member of the group B scavenger receptor cysteine-rich family (SRCR-SF)

CD5 is a potential selecting ligand for B-cell surface immunoglobulin: a possible role in maintenance and selective expansion of normal and malignant B cells. Leuk

Lozano, F., and Montoya, G. (2007). Crystal structure of the third extracellular domain of CD5 reveals the fold of a group B scavenger cysteine-rich receptor

Casals, C., and Lozano, F. (2005). A role for human Sp alpha as a pattern

Fenutría, R., Casals, C., Yélamos, J.*, et al.* (2007). CD6 binds to pathogen-associated molecular patterns and protects from LPS-induced septic shock. Proc Natl Acad Sci

Scavenger Receptor Cysteine-Rich (SRCR) domain: an ancient and highly conserved protein module of the innate immune system. Crit Rev Immunol *24*, 1-

Sperandio, D., Chan, H., McGrath, M.E.*, et al.* (2003). The structure of the extracellular region of human hepsin reveals a serine protease domain and a novel


Freeman, M., Ashkenas, J., Rees, D.J., Kingsley, D.M., Copeland, N.G., Jenkins, N.A., and

Garza-Garcia, A., Esposito, D., Rieping, W., Harris, R., Briggs, C., Brown, M.H., and Driscoll,

Gebe, J.A., Kiener, P.A., Ring, H.Z., Li, X., Francke, U., and Aruffo, A. (1997). Molecular

Gimferrer, I., Farnós, M., Calvo, M., Mittelbrunn, M., Enrich, C., Sánchez-Madrid, F., Vives,

Gonçalves, C.M., Castro, M.A.A., M., Henriques, T., Oliveira, M.I., Pinheiro, H.C., H.,

Graversen, J.H., Madsen, M., and Moestrup, S.K. (2002). CD163: a signal receptor scavenging

Gronlund, J., Vitved, L., Lausen, M., Skjodt, K., and Holmskov, U. (2000). Cloning of a novel

Herzig, C.T., Waters, R.W., Baldwin, C.L., and Telfer, J.C. (2010). Evolution of the CD163

Hohenester, E., Sasaki, T., and Timpl, R. (1999). Crystal structure of a scavenger receptor

Holmskov, U., Mollenhauer, J., Madsen, J., Vitved, L., Gronlund, J., Tornoe, I., Kliem, A.,

Kisielow, J., Kopf, M., and Karjalainen, K. (2008). SCART scavenger receptors identify a novel subset of adult gammadelta T cells. J Immunol *181*, 1710-1716. Kristiansen, M., Graversen, J.H., Jacobsen, C., Sonne, O., Hoffman, H.J., Law, S.K., and

Law, S.K., Micklem, K.J., Shaw, J.M., Zhang, X.P., Dong, Y., Willis, A.C., and Mason, D.Y.

membrane of lymphoid T cells. J Biol Chem *278*, 8564-8571.

rich superfamily. Molecular Immunology *46*, 2585-2596.

by human macrophages. J Immunol *165*, 6406-6415.

lymphocyte glycoprotein T1/Leu-1. Nature *323*, 346-349.

scavenger receptor superfamily. Eur J Immunol *23*, 2320-2325.

receptors. Proc Natl Acad Sci USA *87*, 8810-8814.

proteins. J Biol Chem *272*, 6151-6158.

*378*, 129-144.

314.

232.

Evol Biol *10*, 181.

*409*, 198-201.

Krieger, M. (1990). An ancient, highly conserved family of cysteine-rich protein domains revealed by cloning type I and type II murine macrophage scavenger

P.C. (2008). Three-dimensional solution structure and conformational plasticity of the N-terminal scavenger receptor cysteine-rich domain of human CD5. J Mol Biol

cloning, mapping to human chromosome 1 q21-q23, and cell binding characteristics of Spalpha, a new member of the scavenger receptor cysteine-rich (SRCR) family of

J., and Lozano, F. (2003). The accessory molecules CD5 and CD6 associate on the

Oliveira, C., Sreenu, V.B., Evans, E.J., Davis, S.J., Moreira, A.*, et al.* (2009). Molecular cloning and analysis of SSc5D, a new member of the scavenger receptor cysteine-

haptoglobin-hemoglobin complexes from plasma. Int J Biochem Cell Biol *34*, 309-

scavenger receptor cysteine-rich type I transmembrane molecule (M160) expressed

family and its relationship to the bovine gamma delta T cell co-receptor WC1. BMC

cysteine-rich domain sheds light on an ancient superfamily. Nat Struct Biol *6*, 228-

Reid, K.B., Poustka, A., and Skjodt, K. (1999). Cloning of gp-340, a putative opsonin receptor for lung surfactant protein D. Proc Natl Acad Sci USA *96*, 10794-10799. Jones, N.H., Clabby, M.L., Dialynas, D.P., Huang, H.J., Herzenberg, L.A., and Strominger,

J.L. (1986). Isolation of complementary DNA clones encoding the human

Moestrup, S.K. (2001). Identification of the haemoglobin scavenger receptor. Nature

(1993). A new macrophage differentiation antigen which is a member of the


**11** 



### **Assessing Multiple Sequence Alignments Using Visual Tools**

Catherine L. Anderson<sup>1</sup>, Cory L. Strope<sup>2</sup> and Etsuko N. Moriyama<sup>2,3</sup>

*<sup>1</sup>Department of Computer Science and Engineering, <sup>2</sup>School of Biological Sciences and <sup>3</sup>Center for Plant Science Innovation, University of Nebraska-Lincoln, U.S.A.*

#### **1. Introduction**


Bioinformatics and molecular evolutionary analyses most often start by aligning DNA or amino acid sequences in order to compare them. Pairwise alignment, for example, is used to measure the similarity between a query sequence and each sequence in a database in a BLAST similarity search, the most widely used bioinformatics tool (Altschul *et al.*, 1990; Camacho *et al.*, 2009). The evolutionary history of sequences is reflected better when more than two sequences are aligned in a multiple sequence alignment (MSA). When building an MSA, we assume that the sequences compared are derived from a common ancestral sequence. Building an MSA then consists of inferring homologous positions among the input sequences and placing gaps in the sequences so that these homologous positions are aligned. These gaps represent evolutionary events of their own. Gaps (also called indels) are caused by insertions or deletions of characters (nucleotides or amino acids) on a particular lineage of sequences during evolution. Building an MSA therefore amounts to reconstructing the evolutionary history of the sequences involved. While it is easy to understand that the quality of MSAs affects the quality of phylogenetic tree reconstruction, the effect of MSA quality reaches far beyond this. Some examples of bioinformatics methods that use information extracted from MSAs include profile building in similarity search (*e.g.*, PSI-BLAST: Altschul *et al.*, 1997), motif/profile recognition (*e.g.*, PROSITE: Hulo *et al.*, 2008), profile hidden Markov models for protein families/domains (*e.g.*, Pfam: Finn *et al.*, 2010), and protein secondary-structure prediction (for review, see Pirovano & Heringa, 2010). Numerous bioinformatics and molecular evolutionary analyses are thus affected by MSA quality and benefit from reliable MSAs.

Despite the significance of having good MSAs, assessing MSA quality is far from straightforward. Measuring the quality of MSAs requires two components: a benchmark dataset and a scoring method. A benchmark dataset includes *reference alignments*. These alignments are considered to represent the evolutionary history of the sequences truthfully. The same set of sequences included in a reference alignment is then aligned using the MSA methods to be tested. The *reconstructed* MSA is then compared with the reference MSA using a scoring method, and its quality is assessed relative to the reference MSA. Problems exist both in the benchmark MSA datasets and in the methods used to measure MSA quality.

The majority of benchmark MSA datasets are built from real sequences by aligning structural elements, in some cases with hand-curation (*e.g.*, PREFAB: Edgar, 2004b; OXBench: Raghava *et al.*, 2003; HOMSTRAD: Stebbings & Mizuguchi, 2004; BAliBASE: Thompson *et al.*, 2005; Thompson *et al.*, 2011; SABmark: Van Walle *et al.*, 2005). Since the true evolutionary history of the sequences included in these datasets is unknown, positional homologies among the sequences are also unknown, and the accuracy of these reference MSAs is subjective (for some issues with benchmark datasets, see Edgar, 2010). Other benchmark datasets are generated by simulating sequence evolution based on specific molecular evolutionary models (*e.g.*, IRMBASE: Subramanian *et al.*, 2005). The advantage of these simulated datasets is that the evolutionary history of the sequences (the guide tree) is known and the *true* alignment is given as an outcome of the simulation. Since the evolutionary history is known, these datasets can be used to assess the quality of both MSA and phylogenetic reconstruction methods. The disadvantage is that the biological correctness of the simulation relies solely on the evolutionary models used.

Issues also exist in the methods used to measure the quality of MSAs. While a number of statistics have been proposed (*e.g.*, Position Shift Error score: Cline *et al.*, 2002; sum-of-pairs score and column score: Thompson *et al.*, 1999), there is no definitive answer as to how the 'biological correctness' of MSAs should be measured. It remains for end users to incorporate these statistics into their own evaluation of this 'biological correctness'.

Due to its significant impact on many bioinformatics and molecular evolutionary studies, MSA is one of the most scrutinized fields in bioinformatics (Kemena & Notredame, 2009; Thompson *et al.*, 2011). However, assessment of MSAs is usually left to power users. Regular users often simply run one MSA method and proceed to the next analysis without examining their alignment output (Morrison, 2009b). Considering how MSA quality affects the outcomes of further analyses, assessment of MSAs should be included as a regular part of sequence analysis. In order to facilitate comparative analysis of MSAs, we recently developed a software package called SuiteMSA (Anderson *et al.*, 2011). SuiteMSA provides several alignment-viewing tools that allow the user to compare MSAs both visually and quantitatively. SuiteMSA also includes a feature-rich biological sequence simulator, indel-Seq-Gen v2.1 (Strope *et al.*, 2009), with a user-friendly graphical interface, allowing users to generate their own benchmark alignments for testing various MSA methods.

In this chapter, we first review some of the statistics used to assess the quality of MSAs, focusing on those used in SuiteMSA. We then describe how MSA comparison can actually be performed using the various MSA viewers available in SuiteMSA. Five examples are chosen from diverse types of alignment problems: proteins with secondary structures, transmembrane proteins, proteins with length variation, simulated protein sequences, and ribosomal DNAs. These comparisons illustrate how various MSA methods perform differently based on their underlying assumptions. We also discuss how different alignment statistics should be used for assessing MSAs, as well as their limitations.<sup>1</sup>

<sup>1</sup> All input files and alignments shown in this chapter are available from the following website: http://bioinfolab.unl.edu/~canderson/SuiteMSA/supplement.html

#### **2. Statistics used to assess multiple sequence alignments**

There are two types of alignment statistics. The first type of statistics is used to characterize a single alignment for the level of conservation in each alignment position and for various gap measures. These are descriptive measures for a specific alignment and should not be interpreted as a measure of the alignment quality. The second type of statistics can be used to compare any two alignments containing the same sequences.

#### **2.1 Descriptive statistics on a single multiple sequence alignment**

We describe the following two descriptive statistics: information content and average hydrophobicity. Both are calculated on a per column basis.

#### **2.1.1 Information content**


The Shannon entropy is a measure of the amount of uncertainty (Shannon, 1948). When it is applied to MSA analysis, it is interpreted as a measure of the diversity of characters within a given alignment column (Schneider & Stephens, 1990). The amount of information conveyed, or information content, is given by the decrease in this uncertainty and represents the level of sequence conservation within a column.

Formally defined, the entropy for the *k*th column of an alignment is given as:

$$H(k) = -\sum\_{s \in k} f(s,k) \log\_2 f(s,k) \,, \tag{1}$$

where *s* is any character contained in column *k* and *f(s,k)* is the frequency of *s* in column *k*. If character *s* occurs *x<sub>s</sub>* times in a column that contains *x* non-gap characters, *f(s,k)* is calculated as *x<sub>s</sub>/x*. The information content in the *k*th column is given as:

$$I(k) = \log\_2 S - H(k) \,, \tag{2}$$

where *S* is the number of character types for an alignment (4 for a nucleotide alignment and 20 for an amino acid alignment). Both *H(k)* and *I(k)* have their units in bits.

It can be seen from these equations that the higher the number of distinct characters within a column, the higher the entropy value (*H*) and thus the lower the information content (*I*) of the column. For a completely conserved column *c*, one that contains only one type of character, the entropy *H(c)* is 0; it thus contains the maximum amount of information. For a nucleotide alignment this maximum value is 2 bits, while for an amino acid alignment it is 4.32 bits.

Note that gaps are not considered in calculating *f(s,k)* in equation (1). Excluding gaps from calculation could inflate the information content for a column that contains many gaps. A single character in a column of gaps, for example, can be erroneously attributed a maximum information content. In order to compensate for this situation, the column information calculation is normalized by multiplying each column's information content by the proportion of non-gap characters present in the column (Schneider & Stephens, 1990).

While the information content is a measure applicable to a single alignment, it can be useful to compare the information statistics among alternate alignments for trends.
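The calculation in equations (1) and (2), together with the gap normalization, can be sketched in a few lines of Python. This is only an illustrative sketch and not the SuiteMSA implementation: the function names, the use of `-` as the gap character, and the choice to return 0 for an all-gap column are assumptions made here.

```python
import math
from collections import Counter

GAP = "-"  # assumed gap character


def column_entropy(column):
    """Shannon entropy H(k) of one alignment column, ignoring gaps (equation 1)."""
    residues = [c for c in column if c != GAP]
    x = len(residues)
    if x == 0:
        return 0.0  # assumption: an all-gap column carries no uncertainty
    counts = Counter(residues)
    return -sum((n / x) * math.log2(n / x) for n in counts.values())


def information_content(column, n_types, gap_normalize=True):
    """Information content I(k) = log2(S) - H(k) in bits (equation 2).

    n_types is S (4 for nucleotides, 20 for amino acids). With
    gap_normalize=True the value is multiplied by the proportion of
    non-gap characters in the column.
    """
    residues = [c for c in column if c != GAP]
    if not residues:
        return 0.0  # assumption: an all-gap column has no information
    info = math.log2(n_types) - column_entropy(column)
    if gap_normalize:
        info *= len(residues) / len(column)
    return info


if __name__ == "__main__":
    print(information_content("AAAA", 4))   # fully conserved nucleotide column
    print(information_content("ACGT", 4))   # maximally diverse nucleotide column
    print(information_content("A---", 4))   # single character in a gappy column
```

Without the normalization, the column `A---` would receive the maximum value of 2 bits; with it, the value is scaled by the 1/4 of the column that is not a gap.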

#### **2.1.2 Average hydrophobicity**

Hydrophobicity is one of the most useful properties of amino acid residues and is directly related to the function and structure of proteins. Many different hydrophobicity indices are available (Kawashima *et al.*, 2008). By plotting hydrophobicity values along the sequence, the presence of functional/structural regions (*e.g.*, membrane-spanning regions in transmembrane proteins or core regions in globular proteins) can be predicted. For MSA analysis, comparing the distribution of hydrophobicity along the alignment among different MSAs can provide a visual aid for evaluating the consistency between alignments. Equation (3) below shows how the average hydrophobicity for column *k*, *h(k)*, is calculated for an alignment containing *N* sequences:

$$h(k) = \frac{\left(\sum\_{i=1}^{N} h\_i\right)}{N} \tag{3}$$

where *h<sub>i</sub>* is the hydrophobicity index value of the *i*th residue of column *k*. In SuiteMSA, the hydrophobicity index of Kyte and Doolittle (1982) is used, and a value of 0 is assigned to a gap.
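Equation (3) can be illustrated with the short Python sketch below. It is not taken from SuiteMSA: the function name is hypothetical, the table holds the standard Kyte and Doolittle (1982) hydropathy values, and treating any character outside the 20 standard amino acids (not only gaps) as 0 is an assumption made here for simplicity.

```python
# Kyte-Doolittle (1982) hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def average_hydrophobicity(column):
    """Average hydrophobicity h(k) of one alignment column (equation 3).

    The sum is divided by N, the number of sequences, so gaps (scored 0)
    pull the average toward zero. Assumption: unknown characters such as
    'X' are also scored 0.
    """
    if not column:
        raise ValueError("column must contain at least one character")
    total = sum(KYTE_DOOLITTLE.get(c.upper(), 0.0) for c in column)
    return total / len(column)


if __name__ == "__main__":
    alignment = ["MIL-A", "MVLKA", "MILRA"]  # toy alignment, for illustration only
    for k in range(len(alignment[0])):
        column = "".join(row[k] for row in alignment)
        print(k, column, round(average_hydrophobicity(column), 2))
```

Plotting these per-column values for two alternative alignments of the same sequences gives the kind of visual comparison described above.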

#### **2.2 Measuring the similarity between two multiple sequence alignments**

As mentioned earlier, many statistics have been proposed to compare two MSAs. The sum-of-pairs score (SPS) and the column score (CS) are the two used most often. Both scores were proposed by Thompson *et al.* (1999). The two scores respond differently to varying degrees of inconsistency between the MSAs compared.

When comparing two alignments, one is referred to as the *reference alignment* and the other as the *test alignment*. The test alignment is compared against the reference. If the reference alignment is known to be 'correct', these statistics can be used to measure alignment quality. As mentioned before, however, the 'correctness' of an alignment can be highly subjective for many of the available benchmark datasets. An alignment can be said to be truly 'correct' only if its exact evolutionary history is known and the alignment reflects it correctly. This is usually possible only if the alignment was generated by a sequence evolution simulator. Even when the 'true' alignment can be obtained by sequence simulation, however, the 'biological realism' of the evolutionary model used in the simulation becomes an issue. In this chapter, SPS and CS are thus used more as general comparison measures.

#### **2.2.1 Sum-of-pairs score (SPS)**

To calculate the SPS for a test MSA against the reference MSA, each pair of characters within an alignment column is treated as an alignment unit. The per-column SPS is the number of alignment units within a specific column of the test alignment that are also aligned in the same column of the reference alignment. The total of all per-column scores from the entire alignment is obtained and normalized by dividing by the total number of character pairs. This is formally defined as follows:


i. Let an alignment of length *M* containing *N* sequences be an *N* by *M* array, **A**. Then the character in the *i*th sequence and *k*th column of the alignment is identified as *Aik*.

ii. Let there be two alignments for comparison: alignment **Ar** (referred to as the reference alignment) of length *Mr* containing *N* sequences and alignment **A** (referred to as the test alignment) of length *M* containing *N* sequences, where *Mr* and *M* can be but are not required to be equal.

iii. To examine the *k*th column of **A**, consisting of elements *A1k*, *A2k*, … *ANk*, let *pijk* be defined as:

$$\begin{cases} p\_{ijk} = 1 \text{ if } A\_{ik} \text{ and } A\_{jk} \text{ of alignment } \mathbf{A} \text{ are in the same column of } \mathbf{A}\_{\mathbf{r}}, \\ p\_{ijk} = 0 \text{ otherwise.} \end{cases} \tag{4}$$

iv. Then the score for the *k*th column of **A** is defined as:

$$S\_k = \sum\_{i=1}^{N} \sum\_{j=i+1}^{N} p\_{ijk} \, . \tag{5}$$

v. The score for the full alignment **A** is given as:

$$SPS = \frac{\sum\_{k=1}^{M} S\_k}{\sum\_{k=1}^{M\_r} S\_{rk}} \, , \tag{6}$$

where *Srk* is the score for the *k*th column of the reference alignment, **Ar**. This reference score is calculated as *Srk* = *x*(*x*-1)/2 where *x* is the number of characters in column *k* excluding gaps.

The maximum possible SPS is a value of 1.0 when **A** = **Ar**. The SPS is not symmetric in that the score generally differs if the reference and test alignments are switched.

#### **2.2.2 Column score (CS)**

To calculate the CS, the test and reference alignments are compared column-wise. The column score is the number of 'matched' columns between the test alignment and the reference alignment divided by the total number of 'considered' columns in the test alignment. This is formally defined as follows:

i. For the *k*th column of **A**:

$$\begin{cases} C\_k = 1 \text{ if all the characters in column } k \text{ of alignment } \mathbf{A} \text{ are matched in alignment } \mathbf{A}\_{\mathbf{r}}, \\ C\_k = 0 \text{ otherwise.} \end{cases} \tag{7}$$

ii. The column score for the full alignment **A** is given as:

$$CS = \frac{\sum\_{k=1}^{M} C\_k}{M} \, . \tag{8}$$

In SuiteMSA, two types of CS are calculated: un-gapped and gapped.

*Un-gapped CS*: This score considers only un-gapped columns (columns that have no gaps), where *M* of equation (8) equals the number of un-gapped columns in the alignment (shown in red in Fig. 1). For example, if an alignment has 500 columns, only 200 of which contain no gaps, and 150 of these 200 columns are exactly as they appear in the reference alignment, then the un-gapped CS is 150/200 = 0.75. The disadvantage of this criterion is that very gappy alignments with very few un-gapped columns can still produce a high column score if those un-gapped columns are all 'matched'. For instance, a test alignment of any length in which only one column is un-gapped will yield a column score of 1.0 if that column matches a column in the reference alignment.



Fig. 1. Illustration of the column score calculation. In the Test alignment, 'un-gapped' columns are shown in red. 'Un-gapped matched' columns are indicated with a red '+' under the alignment. For 'gapped' CS, all but the 5th column of the Test alignment are considered and these columns are shown in blue as well as red. However, only those columns indicated with '+' (both red and blue) are counted as 'matched' against the Reference alignment. In this example, 'un-gapped' CS is 0.5 (2 out of 4 columns are matched) and 'gapped' CS is 0.4 (4 out of 10 columns are considered to be matched).

*Gapped CS:* This score considers columns that contain more than 20% non-gap characters. To be 'matched', the characters that appear in a column of the test alignment must appear in a column of the reference alignment with no additional characters. For example, in Fig. 1, all but the 5th column of the Test alignment are considered. Columns 6-11 are not counted as 'matched'. This is because, for example, while in the Test alignment 'G' of T1 position 9 is aligned only with 'G' of T6 and T7, in the Reference alignment 'G' of T1 position 9 is aligned with 'G' of T2 as well as T6 and T7. The advantage of 'gapped' CS is that it allows more columns to be considered; columns with gaps can be matched if the same non-gap characters (but no other characters) are aligned in the reference alignment. This offsets the disadvantage of the potentially inflated un-gapped CS mentioned before.

Exclusion of any alignment columns that include gaps can be justified since gaps represent evolutionary events that are often not traceable. They are either the insertion of new characters, the deletion of existing characters, or a combination of the two. Therefore, while they are represented by the same gap symbol in the alignment, they are not equivalent. It is often not possible to infer if a gap in one alignment was generated by the same event as a gap in the second alignment. On the other hand, excluding all alignment positions with gaps, even those containing only a small number of gaps, may not be desirable. In SuiteMSA, as described above, a column is considered as long as it contains a number of non-gap characters above the 20% threshold. A third column score is also provided in SuiteMSA as '% consistency', which considers all columns regardless of the number of gaps. Comparing these values can help assess the difference between two alignments.
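The column-selection rules above can be sketched as follows. This is an illustrative reading of equations (7) and (8), not SuiteMSA's implementation: the function names, the `min_fraction` parameter, and the toy alignments are hypothetical. A test column counts as 'matched' when exactly the same set of residues (and no others) shares a column in the reference.

```python
def residue_columns(msa, gap_chars="-."):
    """For each column, the frozenset of (sequence index, residue index) it aligns."""
    counters = [0] * len(msa)  # next residue index for each sequence
    columns = []
    for k in range(len(msa[0])):
        members = []
        for i, seq in enumerate(msa):
            if seq[k] not in gap_chars:
                members.append((i, counters[i]))
                counters[i] += 1
        columns.append(frozenset(members))
    return columns


def column_score(test, reference, min_fraction=None):
    """Column score of equation (8) under three column-selection rules.

    min_fraction=None : un-gapped CS (only columns without gaps are considered)
    min_fraction=0.2  : gapped CS (columns with more than 20% non-gap characters)
    min_fraction=0.0  : every column holding at least one character
    """
    n = len(test)
    ref_cols = set(residue_columns(reference))
    considered = matched = 0
    for col in residue_columns(test):
        if min_fraction is None:
            use = len(col) == n
        else:
            use = len(col) / n > min_fraction
        if not use:
            continue
        considered += 1
        if col in ref_cols:  # same residues, no additional characters
            matched += 1
    return matched / considered if considered else 0.0


# Hypothetical toy alignments of the same three sequences.
reference_msa = ["AC-GT", "ACAGT", "A-AGT"]
test_msa = ["ACG-T", "ACAGT", "AA-GT"]
ungapped_cs = column_score(test_msa, reference_msa)
gapped_cs = column_score(test_msa, reference_msa, min_fraction=0.2)
```

Comparing the un-gapped and gapped values on the same pair of alignments shows how strongly the score depends on which columns are 'considered', which is one reason different programs can report different CS values.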

#### **2.2.3 Implementation of SPS and CS**

In addition to SuiteMSA, several implementations of SPS and CS are available as listed in Table 1. Note that not all of these programs generate the same value for the same alignment. The difference is caused by different criteria used to define, for example, 'matched' columns and which columns should be 'considered' for counting. When comparing scores, due to this inconsistency among programs, it is necessary to use the same implementation of scoring methods.


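| Program | Reference | Note |
|---|---|---|
| bali\_score | (Thompson *et al.*, 1999) | standalone; C program; MSF format. |
| VerAlign | available from http://www.ibi.vu.nl/programs/veralignwww | MSF format. |
| qscore | (Edgar, 2004b) | standalone; C++ program; calculates Q score (SPS), TC (CS), Modeler score, and Shift scores; fasta format. |
| SuiteMSA | (Anderson *et al.*, 2011) | part of the GUI software; fasta format. |

Table 1. Programs available to calculate SPS and CS. The actual SPS and CS values for alignments discussed in this chapter given by different programs are available from our website (see footnote 1).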
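For readers who want to see the arithmetic behind these programs, the SPS of equations (4)-(6) can be sketched as below. This is a minimal illustration and not the code of any program in Table 1; all function names and the toy alignments are hypothetical, and both alignments are assumed to contain the same sequences in the same order.

```python
from itertools import combinations


def residue_to_column(msa, gap_chars="-."):
    """Map every (sequence index, residue index) to the column that holds it."""
    mapping = {}
    for i, seq in enumerate(msa):
        r = 0
        for k, ch in enumerate(seq):
            if ch not in gap_chars:
                mapping[(i, r)] = k
                r += 1
    return mapping


def per_column_sps(test, reference, gap_chars="-."):
    """S_k of equation (5) for every column k of the test alignment."""
    ref_col = residue_to_column(reference, gap_chars)
    counters = [0] * len(test)
    scores = []
    for k in range(len(test[0])):
        members = []
        for i, seq in enumerate(test):
            if seq[k] not in gap_chars:
                members.append((i, counters[i]))
                counters[i] += 1
        # p_ijk = 1 when both residues also share a column in the reference.
        s_k = sum(1 for a, b in combinations(members, 2)
                  if ref_col[a] == ref_col[b])
        scores.append(s_k)
    return scores


def reference_score(reference, gap_chars="-."):
    """Sum of S_rk = x(x-1)/2 over the reference columns."""
    total = 0
    for k in range(len(reference[0])):
        x = sum(1 for seq in reference if seq[k] not in gap_chars)
        total += x * (x - 1) // 2
    return total


def sum_of_pairs_score(test, reference):
    """SPS of equation (6); assumes the reference has at least one residue pair."""
    return sum(per_column_sps(test, reference)) / reference_score(reference)


# Hypothetical toy alignments of the same three sequences.
reference_msa = ["AC-GT", "ACAGT", "A-AGT"]
test_msa = ["ACG-T", "ACAGT", "AA-GT"]
column_scores = per_column_sps(test_msa, reference_msa)
sps = sum_of_pairs_score(test_msa, reference_msa)
```

The per-column list corresponds to the kind of bar chart a viewer can draw along the test alignment, while the single ratio is the overall SPS.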

#### **3. Visual inspection of MSAs**


In the following sections, using various examples, we will show how MSAs can be compared using SuiteMSA's visual tools and statistics. See Anderson *et al.* (2011) and the SuiteMSA User's Manual for a detailed description of the various tools available in SuiteMSA. Among the numerous MSA methods currently available, we chose the seven MSA methods listed in Table 2 for comparative analysis. We chose these methods based on their general popularity in various bioinformatics analyses, their availability, and some of their features useful for aligning particular types of proteins (*e.g.*, transmembrane proteins).




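| Method (version) | Reference | Description |
|---|---|---|
| ClustalW2 (2.1) | (Larkin *et al.*, 2007) | Progressive alignment; weights sequences based on branch lengths and adjusts gap penalties; one of the earliest methods implemented. http://www.clustal.org/ |
| MUSCLE (3.8.31) | (Edgar, 2004a, 2004b) | Progressive alignment; fast distance estimation using k-mer counting; iterative refinement using tree-dependent restricted partitioning. http://www.drive5.com/muscle/ |
| MAFFT (6.843) | (Katoh & Toh, 2008) | Progressive alignment; L-INS-i method is used for iterative refinement incorporating local pairwise alignment information in this study. http://mafft.cbrc.jp/alignment/software/ |
| Probalign (1.4) | (Roshan & Livesay, 2006) | Uses partition function posterior probability estimates to compute maximum expected accuracy alignments. [eProbalign] http://probalign.njit.edu/probalign/login |
| PRANK (web version) | (Löytynoja & Goldman, 2005, 2008) | Phylogeny-aware gap handling; not meant for divergent sequences; recognizes insertions and deletions as distinct evolutionary events. [webPRANK] http://www.ebi.ac.uk/goldman-srv/webprank/ |
| PRALINE (web version) | (Pirovano *et al.*, 2008) | Progressive alignment with profile pre-processing; incorporates secondary structure and transmembrane information; PSIPRED and Phobius (for GPCR alignment) chosen for this study. http://www.ibi.vu.nl/programs/pralinewww/ |
| PROMALS (web version) | (Pei & Grishin, 2007) | Progressive alignment enhanced with profiles and secondary structure information; a hidden Markov model using a combined scoring of amino acids and secondary structures. http://prodata.swmed.edu/promals/ |

Table 2. The seven MSA methods compared in this study. All methods are used with the default options unless noted otherwise.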

#### **3.1 Examining a protein MSA with secondary structure prediction**

When protein sequences are aligned, it is useful to identify the location of their functional or structural landmarks to determine if such landmarks are aligned properly. Useful landmarks include secondary structures, transmembrane regions, and conserved domains or motifs. Color-coding MSAs based on properties of amino acids also helps determine if the distribution of different types of amino acids is consistent or varied among sequences.

#### **3.1.1 Inspecting a single MSA**

In Fig. 2, eight protein sequences of the lipocalin family (Pfam PF00061; Finn *et al.*, 2010) are aligned. The lipocalin family proteins are highly divergent at the sequence level yet highly conserved at the structure level (Flower *et al.*, 2000). The common structural feature among these proteins is a single eight-stranded antiparallel beta-barrel. The MSA shown in Fig. 2 was originally produced using PROMALS3D (Pei *et al.*, 2008) with manual adjustment (Strope *et al.*, 2009). Using SuiteMSA's secondary structure viewer, we aligned the lipocalin MSA with the secondary structures predicted from the eight sequences using PSIPRED (Jones, 1999). It can be seen in Fig. 2 that eight beta-strand regions (shown as brown-colored clusters of 'E' letters) are clearly well aligned with very few gaps.

Fig. 2 also shows the per-column information content displayed as a blue bar chart below the MSA. The information content reflects the level of conservation for each column. This display is especially useful when dealing with alignments containing a large number of sequences and/or long sequences. When comparing such large alignments, the information content display can be used to quickly scan along the alignment to search for, *e.g.*, high conservation areas (indicated as high information content regions). In Fig. 2, fully conserved columns (positions 51, 53, 148, 150, and 179) are readily identifiable by the full-height bars. In fact, these positions are part of the three conserved motifs shared among lipocalin proteins. These motifs (indicated as M1, M2, and M3 in Fig. 2) are described as "structurally conserved regions" (SCR1, 2, and 3, respectively) by Flower *et al*. (2000). SCR1 corresponds to the PROSITE lipocalin motif (PS00213; Hulo *et al.*, 2008).

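Several summary statistics are given at the top of the MSA Viewer window (Fig. 2). The following statistics are available:

- % gaps. The number of gap symbols within the alignment divided by the total number of characters within the alignment (alignment length times number of sequences). This should not be confused with the number of insertion/deletion events in the alignment since an individual event can span multiple positions.
- % conserved. The number of completely conserved columns divided by the total number of columns. A conserved column is defined as an un-gapped column containing a single type of character.
- % columns un-gapped. The number of un-gapped columns divided by the total number of columns.
- The histogram of character count per column. This histogram represents the gappiness of the MSA using a non-gap character frequency distribution (the inverse of the gap frequency distribution). For the lipocalin MSA, 73% of the columns have no gap (this is also shown as % columns un-gapped).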
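These summary statistics are simple counts over the alignment. The sketch below shows one way they could be computed from the definitions given here; the function name, the dictionary keys, and the toy alignment are hypothetical and are not SuiteMSA's own, and characters are compared case-sensitively as an additional assumption.

```python
def msa_summary(msa, gap_chars="-."):
    """Summary statistics for an alignment given as equal-length strings."""
    n_seq = len(msa)
    length = len(msa[0])
    gaps = sum(1 for seq in msa for ch in seq if ch in gap_chars)
    ungapped = conserved = 0
    # histogram[c] = number of columns holding exactly c non-gap characters
    histogram = [0] * (n_seq + 1)
    for k in range(length):
        residues = [seq[k] for seq in msa if seq[k] not in gap_chars]
        histogram[len(residues)] += 1
        if len(residues) == n_seq:  # un-gapped column
            ungapped += 1
            if len(set(residues)) == 1:  # a single type of character
                conserved += 1
    return {
        "pct_gaps": 100.0 * gaps / (n_seq * length),
        "pct_conserved": 100.0 * conserved / length,
        "pct_ungapped": 100.0 * ungapped / length,
        "char_count_histogram": histogram,
    }


# Hypothetical toy alignment of three sequences and five columns.
summary = msa_summary(["AC-GT", "ACAGT", "A-AGT"])
```

The last entry of the histogram counts the columns in which every sequence has a character, so it carries the same information as % columns un-gapped.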


Fig. 2. The alignment of eight protein sequences from the lipocalin family. The MSA Viewer is used to display the MSA aligned with the predicted secondary structures. Black thick lines marked with M1, M2, and M3 indicate the locations of the three conserved motifs. Symbols used for the secondary structure prediction are: H (green) for helix, C (cream) for coil, and E (brown) for beta-strand. The alignment statistics are shown above the MSA. The column information content is displayed as a blue bar chart at the bottom indicating the level of conservation for each column.


#### **3.1.2 Comparing two MSAs**

In Fig. 3A, we compared the previously shown lipocalin MSA (listed as 'Reference') with the MSA generated by ClustalW2 using the MSA Comparator. Under the blue selection bar and the green range bar, alignment positions are color-coded for consistency with respect to the reference MSA. Blue characters show where completely consistent columns are, and red characters depict those inconsistently aligned. Compared against the reference, the ClustalW2 MSA is more compact with very few gaps, making the alignment shorter (201 positions compared to 219 in the reference). We further examined the ClustalW2 MSA using the secondary structure display function of the MSA Viewer. As illustrated in Fig. 3B, the ClustalW2 MSA does not have the beta-strand regions (shown as brown-colored clusters of 'E' letters) aligned as well as the reference MSA does.

As mentioned earlier, the information content is the indicator of sequence divergence within a single MSA, and not a direct comparison between two alignments. However, as shown in Fig. 3A, the information content distributions (blue and green bar charts) can be compared between the alignments. It is especially useful when dealing with large alignments containing many/long sequences. On the other hand, SPS is the result of a direct comparison between two MSAs. The per-column SPS (brown bar chart) displayed in Fig. 3A clearly shows where the test alignment (ClustalW2 in this case) is consistent (and to what degree) with the reference.

Fig. 3. Comparison of the ClustalW2 MSA with the reference alignment of the lipocalin family. A. The two MSAs are compared using the MSA Comparator (the reference and ClustalW2 alignments shown at the top and bottom, respectively). The column SPS display (brown bar chart) is positioned between the two MSAs and is aligned to the ClustalW2 alignment. At the bottom of the column SPS display is the column score (CS) indicator. The un-gapped CS uses those columns marked with purple squares, and the gapped CS uses columns marked with both purple and red squares (small and large squares indicate 'considered' and 'matched' columns, respectively). Summary statistics shown above the reference alignment include: % consistency, SPS, and two types of CS. B. The MSA Viewer is used to generate the secondary structure representation for the reference and ClustalW2 MSAs (shown at the top and bottom, respectively). Symbols used for the secondary structure prediction are: H (green) for helix, C (cream) for coil, and E (brown) for beta-strand.

#### **3.1.3 Comparing multiple MSAs**

In Fig. 4, we compared MSAs produced by four methods against the reference lipocalin family MSA (MSA 1). Using the Pixel Plot, we can clearly see different patterns among the MSAs. The magenta-highlighted areas illustrate how the corresponding characters are aligned (or not) in each MSA. The PRALINE MSA (MSA 2) is fairly consistent with the reference MSA. This is expected since PRALINE uses secondary structure information when optimizing the alignments. On the other hand, MAFFT, MUSCLE, and ClustalW2 MSAs show a similar displacement of the same sequences, apparent from the ragged edges of the magenta areas. Alignments generated by MAFFT, MUSCLE, and ClustalW2 (MSAs 3-5) are roughly consistent with each other, but not consistent with the reference and PRALINE alignments. All four MSA methods tested produced shorter alignments (201-214 positions) compared to the reference alignment (219 positions). The shortest alignment was obtained from ClustalW2 (201 positions).

Fig. 4. Comparison of the lipocalin family reference MSA (MSA 1) with four reconstructed MSAs (PRALINE, MAFFT, MUSCLE, and ClustalW2). The Pixel Plot is used to show the alignment patterns with each non-gap character represented with a solid colored pixel and a gap with a blank pixel. Characters corresponding to those under the blue selection bar for the reference MSA are highlighted in magenta in all MSAs. The green range bars for MSAs 2-5 show the column ranges where corresponding characters are located.
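The idea behind the Pixel Plot and its highlighting can be sketched in a few lines. This is only a text-based approximation under stated assumptions, not SuiteMSA's rendering code: the function names and toy alignments are hypothetical, filled pixels are drawn as `#`, and column indices are 0-based.

```python
def pixel_rows(msa, gap_chars="-.", filled="#", blank=" "):
    """Text rendition of a pixel plot: one filled cell per residue, blank per gap."""
    return ["".join(blank if ch in gap_chars else filled for ch in seq)
            for seq in msa]


def residue_positions(msa, gap_chars="-."):
    """For each sequence, the column index of every residue, in order."""
    return [[k for k, ch in enumerate(seq) if ch not in gap_chars]
            for seq in msa]


def highlight_range(reference, other, start, end, gap_chars="-."):
    """Columns of `other` holding residues found in reference columns start..end.

    Returns the per-sequence column lists (the highlighted cells) and the
    overall (min, max) column span (the range bar), or None if nothing is selected.
    """
    ref_pos = residue_positions(reference, gap_chars)
    oth_pos = residue_positions(other, gap_chars)
    highlighted = []
    for ref_cols, oth_cols in zip(ref_pos, oth_pos):
        # Residue r sits in reference column k; look up its column in `other`.
        highlighted.append([oth_cols[r] for r, k in enumerate(ref_cols)
                            if start <= k <= end])
    flat = [c for cols in highlighted for c in cols]
    span = (min(flat), max(flat)) if flat else None
    return highlighted, span


# Hypothetical toy alignments of the same three sequences.
reference_msa = ["AC-GT", "ACAGT", "A-AGT"]
other_msa = ["ACG-T", "ACAGT", "AA-GT"]
rows = pixel_rows(other_msa)
cells, span = highlight_range(reference_msa, other_msa, 2, 3)
```

When the highlighted columns differ from sequence to sequence, as they do for the toy data, the highlighted area has the ragged edges described above, and the span is wider than the selection in the reference.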

#### **4. Aligning transmembrane protein sequences**

In the previous section, we showed that comparing MSAs and secondary structure predictions helps us assess the quality of MSAs. In this section, we will examine alignments of another type of protein, transmembrane proteins.

#### **4.1 G-protein coupled receptors**

G protein-coupled receptor (GPCR) proteins contain seven transmembrane (TM) regions. They constitute a large protein superfamily grouped into three major and several minor classes (Horn *et al.*, 2003; Vroling *et al.*, 2011). Although the TM regions are relatively constant in length (22-24 amino acids, or aa), the lengths of the N-/C-terminal and loop regions are highly variable, especially among different classes (Inoue *et al.*, 2004; Wistrand *et al.*, 2006). GPCR sequences are also highly divergent. These features make aligning GPCR sequences a challenge. We sampled 25 protein sequences from three major classes of GPCRs (Classes A, B, and C). The lengths of these GPCR sequences vary from 201 to 972 aa.

#### **4.2 Alignment of GPCR sequences**

Fig. 5 shows the alignment of the 25 GPCRs generated by PRALINE (showing only the area containing the first three TM regions). Since PRALINE incorporates information from secondary structure, TM structure, as well as profiles based on PSI-BLAST similarity search (Table 2), it is expected to perform well in aligning TM regions. In order to confirm this, TM regions were predicted for each of the 25 GPCR sequences using MEMSAT3 (http://bioinf.cs.ucl.ac.uk/psipred/; Nugent & Jones, 2009). The predicted TM structural information was then aligned with the PRALINE MSA. Fig. 5 shows that the predicted TM regions (depicted with 'X' in green) are clearly well aligned and visualized as green-colored clusters. The 'hydrophobicity' color scheme used for the MSA display as well as the average column hydrophobicity plot also confirm that more hydrophobic amino acids are found in predicted TM regions.

Fig. 5. Alignment of 25 GPCR proteins generated by PRALINE compared with TM structural predictions. Using the MSA Viewer, the PRALINE MSA is displayed using the 'hydrophobicity' color scheme showing hydrophobic amino acids more toward red and hydrophilic amino acids more toward blue. The predicted TM structure corresponding to each sequence of the MSA is aligned below. The symbols (based on MEMSAT3 prediction) used to show different TM structural components are as follows: 'X' (green) for the TM region, '+' (light brown) for the inside loop, 'I' (brown) for the inside helix cap, '=' (cream) for the outside loop, and 'O' (yellow) for the outside helix cap. The first three TM regions are depicted as three clusters of green letter X's. At the bottom of the display is the information content for each column (blue bar) and the average hydrophobicity for each column (black line plot). The average hydrophobicity (see equation (3)) is based on the index given by Kyte and Doolittle (1982).
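The average column hydrophobicity plotted in Fig. 5 can be illustrated with a minimal sketch. It assumes that the per-column value is the plain mean of the Kyte and Doolittle (1982) index over the non-gap residues in that column; equation (3) may weight or normalize differently. The function name `column_hydrophobicity` and the toy alignment are hypothetical and are not part of SuiteMSA.

```python
# Kyte-Doolittle (1982) hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def column_hydrophobicity(msa):
    """Mean hydropathy per column, ignoring gaps and non-standard symbols.

    Assumption: a simple unweighted mean. Columns with no scorable
    residue yield None.
    """
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*msa):
        values = [KD[ch.upper()] for ch in column if ch.upper() in KD]
        profile.append(sum(values) / len(values) if values else None)
    return profile


# Hypothetical toy alignment (not taken from the GPCR data set).
toy_msa = ["MLIV-K", "MLLVDK", "M-IAEK"]
profile = column_hydrophobicity(toy_msa)
```

Stretches of consecutive columns with high positive values would be candidates for TM regions, which is the pattern the black line plot in Fig. 5 is used to confirm.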


#### **4.3 Comparison of GPCR MSAs reconstructed by seven methods**

We aligned the 25 GPCR protein sequences using seven methods. The seven MSAs produced were compared using the Pixel Plot in Fig. 6. Compared to the terminal or loop regions, the seven TM regions are expected to have fewer gaps. Using the Pixel Plot we can confirm such patterns. Approximate areas predicted to have TM regions can be located as clusters of solid colored pixels. In Fig. 7, the seven MSAs are represented using the predicted TM structures. The area shown includes the first five TM regions, which appear as green-colored clusters. Both PRALINE and PROMALS utilize information from secondary structure prediction (also TM prediction for PRALINE) as well as profiles based on PSI-BLAST similarity search (Table 2). As expected, predicted TM structures are aligned better in the MSAs reconstructed by these two methods than in those from the other methods. The other methods, with the exception of PRANK, also generated MSAs that aligned the area containing the first three TM regions relatively well. The rest of the sequences were more difficult to align. Probalign also had difficulty reconstructing the third TM region. None of the MSA methods reconstructed the positions after the third TM region well in terms of conservation of TM regions. The difficulty in aligning the second half of the protein sequences is likely caused by the large length variation found among GPCR classes, especially in the fourth and fifth loops (between TM4 and TM5, and TM5 and TM6, respectively) (Wistrand *et al.*, 2006).

To gain more quantitative insight into the differences among the GPCR protein MSAs, we gathered SPS values from all pairwise comparisons among the seven MSAs. Each of the seven MSAs was used as the reference, and the other six MSAs were tested against it. Fig. 8 clearly shows that SPS is not symmetric. As expected, PRALINE and PROMALS, both of which utilize secondary structure and TM prediction information, had high SPS values when compared with each other (0.546 and 0.543). Interestingly, using PRALINE or PROMALS as the reference, MAFFT was found to perform very well although MAFFT incorporates neither secondary structure nor profile information. It should also be noted that the SPS values are among the highest when Probalign was compared to MAFFT (either as the reference or the test MSA).

The most drastic difference between the row and column averages of SPS values is found in PRANK. The SPS values obtained when the PRANK MSA was used as the reference (shown in the PRANK column) are all higher than those obtained when the PRANK MSA was tested against others (shown in the PRANK row). This can be explained by the gappy nature of the PRANK MSA (see Figs. 6 and 7). The PRANK MSA tends to have more gaps because of the underlying design of the method. It attempts to identify distinct insertions and deletions and tries not to collapse such independent events into the same column. For the same set of sequences, a reference alignment that has more gaps has fewer character pairs available (the denominator in equation (6)) when averaging the total SPS, which tends to generate a higher SPS. Note also that the phylogeny-aware algorithm used with PRANK cannot perform well when sequences are too diverged (Löytynoja & Goldman, 2008). With extremely diverged GPCR sequences, PRANK was not expected to perform very well, and this was reflected in the consistently low SPS values obtained with PRANK. In the absence of a 'true' reference alignment, low SPS values do not necessarily indicate an incorrect alignment but rather inconsistency between the alignments. Nevertheless, virtually no TM region was conserved in the PRANK MSA (Figs. 6 and 7). We examine PRANK further in the next section.
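The asymmetry of SPS described above can be reproduced with a small sketch. It assumes the common form of the sum-of-pairs score, namely the fraction of residue pairs aligned in the reference that are also aligned in the test MSA; equation (6) is taken to follow this form, but its exact normalization is not restated here. The helper names and the two toy alignments are hypothetical.

```python
from itertools import combinations


def aligned_pairs(msa, gap="-"):
    """Set of residue pairs that share a column.

    Each residue is identified by (sequence index, residue index), so the
    same pair can be recognized in a differently gapped alignment.
    """
    pairs = set()
    next_residue = [0] * len(msa)
    for column in zip(*msa):
        present = []
        for seq_idx, ch in enumerate(column):
            if ch != gap:
                present.append((seq_idx, next_residue[seq_idx]))
                next_residue[seq_idx] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sps(test, reference, gap="-"):
    """Fraction of reference pairs recovered by the test MSA (assumed form)."""
    ref_pairs = aligned_pairs(reference, gap)
    if not ref_pairs:
        raise ValueError("reference alignment contains no aligned pairs")
    return len(ref_pairs & aligned_pairs(test, gap)) / len(ref_pairs)


# Two hypothetical alignments of the same two sequences (MKLV and MRLV).
compact = ["MKLV", "MRLV"]    # K and R placed in the same column
gappy = ["MK-LV", "M-RLV"]    # K and R kept in separate columns

score_gappy_as_reference = sps(compact, gappy)    # 3 of 3 reference pairs found
score_gappy_as_test = sps(gappy, compact)         # 3 of 4 reference pairs found
```

In this toy case the gappier alignment offers only three character pairs as a reference, all of which the compact alignment recovers, so the score is higher when the gappy MSA is the reference than when it is the test. This mirrors the PRANK column and row pattern in Fig. 8.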

Fig. 6. Comparison of seven GPCR protein MSAs. The characters corresponding to the second TM (TM2) regions are highlighted with magenta-colored pixels. It shows that the TM2 region is reconstructed well in the MSAs by PRALINE (MSA 1) and PROMALS (MSA 2), relatively well by MAFFT (MSA 3), MUSCLE (MSA 4) and ClustalW2 (MSA 5), and not very well by Probalign (MSA 6) and PRANK (MSA 7). For the PRALINE MSA, the positions for the seven TM regions are as follows: 555-572, 595-611, 641-663, 683-703, 738-756, 813-830, and 854-875, which roughly correspond to solid-colored regions.

Fig. 7. Seven GPCR protein MSAs represented in TM structures predicted by MEMSAT3. The areas covering the first five predicted TM regions are shown. The red boxes indicate the areas containing amino acids predicted for the second TM regions. Wider red boxes in reconstructed MSAs indicate TM regions with a higher number of gaps (*e.g.*, Probalign and PRANK). The symbols representing amino acids predicted for different TM structural components are as follows: 'X' (green) for the TM region, '+' (light brown) for the inside loop, 'I' (brown) for the inside helix cap, '=' (cream) for the outside loop, and 'O' (yellow) for the outside helix cap.



| Test (rows) / Reference (columns) | PRALINE | PROMALS | Probalign | MAFFT | MUSCLE | ClustalW2 | PRANK | [Average] |
|---|---|---|---|---|---|---|---|---|
| **PRALINE** | (1,077) | 0.543 | 0.563 | **0.592** | 0.474 | **0.381** | 0.351 | **0.484** |
| **PROMALS** | 0.546 | (1,078) | 0.538 | 0.568 | **0.500** | 0.345 | 0.344 | 0.474 |
| **Probalign** | 0.477 | 0.455 | (1,111) | 0.520 | 0.457 | 0.324 | **0.362** | 0.433 |
| **MAFFT** | **0.569** | **0.546** | **0.590** | (1,081) | 0.498 | 0.346 | 0.357 | **0.484** |
| **MUSCLE** | 0.462 | 0.487 | 0.526 | 0.505 | (1,445) | 0.325 | 0.322 | 0.438 |
| **ClustalW2** | 0.383 | 0.346 | 0.384 | 0.362 | 0.335 | (1,041) | 0.283 | 0.349 |
| **PRANK** | 0.231 | 0.227 | 0.227 | 0.245 | 0.218 | 0.186 | (2,419) | 0.232 |
| **[Average]** | 0.445 | 0.434 | 0.471 | **0.465** | 0.414 | 0.318 | 0.337 | |

Fig. 8. Pairwise comparison of the sum-of-pairs scores (SPS) between GPCR protein MSAs reconstructed by the seven methods. The numbers in parentheses are the alignment lengths (the number of columns in each alignment). The highest score in each comparison is shown in boldface.

#### **5. A different perspective on gaps**

In this section we highlight the alignment method PRANK, which is unique in emphasizing a different perspective on the evolutionary process producing insertions and deletions. As shown in the previous section, it tends to produce more gaps in alignments compared to other methods. We compare the alignment generated by PRANK with those generated by four other methods.

#### **5.1 Viral envelope glycoprotein, gp120**

Löytynoja and Goldman (2008) used the viral exterior envelope glycoprotein, gp120, from human and simian immunodeficiency viruses (HIVs and SIVs, respectively) as an example to demonstrate how PRANK works. In this section, we used the same set of sequences they used (the seed alignment of Pfam Family GP120, PF00516, excluding SIVGB, SIVV1, and SIVG1). The entry of HIVs and SIVs into the host requires the interaction of the viral gp120 with the cell-surface proteins of the host. In order to avoid the host's immune system, several regions of the gp120 proteins evolve fast. Fig. 9 shows the MSA of gp120 proteins compared with the predicted secondary structures.

#### **5.2 Gap treatment**

The 'gap' within an alignment is a general expression for two very different types of evolutionary events. It represents either an insertion of one or more characters or the deletion of one or more. Both types of events are unobservable, and as such it is difficult to distinguish which event creates a gap in an alignment. For example, the 'gappy' section of an alignment, such as the V1 section of HIV/SIV gp120 (Fig. 9), can be interpreted either as the result of a high substitution rate along with frequent independent deletions or as the result of frequent independent short insertions and deletions. Optimization functions used in most MSA methods over-infer the former scenario, stacking independent insertions in the same column and potentially erroneously inflating substitution rates in such regions. Using phylogenetic information, PRANK, on the other hand, allows for the inference of both deletions and insertions as separate events.

Fig. 9. Reference MSA of HIV/SIV gp120 proteins. The reference alignment was based on the seed alignment of Pfam Family GP120 (PF00516). The secondary structure is predicted for each sequence by PSIPRED (Jones, 1999) and is displayed using the MSA Viewer. The symbols depicting different secondary structures are as follows: H (green): alpha-helix, E (brown): beta-strand, and C (cream): coil. The MSA region shown above contains the N-terminal conserved area and the V1 variable region (positions 100-150). The predicted regions of the beta-strands (positions 50-53 and 93-96) and the alpha-helix (positions 66-78) correspond to the known HIV gp120 protein structure.

Progressive alignment methods such as ClustalW2, MAFFT, and MUSCLE build an alignment by aligning the profiles of previously aligned sequences. The presence of a gap in the profiles is not checked to determine if the addition of another gap is parsimonious with the guide tree. The decision whether to add a gap is based on the optimization function score. Since the inference of additional gaps penalizes the optimization function score, it often results in incorrectly matching potentially independent insertions, creating incorrect homologies.

PRANK attempts to avoid the above-mentioned pitfalls in progressive alignments by utilizing "phylogeny-aware" handling of gaps and treating insertions and deletions differently. The overall effect of the PRANK method compared to other progressive alignment methods is that the alignment is extended due to the separation of the independent insertions. As Löytynoja and Goldman (2008) stated, "the resulting alignments may be fragmented by many gaps and may not be as visually beautiful as the traditional alignments, but if they represent correct homology, we have to get used to them."

#### **5.3 Comparison of the PRANK MSA with others**

We aligned the set of 21 gp120 sequences using PRANK and four other alignment methods. For PRANK, we reconstructed the phylogeny using PhyML 3.0 (Guindon *et al.*, 2010) and used it as the input phylogeny (with rooting between the HIV1 and HIV2 clusters, the topology was identical to the one given in Löytynoja & Goldman, 2008). As shown in Fig. 10, the major differences among MSAs are found starting at the first highly variable area, V1. Within this area, PRANK infers far more insertions than the other methods. The number of sites covered by the blue selection bar in the reference alignment (MSA 1) is 53. The corresponding sites in the other alignments are spread over between 51 columns with MAFFT (MSA 5) and 84 columns with PRANK (MSA 2).

Fig. 10. Comparison of gp120 MSAs. The Pixel Plot is used to compare five reconstructed MSAs (MSA 2: PRANK, MSA 3: PROMALS, MSA 4: Probalign, MSA 5: MAFFT, and MSA 6: MUSCLE) with the reference alignment (MSA 1, based on the seed alignment of Pfam Family GP120, PF00516). The area highlighted in magenta color is part of the V1 variable region, where the patterns show that the PRANK MSA is highly inconsistent with other MSAs.

Table 3 summarizes alignment statistics. As expected, PRANK generated the longest alignment. This is reflected in the PRANK MSA having a higher % gaps, lower % consistency, and lower % no-gap columns. Note also that the reference alignment used was the Pfam seed alignment, which in principle was generated using an alignment strategy similar to methods other than PRANK. These comparisons clearly illustrate the point made by Löytynoja and Goldman (2008). Depending on the MSA method used, a very different evolutionary mechanism would be emphasized to explain fast evolving gp120 sequences: either accelerated substitution rates or an extremely high rate of short insertions or deletions. Another important point is that scores devised for MSA comparison (*e.g.*, SPS) should be used with knowledge of the assumptions underlying the design of the method used as well as the nature of the reference alignment.

| Method | SPS | CS with no-gap | CS with gaps | % consistency | MSA length (aa) | % conserved | % gaps | % no-gap columns |
|---|---|---|---|---|---|---|---|---|
| Reference | - | - | - | - | 563 | 14.90 | 14.50 | 71.20 |
| PRANK | 0.872 | 0.845 | 0.702 | 53.08 | 633 | 13.60 | **23.90** | 61.30 |
| PROMALS | 0.919 | 0.855 | **0.775** | **68.24** | 548 | 15.30 | 12.10 | 76.60 |
| Probalign | **0.920** | 0.838 | 0.761 | 62.50 | 579 | 15.00 | 16.80 | 72.70 |
| MAFFT | 0.907 | 0.827 | 0.727 | 63.90 | 557 | 15.40 | 13.60 | 74.70 |
| MUSCLE | 0.910 | **0.926** | 0.750 | 66.20 | 548 | **15.50** | 12.10 | **77.60** |

Table 3. Alignment statistics for gp120 MSAs. SPS, CS, and % consistency are obtained against the reference alignment. The highest value in each comparison is shown in boldface.
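Descriptive statistics of the kind listed in Table 3 can be sketched as follows. The definitions used here are assumptions made for illustration: % gaps is the share of gap characters among all cells, % no-gap columns is the share of columns without any gap, and % conserved is the share of gap-free columns holding a single residue type. The definitions used to produce Table 3 may differ, and the function name `msa_stats` is hypothetical.

```python
def msa_stats(msa, gap="-"):
    """Length and simple gap/conservation percentages for one MSA.

    Assumed definitions (see the lead-in); not necessarily those of Table 3.
    """
    n_cols = len(msa[0])
    if n_cols == 0 or any(len(row) != n_cols for row in msa):
        raise ValueError("alignment must be non-empty and rectangular")
    columns = list(zip(*msa))
    gap_cells = sum(column.count(gap) for column in columns)
    no_gap_columns = [column for column in columns if gap not in column]
    conserved = [column for column in no_gap_columns if len(set(column)) == 1]
    return {
        "length": n_cols,
        "pct_gaps": 100 * gap_cells / (n_cols * len(msa)),
        "pct_no_gap_columns": 100 * len(no_gap_columns) / n_cols,
        "pct_conserved": 100 * len(conserved) / n_cols,
    }


# Hypothetical toy alignment of three short sequences.
stats = msa_stats(["MK-LV", "M-RLV", "MKRLV"])
```

An alignment that separates independent insertions into their own columns, as PRANK does, would be expected to show a larger `length` and `pct_gaps` and a smaller `pct_no_gap_columns` than a more compact alignment of the same sequences.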

Fig. 11 illustrates a comparison between the MSAs generated by MAFFT (top) and PRANK (bottom). In Fig. 11A, the region under the blue selection bar for the MAFFT MSA (positions 104 to 128; 25 aa long) is more compact than the region covered by the corresponding amino acids in the PRANK MSA, as indicated by the long green range bar (ranging from positions 106 to 159; 54 aa long). In this region, PRANK shows, for example, two independent insertions marked with light blue boxes ('GL' and 'MIR'), both occurring in SIV/HIV2 sequences. In the MAFFT MSA, these two sequences are part of a much longer insertion region unique to SIV/HIV2, implying that frequent deletion events shortened this region in various HIV2 sequences. Another insertion found in HIV1 by PRANK, 'SSSLR' (in a light green box), is shown to be almost independent. However, in the MAFFT MSA, the corresponding region appears to have experienced many deletion events instead. This shows the "gap magnet" phenomenon found in many progressive alignment methods. Fig. 11B, from the same MSA area, highlights another possible artifact often found in MSAs generated by progressive alignment methods. In the red area in the MAFFT MSA, all sequences are aligned (matched), generating "collapsed insertions" and implying homologous relationships among these sequences. However, in the PRANK MSA, the corresponding sequences are spread out over a wide range of columns. These examples show that the inferred evolutionary scenarios can be completely different depending on the alignment methods used to analyze sequences.
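The relationship between a selection bar in one MSA and the range bar in another can be sketched in a few lines. This is only an illustration of the idea and not the SuiteMSA implementation: residues lying inside a chosen column range of one alignment are located, by residue index, in a second alignment of the same sequences, and the smallest and largest columns they occupy are reported. Function names are hypothetical, and columns are 0-based here, whereas the positions quoted in the text are 1-based.

```python
def residue_columns(row, gap="-"):
    """Column index of each residue of one aligned sequence, in order."""
    return [col for col, ch in enumerate(row) if ch != gap]


def map_column_range(msa_a, msa_b, start, end, gap="-"):
    """Columns of msa_b covered by residues in columns start..end of msa_a.

    Both alignments must list the same sequences in the same order.
    Returns (first_column, last_column), or None if the range holds only gaps.
    """
    hits = []
    for row_a, row_b in zip(msa_a, msa_b):
        cols_a = residue_columns(row_a, gap)
        cols_b = residue_columns(row_b, gap)
        if len(cols_a) != len(cols_b):
            raise ValueError("rows do not contain the same number of residues")
        hits.extend(cb for ca, cb in zip(cols_a, cols_b) if start <= ca <= end)
    return (min(hits), max(hits)) if hits else None


# Hypothetical toy alignments of the same two sequences.
compact = ["MKLV", "MRLV"]
spread = ["MK-LV", "M-RLV"]

covered = map_column_range(compact, spread, 1, 2)
```

Here a two-column selection in the compact alignment corresponds to a three-column range in the more spread-out one, the same kind of widening seen between the MAFFT and PRANK MSAs in Fig. 11A.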

#### **6. Using simulated sequences for testing MSA methods**

In this section we will discuss the use of simulation data in the comparison of alignment methods. The advantage of using simulated sequences is the availability of the 'true' alignment. In the simulation example discussed in this section, we simulate two sets of eight lipocalin sequences described in Section 3. The lipocalin protein family has a common structural feature, a single eight-stranded antiparallel beta-barrel. Its members also share three conserved motifs. We will use the simulation program indel-Seq-Gen version 2.1 (iSGv2.1; Strope *et al.*, 2009) to simulate proteins of this lipocalin family. iSGv2.1 is included in the SuiteMSA package, and the simulation can be run from its graphical user interface.

Fig. 11. Comparison of gp120 alignment regions generated by MAFFT and PRANK. For both panels A and B, the MAFFT MSA is used as the reference (top) and the PRANK MSA (bottom) is compared against it. The blue selection bar for the MAFFT MSA shows the alignment area selected, and the green range bar for the PRANK MSA shows the column range where corresponding amino acids are found in the MSA. The corresponding amino acids in the two MSAs are shown in red ('red' indicates that the alignment of these characters is inconsistent between the MSAs). See the main text for the description of the sequences marked with light blue and light green boxes.

#### **6.1 Setting up the iSGv2.1 simulation**

An iSGv2.1 simulation requires a guide tree and a root sequence or MSA. By providing a root MSA, instead of generating a random root sequence, the site-specific amino acid (or nucleotide) frequency distribution derived from each MSA column can be used to generate a simulation root sequence (for details, refer to the iSG user manual). For the root MSA, we used the 8-protein alignment of the lipocalin family described in Section 3. The evolutionary parameters chosen to simulate the lipocalin protein family are listed below. We performed two simulations, the second more divergent than the first. For any parameters not mentioned, default values were used. Three input files were prepared: a guide tree file, a lineage file, and a root MSA file. For details on preparing the guide tree, the three motifs used, and how to set up the length-limitation template, refer to Strope *et al.* (2009) as well as Anderson *et al.* (2011). All input files used for this simulation are available from: http://bioinfolab.unl.edu/~canderson/SuiteMSA/supplement.html.

#### i. Basic parameters

- Guide tree file: lipo8\_3.tre (provides the guide tree and option parameters listed below)
- Lineage file: lipo8\_3.spec (provides the motif and template information)
- Branch scale: 0.5 (first simulation), 2.0 (second simulation)
- Substitution model: PAM

#### ii. Advanced parameters

- Random number seed: 6262
- Use root msa file: lipo8\_3template.root\_in

#### iii. Guide tree options (information included in the guide tree file)

- Maximum indel length: 10
- Insertion probability = deletion probability = 0.02 (first simulation), 0.025 (second simulation)
- Indel length distribution = deletion length distribution: file name = inDL (provides indel length distribution)
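The idea of deriving a root sequence from the site-specific frequencies of a root MSA can be illustrated with a short sketch. This is not the iSGv2.1 implementation: the function names are hypothetical, the toy alignment is invented, gap characters are simply ignored when counting (an assumption), and the seed value is borrowed from the parameter list only for illustration.

```python
import random
from collections import Counter


def column_frequencies(msa):
    """Per-column residue frequencies of an MSA, ignoring gap characters."""
    freqs = []
    for col in zip(*msa):
        counts = Counter(ch for ch in col if ch != "-")
        total = sum(counts.values())
        freqs.append(
            {aa: n / total for aa, n in counts.items()} if total else {}
        )
    return freqs


def sample_root(freqs, rng):
    """Draw one residue per column according to the column frequencies."""
    residues = []
    for dist in freqs:
        if not dist:
            continue  # column made only of gaps: nothing to sample
        aas = sorted(dist)
        weights = [dist[aa] for aa in aas]
        residues.append(rng.choices(aas, weights=weights, k=1)[0])
    return "".join(residues)


# Toy root MSA (not the lipocalin alignment).
toy_msa = ["MKV-L", "MRVAL", "MKIAL"]
freqs = column_frequencies(toy_msa)
root = sample_root(freqs, random.Random(6262))
```

Fully conserved columns always yield the same residue, whereas variable columns yield residues in proportion to their observed frequencies, so the sampled root reflects the site-specific composition of the family.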

After running each simulation, we obtained a set of eight simulated sequences, the true alignment of the eight sequences, and a record of all insertion and deletion events. As shown in Fig. 12, the 'true' alignments from both simulations (the first more conserved and the second more diverged) maintained the three conserved lipocalin motifs (M1, M2, and M3) specified in the simulations. As expected, the second MSA, derived from sequences simulated with a higher substitution rate (longer branch lengths) and a higher indel probability, is 88 aa longer (Fig. 12B, 303 aa) than the first MSA (Fig. 12A, 215 aa). We used these 'true' alignments as the references for the next analysis.

#### **6.2 Comparison of MSA reconstruction using simulated sequences**

We used four MSA methods (MAFFT, MUSCLE, Probalign, and PRANK) to align both sets of the eight simulated sequences. For PRANK, the simulation guide trees (with branch lengths scaled for the 'more conserved' and 'more divergent' simulations) were used as the input phylogenies. In Fig. 13, the Pixel Plot is used to compare the reconstructed MSAs against the reference MSAs (the true alignments obtained from the two simulations). Tables 4 and 5 summarize the alignment statistics for the two sets of simulated data.


Fig. 12. 'True' alignments of the two sets of simulated lipocalin protein sequences (A: more conserved and B: more divergent simulations). Both alignments clearly show that the three motifs (M1, M2, and M3) are conserved among these two sets of simulated protein sequences.



Fig. 13. Comparison of simulated lipocalin protein MSAs. The Pixel Plot is used to compare four reconstructed MSAs with the reference alignments (A: more conserved and B: more divergent simulations). The 'true' alignments obtained from the simulations are used as the reference alignment (MSA 1). M1 (red box), M2 (blue box), and M3 (green box) show the location of three conserved regions. The regions highlighted in magenta show an example of inconsistent alignments found in reconstructed alignments relative to the true reference alignments. MSA methods used are MAFFT (MSA 2), MUSCLE (MSA 3), Probalign (MSA 4), and PRANK (MSA 5).

Fig. 13A shows that all four methods produced highly consistent MSAs for the sequences obtained from the more conserved simulation. While two of the three conserved motifs were identified correctly in all MSAs, all reconstructed MSAs contained gaps in the region of the first motif (M1). The consistently high SPS values (0.91 to 0.93, Table 4) indicate that all methods performed very well. The proportion of gaps is also consistent between all reconstructed MSAs and the reference (~10%, Table 4).
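For readers who want to see how scores of this kind can be computed, the following is a minimal sketch of a sum-of-pairs score (SPS) and a column score (CS) against a reference alignment. It uses the common textbook definitions and is not the code behind the tables in this chapter. The function names are hypothetical, and the `skip_gapped` flag is only one plausible reading of the 'with gaps' and 'no-gap' variants of CS.

```python
from itertools import combinations


def column_signatures(msa):
    """For each column, a tuple giving the residue index of every sequence
    in that column (None where the sequence has a gap)."""
    counters = [0] * len(msa)
    signatures = []
    for col in zip(*msa):
        sig = []
        for i, ch in enumerate(col):
            if ch == "-":
                sig.append(None)
            else:
                sig.append(counters[i])
                counters[i] += 1
        signatures.append(tuple(sig))
    return signatures


def aligned_pairs(msa):
    """Set of residue pairs that share a column in the MSA."""
    pairs = set()
    for sig in column_signatures(msa):
        present = [(i, r) for i, r in enumerate(sig) if r is not None]
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, ref):
    """Fraction of residue pairs aligned in ref that are also aligned in test."""
    ref_pairs = aligned_pairs(ref)
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


def column_score(test, ref, skip_gapped=False):
    """Fraction of reference columns reproduced exactly in test.
    With skip_gapped=True, reference columns containing gaps are ignored."""
    test_cols = set(column_signatures(test))
    ref_cols = [
        sig for sig in column_signatures(ref)
        if not (skip_gapped and None in sig)
    ]
    return sum(1 for sig in ref_cols if sig in test_cols) / len(ref_cols)


# Toy data (not from the chapter): the test alignment shifts one residue
# into the column where the reference has a gap.
reference = ["ACGT-", "AC-TA"]
reconstructed = ["ACGT-", "ACT-A"]

sps = sum_of_pairs_score(reconstructed, reference)
cs_all = column_score(reconstructed, reference)
cs_no_gap = column_score(reconstructed, reference, skip_gapped=True)
```

Because a single shifted residue breaks one aligned pair but changes two columns, CS drops faster than SPS in this toy case, which is consistent with CS values being lower than SPS values in the tables.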



| Method | SPS | CS with no-gap | CS with gaps | % consistency | MSA length (aa) | % conserved | % gaps | % no-gap columns |
|---|---|---|---|---|---|---|---|---|
| Reference | - | - | - | - | 215 | 7.4 | 10.9 | 74.9 |
| MAFFT | **0.933** | **0.850** | **0.782** | **72.77** | 213 | 7.0 | 10.0 | 75.1 |
| MUSCLE | 0.921 | 0.817 | 0.751 | 70.28 | 212 | 7.1 | 9.6 | **77.4** |
| Probalign | 0.912 | 0.787 | 0.704 | 63.76 | 218 | **7.8** | **12.1** | 73.9 |
| PRANK | 0.932 | 0.826 | 0.776 | 71.5 | 215 | 7.5 | 10.5 | 75.2 |

Table 4. Alignment statistics for the simulated 'more conserved' lipocalin family MSAs. SPS, CS, and % consistency are obtained against the reference alignment. The highest value in each comparison is shown in boldface.

| Method | SPS | CS with no-gap | CS with gaps | % consistency | MSA length (aa) | % conserved | % gaps | % no-gap columns |
|---|---|---|---|---|---|---|---|---|
| Reference | - | - | - | - | 303 | 2.00 | **39.70** | 39.60 |
| MAFFT | 0.533 | **0.481** | 0.263 | 20.64 | 252 | 2.40 | 27.50 | 42.90 |
| MUSCLE | 0.532 | 0.421 | 0.314 | **28.11** | 219 | 2.70 | 16.60 | **63.00** |
| Probalign | **0.585** | 0.465 | **0.321** | 23.56 | 258 | 2.30 | 29.20 | 49.20 |
| PRANK | 0.504 | 0.395 | 0.264 | 24.88 | 213 | **2.80** | 14.30 | 60.60 |

Table 5. Alignment statistics for the simulated 'more divergent' lipocalin family MSAs. SPS, CS, and % consistency are obtained against the reference alignment. The highest value in each comparison is shown in boldface.

In Fig. 14, we compared the PRANK and MAFFT alignments in more detail using the MSA Comparator. This is the same region highlighted in magenta in Fig. 13A. The alignment columns that are fully consistent between the reference and PRANK (Fig. 14A) or MAFFT (Fig. 14B) are shown in blue. Black characters, on the other hand, indicate inconsistently aligned columns. For example, the characters contained in the red square in the reference MSA are aligned exactly the same way in the PRANK MSA. However, as shown in Fig. 14B for the MAFFT MSA, the gap in column 35 (in the reference) is filled with characters shifted from the left. The Pixel Plot in Fig. 13 shows that the same shifting and filling of the gap happened in all but the PRANK MSA. This demonstrates the "gap magnet" phenomenon described in the previous section. With simulated data, we know the origins of the gaps. In Fig. 14, '-' characters in yellow cells are derived from deletion events, whereas characters in green cells are derived from insertion events. Therefore, stacking up the 'QVD' sequences and avoiding the insertion of gaps, as done by MAFFT (Fig. 14B), is evolutionarily incorrect. With commonly used affine gap penalty systems, opening a new gap is penalized much more heavily than extending an existing gap. This effect is reinforced in progressive alignment methods, as the example in Fig. 14B clearly illustrates. Using its "phylogeny-aware" gap handling, PRANK was able to align these gaps correctly.
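The effect of an affine gap penalty on gap placement can be made concrete with a small calculation. The sketch below uses one common convention, in which a gap of length n costs one opening penalty plus an extension penalty for each additional position. The function name and the penalty values are illustrative assumptions and do not correspond to the defaults of any program discussed here.

```python
def affine_gap_cost(gap_lengths, gap_open=10.0, gap_extend=0.5):
    """Total penalty for a set of gaps under an affine scheme:
    each gap of length n costs gap_open + gap_extend * (n - 1)."""
    return sum(
        gap_open + gap_extend * (n - 1) for n in gap_lengths if n > 0
    )


# Six gap positions placed either as one gap or as two separate gaps.
one_long_gap = affine_gap_cost([6])
two_short_gaps = affine_gap_cost([3, 3])
```

With these illustrative values, a single gap of six positions costs 12.5, whereas two independent gaps of three positions cost 22.0. An aligner minimizing this penalty therefore tends to merge independent indel events into one existing gap, which is the "gap magnet" behavior seen in Fig. 14B.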

Fig. 14. Comparisons of PRANK (A) and MAFFT (B) alignments against the reference alignment (the simulated 'true' alignment). The region is taken from the area highlighted in magenta in Fig. 13A. The MSA Comparator is used to show the actual insertion (marked with green) and deletion (marked with yellow) events in the reference alignment. These events are traced during the iSGv2.1 simulation.

When the divergence level was much higher, as shown in Fig. 13B, all methods could still identify all three conserved motif sites. However, all MSAs were highly inconsistent within the unconstrained areas, and the SPS values are markedly lower (0.50 to 0.59, Table 5). For this dataset, PRANK produced the most inconsistent MSA, which was expected because PRANK is recommended for aligning closely related sequences. It should also be noted that there is little agreement among the alignments. This indicates that, regardless of the statistics used, no single method can be considered ideal. Using multiple methods is recommended so that a selection of alignment hypotheses can be used to generate a more robust hypothesis.

#### **7. Aligning ribosomal DNA sequences**

We have so far concentrated our discussion on protein sequence alignments. In order to obtain the full picture of alignment issues, in this section, we will examine the alignment of ribosomal DNA (rDNA) sequences.

#### **7.1 Small-subunit ribosomal DNA sequences and secondary structure**

The ribosomal RNA genes contain large stretches of highly conserved sites (stem or knot binding sites) interspersed with regions of varying sites (loop regions). These two types of regions within the gene have different information content due to strong selective constraints on the secondary structures and function within the stem and knot areas *versus* very weak constraints on the loop area. Fig. 15 shows a predicted secondary structure of the small-subunit ribosomal RNA (or 18S rRNA) from a parasitic protozoan, *Toxoplasma gondii*, a member of the family Sarcocystidae (Phylum Apicomplexa; Class Conoidasida; Subclass Coccidia).

Figs. 16 and 17 show part of the 18S rDNA MSA of 60 Coccidia species (D. A. Morrison personal communication; Morrison, 2009a). As shown in Fig. 16, stem regions are highly conserved. This alignment illustrates the high level of conservation found in approximately 45% of the 18S rDNA alignment. On the other hand, large loop regions such as the one shown in Fig. 15 have much lower functional constraints. As shown in Fig. 17, sequences of such regions are highly variable, and alignment reconstruction of such regions often requires laborious manual adjustment, iteratively incorporating information from the predicted rRNA secondary structures (Morrison, 2009a).

Fig. 15. Predicted secondary structure of the *Toxoplasma gondii* 18S rRNA. The secondary structure was obtained from the Comparative RNA Web Site (http://www.rna.ccbb.utexas.edu/; Cannone *et al.*, 2002). The callouts 'S' and 'V' show the 'stem' and the large 'loop' regions, respectively. Their sequence-structure alignments are shown at the bottom (the orange arrows pointing to the beginning and ending of the regions).

Fig. 16. Alignment of a highly conserved stem region of 18S rDNA from 60 Coccidia species. Using the MSA Viewer, the rRNA secondary structure information from *T. gondii* is displayed below the alignment. This alignment corresponds to the region in the callout 'S' shown in Fig. 15. The sites that are considered to be ambiguously aligned for this family are indicated by a red 'A' in the structural representation. These positions do not appear in the *T. gondii* structure. The alignment was provided by D. A. Morrison.

#### **7.2 Comparison of 18S rDNA MSA reconstruction**

We generated the alignments of full 18S rDNA sequences using four MSA methods (Probalign, MAFFT, MUSCLE, and ClustalW2). Using the above-mentioned alignment provided by D. A. Morrison as the reference, we compared the performance of the MSA methods. The alignment statistics are summarized in Table 6.

| Method | SPS | CS with no gap | CS with gaps | % consistency | MSA length (nuc) | % conserved | % gaps | % no-gap columns |
|---|---|---|---|---|---|---|---|---|
| Reference | - | - | - | - | 2095 | 50.60 | 15.80 | 62.40 |
| Probalign | **0.953** | 0.919 | 0.863 | 65.96 | 2389 | 44.30 | **26.10** | 55.60 |
| MAFFT | 0.950 | 0.898 | 0.844 | 73.08 | 2088 | 50.10 | 15.50 | **64.10** |
| MUSCLE | 0.950 | 0.917 | **0.867** | 73.77 | 2116 | 50.20 | 16.60 | 63.00 |
| ClustalW2 | 0.948 | **0.946** | 0.855 | **75.23** | 2055 | **51.40** | 14.10 | 62.50 |

Table 6. Alignment statistics for the 18S rDNA MSAs. SPS, CS, and % consistency are obtained against the reference alignment. The highest value in each comparison is shown in boldface.


Fig. 17. Alignment of a highly variable loop region of 18S rDNA from 60 Coccidia species. One of the secondary structures used to refine the alignment was from *T. gondii*. This structure is displayed below the alignment. This alignment corresponds to the region in the callout 'V' shown in Fig. 15. The alignment was provided by D. A. Morrison.

All four methods appear to have produced alignments highly consistent with the reference. This is most likely due to the highly conserved stem or functional regions that cover almost 50% of the sequence. Such consistency is reflected by the high CS values, particularly when gapped columns are excluded (CS with no gap), and by the small differences in CS values among MSAs. ClustalW2 has the highest un-gapped CS, indicating that it has the highest number of columns that match the reference alignment, and likewise the highest % consistency. While the ClustalW2 MSA is the shortest (2,055 nucleotides), the longest and most 'gappy' MSA was obtained by Probalign. Similar trends are found in the other examples described in this chapter. Note, however, that the Probalign MSA had the highest SPS (0.953) and the second highest 'CS with gaps' (0.863; MUSCLE had a slightly better score, 0.867).

Now let us visually examine these alignments. Keep in mind that the sequences are highly conserved, and that phylogenetic information will be derived mainly from the regions that are sufficiently variable. In Fig. 18, the Pixel Plot was used to compare the four reconstructed MSAs against the reference MSA. The selected area of the reference MSA under the blue selection bar includes the subsequence shown in the callout 'V' of Fig. 15 (Fig. 17 also shows the same area of the reference MSA). The magenta-colored pixels show the distribution of characters included in this selected area. In the reference MSA, the magenta-colored area has a relatively small number of gaps, providing the largest aligned overlap, putatively the most phylogenetically informative region, within this large loop region. However, in the alternative MSAs (MSAs 2-5), the corresponding magenta-colored characters are spread over much wider regions (green bars show the ranges covered by corresponding characters). Note that each MSA method found a few conserved subsequences (matched columns) within this region.


However, each method also introduced a large number of gaps, affecting the consistency of the alignments in the surrounding areas immediately before and after the selected region. In spite of the high SPS values observed with these MSAs and the degree of conservation within the alignments, there is little consensus among the MSAs in this phylogenetically critical area. Through visual comparisons among alternative MSAs, it becomes possible to recognize that very different hypotheses could emerge depending on the MSA chosen. Such significant differences among the MSAs are, and should be, alarming to researchers, since such inconsistency in MSAs could affect phylogenetic hypotheses.

Fig. 18. Comparison of 18S rDNA MSAs. Pixel Plot is used to compare four reconstructed MSAs with the reference. The alignment provided by D. A. Morrison was used as the reference and compared with MSAs generated by Probalign, MAFFT, MUSCLE, and ClustalW2.


Cline, M., Hughey, R., & Karplus, K. (2002). Predicting reliable regions in protein sequence

Edgar, R. C. (2004a). MUSCLE: a multiple sequence alignment method with reduced time

Edgar, R. C. (2004b). MUSCLE: multiple sequence alignment with high accuracy and high

Edgar, R. C. (2010). Quality measures for protein alignment benchmarks. *Nucleic acids*

Finn, R. D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J. E., Gavin, O. L., Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, E. L., Eddy, S. R., & Bateman, A. (2010). The Pfam protein families database. *Nucleic acids research,*

Flower, D. R., North, A. C., & Sansom, C. E. (2000). The lipocalin protein family: structural and sequence overview. *Biochimica et biophysica acta,* Vol. 1482, No. 1-2, pp. 9-24.

Guindon, S., Dufayard, J.F., Lefort, V., Anisimova, M., Hordijk, W., & Gascuel, O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. *Systematic biology*, Vol. 59, No. 3, pp. 307-

Horn, F., Bettler, E., Oliveira, L., Campagne, F., Cohen, F. E., & Vriend, G. (2003). GPCRDB information system for G protein-coupled receptors. *Nucleic acids research,* Vol. 31,

Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., Cuche, B. A., de Castro, E., Lachaize, C., Langendijk-Genevaux, P. S., & Sigrist, C. J. (2008). The 20 years of PROSITE. *Nucleic acids research,* Vol. 36, No. Database issue, pp. D245-249.

Inoue, Y., Ikeda, M., & Shimizu, T. (2004). Proteome-wide classification and identification of mammalian-type GPCRs by binary topology pattern. *Computational biology and*

Jones, D. T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. *Journal of molecular biology,* Vol. 292, No. 2, pp. 195-202.

Katoh, K. & Toh, H. (2008). Recent developments in the MAFFT multiple sequence alignment program. *Briefings in bioinformatics,* Vol. 9, No. 4, pp. 286-298.

Kawashima, S., Pokarowski, P., Pokarowska, M., Kolinski, A., Katayama, T., & Kanehisa, M. (2008). AAindex: amino acid index database, progress report 2008. *Nucleic acids research,* Vol. 36, No. Database issue, pp. D202-205.

Kemena, C. & Notredame, C. (2009). Upcoming challenges for multiple sequence alignment methods in the high-throughput era. *Bioinformatics,* Vol. 25, No. 19, pp. 2455-2465.

Kyte, J. & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. *Journal of molecular biology,* Vol. 157, No. 1, pp. 105-132.

Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A., McWilliam, H.,

Löytynoja, A. & Goldman, N. (2005). An algorithm for progressive multiple alignment of [...] *States of America,* Vol. 102, No. 30, pp. 10557-10562.

Löytynoja, A. & Goldman, N. (2008). Phylogeny-aware gap placement prevents errors in

Valentin, F., Wallace, I. M., Wilm, A., Lopez, R., Thompson, J. D., Gibson, T. J., & Higgins, D. G. (2007). Clustal W and Clustal X version 2.0. *Bioinformatics,* Vol. 23,

sequences with insertions. *Proceedings of the National academy of sciences of the United* 

sequence alignment and evolutionary analysis. *Science,* Vol. 320, No. 5883, pp. 1632-

and space complexity. *BMC bioinformatics,* Vol. 5, No. 1, pp. 113.

throughput. *Nucleic acids research,* Vol. 32, No. 5, pp. 1792-1797.

alignments. *Bioinformatics,* Vol. 18, No. 2, pp. 306-314.

*research,* Vol. 38, No. 7, pp. 2145-2153.

321.

No. 1, pp. 294-297.

No. 21, pp. 2947-2948.

1635.

*chemistry,* Vol. 28, No. 1, pp. 39-49.

Vol. 38, No. Database issue, pp. D211-222.

#### **8. Conclusion**

Advancements in the field of bioinformatics and molecular evolution have resulted in many different methods for reconstructing MSAs. Each MSA method has a different objective function and different heuristics for maximizing that function when building the alignment. Nevertheless, if the methods were in fact meant to reconstruct alignments that reflect the evolutionary history of sequences, we would expect some level of consensus among them. Such is not the case in reality. We used five types of alignment problems in this chapter. Using seven different MSA methods, we discussed the similarities and differences among the MSAs built by these methods. We have shown that MSAs can be assessed using a combination of descriptive statistics, both for individual alignments and for the comparison of two alternative alignments. We have also shown that, using the visual tools provided by SuiteMSA, we can examine MSAs based on the alignment of structural features such as secondary structure and transmembrane predictions. We further demonstrated how the sequence simulator included in SuiteMSA can be used to produce benchmark alignments.

We should keep in mind that alignments reconstructed by any MSA method are only hypotheses about the evolutionary relationships of the sequences. Furthermore, while these alignments can be assessed as consistent (or not) with the accepted model for the given sequences (the reference alignment), this reference is itself a hypothesis unless generated by a simulation program, and may not be 'correct'. It is important for the researcher to understand the underlying assumptions of the alignment methods as well as the characteristics of the biological sequences to be aligned, and to assess the resulting alignments. User-friendly graphical tools such as SuiteMSA can assist in the critical assessment of MSAs prior to their use in further studies.

#### **9. Acknowledgements**

We would like to thank Dr. David A. Morrison (Swedish University of Agricultural Sciences) for providing us with the 18S rDNA alignments and for discussion. This work has been partially supported by NSF AToL grant 0732863 to ENM.



**12** 



### **Optimal Sequence Alignment and Its Relationship with Phylogeny**

Atoosa Ghahremani and Mahmood A. Mahdavi *Department of Chemical Engineering, Ferdowsi University of Mashhad, Azadi Square, Pardis Campus, Mashhad, Iran* 

#### **1. Introduction**

242 Bioinformatics – Trends and Methodologies

Morrison, D. A. (2009a). Evolution of the Apicomplexa: where are we now? *Trends in* 

Morrison, D. A. (2009b). Why would phylogeneticists ignore computerized sequence

Nugent, T. & Jones, D. T. (2009). Transmembrane protein topology prediction using support

Pei, J. & Grishin, N. V. (2007). PROMALS: towards accurate multiple sequence alignments of distantly related proteins. *Bioinformatics,* Vol. 23, No. 7, pp. 802-808. Pei, J., Kim, B. H., & Grishin, N. V. (2008). PROMALS3D: a tool for multiple protein sequence and structure alignments. *Nucleic acids research,* Vol. 36, No. 7, pp. 2295-2300. Pirovano, W., Feenstra, K. A., & Heringa, J. (2008). PRALINETM: a strategy for improved

Pirovano, W. & Heringa, J. (2010). Protein secondary structure prediction. *Methods in* 

Raghava, G. P., Searle, S. M., Audley, P. C., Barber, J. D., & Barton, G. J. (2003). OXBench: a

Roshan, U. & Livesay, D. R. (2006). Probalign: multiple sequence alignment using partition function posterior probabilities. *Bioinformatics,* Vol. 22, No. 22, pp. 2715-2721. Schneider, T. D. & Stephens, R. M. (1990). Sequence logos: a new way to display consensus

Shannon, C. E. (1948). A mathematical theory of communication. *Bell system technical journal,* 

Stebbings, L. A. & Mizuguchi, K. (2004). HOMSTRAD: recent developments of the

Strope, C. L., Abel, K., Scott, S. D., & Moriyama, E. N. (2009). Biological sequence simulation

Subramanian, A. R., Weyer-Menkhoff, J., Kaufmann, M., & Morgenstern, B. (2005).

Thompson, J. D., Koehl, P., Ripp, R., & Poch, O. (2005). BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. *Proteins,* Vol. 61, No. 1, pp. 127-136. Thompson, J. D., Linard, B., Lecompte, O., & Poch, O. (2011). A comprehensive benchmark

Thompson, J. D., Plewniak, F., & Poch, O. (1999). BAliBASE: a benchmark alignment

Van Walle, I., Lasters, I., & Wyns, L. (2005). SABmark--a benchmark for sequence alignment

Vroling, B., Sanders, M., Baakman, C., Borrmann, A., Verhoeven, S., Klomp, J., Oliveira, L.,

sequences. *Nucleic acids research,* Vol. 18, No. 20, pp. 6097-6100.

multiple alignment of transmembrane proteins. *Bioinformatics,* Vol. 24, No. 4, pp.

benchmark for evaluation of protein multiple sequence alignment accuracy. *BMC* 

Homologous Protein Structure Alignment Database. *Nucleic acids research,* Vol. 32,

for testing complex evolutionary hypotheses: indel-Seq-Gen version 2.0. *Molecular* 

DIALIGN-T: an improved algorithm for segment-based multiple sequence

study of multiple sequence alignment methods: current challenges and future

database for the evaluation of multiple alignment programs. *Bioinformatics,* Vol. 15,

that covers the entire known fold space. *Bioinformatics,* Vol. 21, No. 7, pp. 1267-1268.

de Vlieg, J., & Vriend, G. (2011). GPCRDB: information system for G proteincoupled receptors. *Nucleic acids research,* Vol. 39, No. Database issue, pp. D309-319. Wistrand, M., Kall, L., & Sonnhammer, E. L. (2006). A general model of G protein-coupled

receptor sequences and its application to detect remote homologs. *Protein science,* 

alignment? *Systematic biology,* Vol. 58, No. 1, pp. 150-158.

vector machines. *BMC bioinformatics,* Vol. 10, pp. 159.

*parasitology,* Vol. 25, No. 8, pp. 375-382.

*molecular biology,* Vol. 609, pp. 327-348.

*bioinformatics,* Vol. 4, pp. 47.

No. Database issue, pp. D203-207.

*biology and evolution,* Vol. 26, No. 11, pp. 2581-2593.

alignment. *BMC bioinformatics,* Vol. 6, pp. 66.

perspectives. *PloS one,* Vol. 6, No. 3, pp. e18093.

Vol. 27, pp. 379-423.

No. 1, pp. 87-88.

Vol. 15, No. 3, pp. 509-521.

492-497.

The main motivation for predicting the functions of the hundreds of thousands of genes and proteins found across genomes and proteomes is that variations within a family of related nucleic acid or protein sequences provide a valuable source of information for evolutionary biology. Protein molecules are more diverse in structure and function than any other kind of molecule. When nucleic acid sequences undergo mutations, insertions, crossing-over and other changes, these variations have a direct effect on the encoded protein molecules (Fitch, 1970; Pearson et al., 1997). If a protein sequence is present in many different organisms, or is conserved during evolution, it is predicted to have a similar function in all of those organisms. Two molecules of related function usually have similar sequences and, reciprocally, two molecules of similar sequence usually have related functions (Dardel, 2006). The objective of bioinformatics is to detect such similarities using computational methods in order to draw biological conclusions. The wealth of available sequence information helps to trace ancient genes back through the tree of life and to discover new organisms based on their sequences (Fitch, 1966). Different genes may show different evolutionary histories, reflecting transfers of genetic material between species. If we know the function and/or structure of one member of an evolutionary family, we can predict the function of all the other members and even identify the important functional groups. For this, we need to identify which proteins belong to the same family, that is, which proteins have evolved from the same ancestor through a set of accepted mutation events. Such proteins have amino acid sequences that are likely to be more similar than expected for unrelated protein sequences. When two or more sequences share a common evolutionary ancestor, they are called homologous (Fitch, 1970).

There are no degrees of homology: sequences are either homologous or not (Reeck et al., 1987; Tautz, 1998). Homologous proteins almost always share a significantly related three-dimensional structure. An example of very similar structures determined by X-ray crystallography is retinol-binding protein (RBP) and β-lactoglobulin (Fig. 1). Once homology between related sequences has been inferred, identity and similarity are the quantities used to describe their relatedness. Two sequences may be homologous without sharing statistically significant identity. In general, three-dimensional structures diverge much more slowly than the amino acid sequence identity between two proteins (Chothia & Lesk, 1986).

There are two types of homology: orthology and paralogy. Orthologs are homologous sequences in different species that arose from a common ancestral gene during a speciation event. Orthologous sequences are predicted to have similar biological functions (in Fig. 2, human and rat RBPs both transport vitamin A in serum). Paralogs are homologous sequences that arose through gene duplication. An example of paralogous sequences is human plasma RBP and another carrier protein, human apolipoprotein D (Fig. 3). Paralogous sequences are predicted to have distinct but related functions (Pevsner, 2003a; Mount, 2001a).

Homology inference relies heavily on the alignment of the primary structures of proteins and of DNA sequences. Alignment is a procedure for identifying matching residues within the sequences, that is, residues that share the same functional and/or structural role in the different members of a family (Xu & Miranker, 2003). After performing the alignment and evaluating the alignment scores, the most closely related sequence pairs become apparent and may be placed in the outer branches of an evolutionary tree. By continuing the alignment procedure for different sequences of a particular gene, a predicted pattern of evolution for that gene is generated, and a tree is obtained from which the changes that have taken place along its branches can be inferred. Therefore, the first step in building a phylogenetic tree is a sequence alignment (Feng, 1985). The sequence similarity score is an indicator of relatedness for each pair of sequences. From these scores, a tree is derived that best accounts for the numbers of changes (distances) between the sequences.

Fig. 1. Three-dimensional structure of two lipocalins: bovine RBP (left side) and bovine β-lactoglobulin (right side). These two proteins are homologous (evolved from a common ancestor), and they share a very similar three-dimensional structure consisting of a binding pocket for a ligand and eight antiparallel beta strands.


Fig. 2. Orthologous RBPs. In this tree, sequences that are more closely related to each other are grouped closer.

Fig. 3. Paralogs among human lipocalin proteins. Each of them is a member of the lipocalin protein family.


#### **2. Alignment approaches**

Sequence alignment is a way of comparing two (pair-wise alignment) or more than two (multiple alignment) sequences. The procedure looks for series of individual residues or residue patterns that occur in the same order in the sequences. It is useful for discovering functional, structural, and evolutionary information in biological sequences (Wen et al., 2005; Berezin et al., 2003; Smoot, 2003). If sequence analysis finds sequences that are very much alike, they probably have the same or similar biochemical functions and, for protein sequences, three-dimensional structures. If two sequences from different organisms are similar, they may have evolved from a common ancestor, in which case the sequences are defined to be homologous (Doolittle, 1981; Fitch & Smith, 1983; Feng & Doolittle, 1985). There are two approaches to sequence alignment: multiple sequence alignment and pair-wise sequence alignment.

#### **2.1 Multiple sequence alignment**

Multiple sequence alignment is a widely used method for comparing subsequences or the entire lengths of more than two sequences and for discovering the relationships among their host organisms (Fig. 4). If two sequences are evolutionarily very close, most of their residues remain unchanged and it is rather difficult to detect the important residues. On the other hand, if two sequences are evolutionarily distant, a reliable alignment is much more difficult to obtain. Aligning a large number of homologous protein sequences alleviates both problems. The alignment identifies the highly conserved residues that define structural and functional domains in protein families. New members of these families with the same domains can then be found by searching sequence databases. A multiple sequence alignment implies a pair-wise alignment for each pair of sequences, and its score can be defined as the sum of the scores of all implied pair-wise alignments. Multiple sequence alignment often tells us more than pair-wise alignment because it is more informative about evolutionary conservation (Edgar & Sjolander, 2004). Among the most widely used programs for multiple sequence alignment are CLUSTALW, which performs the alignment, and CLUSTALX, which adds a graphical interface to it (Larkin et al., 2007).
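The sum-of-pairs idea described above can be illustrated with a minimal Python sketch. The function names (`pair_score`, `sum_of_pairs`), the toy alignment and the scoring values (match = 1, mismatch = -1, gap = -2) are assumptions chosen for illustration only; real programs score residue pairs with substitution matrices and more elaborate gap models.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one column position for one pair of aligned sequences."""
    if a == "-" and b == "-":
        return 0          # gap opposite gap carries no information for this pair
    if a == "-" or b == "-":
        return gap        # residue opposite a gap
    return match if a == b else mismatch

def sum_of_pairs(msa, **scores):
    """Add up the scores of all pair-wise alignments implied by the MSA rows."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        pair_score(a, b, **scores)
        for s, t in combinations(msa, 2)   # every pair of rows
        for a, b in zip(s, t)              # every column of that pair
    )

# Hypothetical three-sequence alignment used only to exercise the function.
msa = ["ACG-T", "ACGAT", "A-GAT"]
print(sum_of_pairs(msa))
```

Because every pair of rows contributes to the total, the cost of evaluating the score grows quadratically with the number of sequences in the alignment.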

#### **2.2 Pair-wise sequence alignment**

In pair-wise alignment, two sequences are placed directly next to each other in two rows. For protein sequences, the single-letter amino acid code is used. Identical or similar residues are placed in the same column, and non-identical residues are placed either in the same column as a mismatch or opposite a gap in the other sequence. Gaps are introduced to shift the residues (without disturbing their order) so as to obtain the largest possible number of matched residues and to bring the two rows to the same length. Pair-wise alignment also identifies residues that are similar but not identical. Similar residues share biochemical properties and are functionally and structurally related. Two aligned similar residues represent a conservative substitution that occurred during evolution. Groups of amino acids with similar properties include the acidic amino acids "D, E", the basic amino acids "K, R, H", the hydroxylated amino acids "S, T", and the hydrophobic amino acids "W, F, Y, L, I, V, M, A" (Pevsner, 2003a).



Fig. 4. Multiple sequence alignment of a portion of the glyceraldehyde 3-phosphate dehydrogenase (GAPDH) protein from six organisms.

For homology inference, some quantities must be calculated after aligning two sequences, including percent identity and percent similarity. The percent identity is the number of identical residues divided by the length of the alignment; identical residues are marked with (|) in the alignment (Fig. 5). The percent similarity (or positives) of two protein sequences is the sum of identical and similar matches divided by the length of the alignment; similar residues are marked with (:). Because the similarity measure depends on which of a variety of definitions is used to decide whether residues are related, it is more useful to consider the degree of identity shared by two protein sequences. When aligning sequences of different lengths, no column may consist solely of gap characters. In an optimal alignment, mismatched residues and gaps are positioned so that as many identical and similar residues as possible are aligned.
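As a rough illustration of these two quantities, the following Python sketch computes percent identity and percent similarity for an already aligned pair and builds the (|) and (:) marker line. The similarity groups are the amino acid groups listed in Section 2.2; treating group membership as the definition of "similar" is a simplifying assumption, since alignment programs normally derive positives from a substitution matrix. The function names and example sequences are hypothetical.

```python
# Groups of biochemically similar residues, as listed in Section 2.2.
SIMILAR_GROUPS = [set("DE"), set("KRH"), set("ST"), set("WFYLIVMA")]

def are_similar(a, b):
    """True when both residues fall in the same similarity group."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)

def identity_and_similarity(row1, row2):
    """Return (percent identity, percent similarity, marker line)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    identical = similar = 0
    marks = []
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("a column may not consist solely of gaps")
        if a == "-" or b == "-":
            marks.append(" ")          # residue opposite a gap
        elif a == b:
            identical += 1
            marks.append("|")          # identical residues
        elif are_similar(a, b):
            similar += 1
            marks.append(":")          # similar, non-identical residues
        else:
            marks.append(" ")          # mismatch
    length = len(row1)                 # alignment length, gaps included
    pct_identity = 100.0 * identical / length
    pct_similarity = 100.0 * (identical + similar) / length
    return pct_identity, pct_similarity, "".join(marks)

# Hypothetical aligned pair used only for demonstration.
pid, psim, marks = identity_and_similarity("KTDE-LS", "RTEEAVA")
print("KTDE-LS")
print(marks)
print("RTEEAVA")
print(f"identity {pid:.1f}%, similarity {psim:.1f}%")
```

Here the denominator is the full alignment length, as in the definition above; some tools instead divide by the length of the shorter sequence or by the number of ungapped columns, which gives different percentages for the same alignment.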

#### **2.2.1 Gaps and gap penalties**

To obtain the best possible alignment, gaps must be introduced and gap penalties must be included in the alignment score. Adding gaps to an alignment may be biologically relevant because gaps reflect evolutionary changes that have occurred. They also allow the full alignment of two proteins. Gaps represent two of the three common types of mutation that occurred during evolution and caused the sequences of the two proteins to diverge: insertions and deletions. These occur when residues are added or removed relative to the ancestral protein sequence, and they require null characters (gaps) to be entered in one of the sequences during alignment. There are two types of gap penalty: a gap opening penalty for any gap (g) and a gap extension penalty for each element in the gap (r) (Resee, 2002; Edgar, 2009). Thus, the total gap penalty $w_x$ can be calculated as:

$$w_x = g + r x \tag{1}$$


where x is the length of the gap. There are several forms of gap penalty: (1) the constant penalty, the simplest form, in which each gap is given a constant penalty independent of its length; (2) the proportional penalty, in which the penalty is proportional to the length of the gap, so that longer gaps receive higher penalties than shorter ones; and (3) the affine gap penalty, the most complex form (Fig. 6), which has both constant and proportional contributions. The motivation for the affine gap penalty is that opening a gap should be strongly penalized, but once a gap is opened it should cost less to extend it. If the gap penalty is too high relative to the range of scores in the substitution matrix, gaps will never appear in the alignment; conversely, if it is too low compared to the matrix scores, gaps will appear everywhere in the alignment in order to align as many identical residues as possible.
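The three forms of gap penalty can be compared with a small sketch. The function names and the numeric values of g and r below are hypothetical and chosen only for illustration; the affine form follows equation (1).

```python
def constant_penalty(x, g):
    """Constant penalty: the same cost g for any gap, whatever its length x."""
    return g if x > 0 else 0


def proportional_penalty(x, r):
    """Proportional penalty: cost grows linearly with the gap length x."""
    return r * x


def affine_penalty(x, g, r):
    """Affine penalty of equation (1): opening cost g plus r per gap element."""
    return g + r * x if x > 0 else 0


# Hypothetical parameter values, only to show how the three forms diverge.
for length in (1, 3, 10):
    print(length,
          constant_penalty(length, g=10),
          proportional_penalty(length, r=1),
          affine_penalty(length, g=10, r=1))
```

With these example values, a long gap costs little more than a short one under the affine form, whereas several separate short gaps would each pay the opening cost again.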


Fig. 5. Pair-wise alignment of human RBP and β-lactoglobulin. The alignment is global (the entire length of each protein is aligned) and there are many positions of identity between the two sequences (shown with |). The dots have different meanings. (1) Paired dots indicate similarity (as between R and K, which share similar biochemical properties). (2) Single dots also indicate similarity, but less than paired dots. (3, 4) Dots in place of alphabetic characters along the sequences show internal and external gaps. (5) A dot above the sequences marks every 10 residues.

$$\mathsf{Score} = \mathsf{Max}(\mathsf{S})$$

Fig. 6. A typical illustration of calculating an affine gap penalty.

#### **2.3 Alignment algorithms**


For short and very closely related sequences, finding the best alignment is easy. However, when sequences are long and not closely related, finding the best alignment is rather difficult. If gaps are introduced to account for deletions or insertions in the two sequences, the number of possible alignments increases exponentially. In these cases, computational methods are required. The established computational methods for this task are dynamic programming algorithms, which take two input sequences and produce the best alignment between them as output (Sankoff, 1972).

In general, there are two approaches to aligning sequences: global alignment and local alignment. In global alignment, the entire length of each sequence is subject to alignment. Sequences that are quite similar and of approximately the same length are suitable for global alignment. In local alignment, the subsequences with the highest number of identical or similar residues are aligned, generating an alignment that terminates at the ends of the regions of strong similarity. This type of alignment is suitable for sequences that are similar along some regions of their length but dissimilar in others, for sequences of different lengths, and for sequences that share conserved regions. In sequence similarity analysis two dynamic programming algorithms are commonly used: the Needleman-Wunsch algorithm and the Smith-Waterman algorithm. They are closely related; the main difference is that the Needleman-Wunsch algorithm finds global similarity between sequences while the Smith-Waterman algorithm finds local similarity. The Smith-Waterman algorithm is the more widely used, because in reality biological sequences are often not similar over their entire lengths but only in particular regions (Pearson, 1992; Smith & Waterman, 1981a; Smith et al., 1981b).
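How closely the two algorithms are related can be seen in a minimal sketch that computes only the optimal score. It is a simplification under stated assumptions: the function name `alignment_score` is hypothetical, the match, mismatch, and gap values are arbitrary, and a simple per-residue (proportional) gap penalty is used instead of an affine one. The common textbook formulation that fills the matrix from the top left is used here.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=2, local=False):
    """Optimal alignment score by dynamic programming.

    local=False: global (Needleman-Wunsch style) score of the full sequences.
    local=True:  local (Smith-Waterman style) score of the best subsequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # In global alignment, leading gaps are penalized.
        for i in range(1, rows):
            H[i][0] = -gap * i
        for j in range(1, cols):
            H[0][j] = -gap * j
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                        H[i - 1][j] - gap,     # gap in b
                        H[i][j - 1] - gap)     # gap in a
            if local:
                score = max(score, 0)          # local scores never go negative
            H[i][j] = score
            best = max(best, score)
    # Global: score of aligning both sequences end to end.
    # Local: highest score found anywhere in the matrix.
    return best if local else H[-1][-1]


print(alignment_score("GGGGACGT", "ACGTCCCC"))              # global score
print(alignment_score("GGGGACGT", "ACGTCCCC", local=True))  # local score
```

The two example sequences share only the region ACGT, so the local score reflects that region alone, while the global score is lowered by the unrelated flanking residues.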

#### **2.3.1 Global sequence alignment**

The Needleman-Wunsch algorithm is one of the first and most important dynamic programming algorithms for aligning two protein sequences. Its importance lies in the fact that it produces an optimal alignment of protein or DNA sequences even when gaps are allowed. Generating a global sequence alignment with this algorithm takes three steps: (1) setting up an identity matrix, (2) scoring the matrix, and (3) identifying the optimal alignment. In the first step, the two sequences are placed in a two-dimensional matrix (Fig. 7). The first sequence, of length "m", is arranged horizontally along the x axis so that each amino acid residue corresponds to a column. The second sequence, of length "n", is listed vertically along the y axis so that each amino acid residue corresponds to a row. To generate an amino acid identity matrix, each cell simply takes a value of +1 if the corresponding residues in the row and column are identical and zero otherwise. Thus, for two identical sequences, the +1 values in this matrix describe a diagonal line from top left to bottom right.

In the second step, a scoring matrix is generated. The assignment of scores starts from the bottom right of the matrix, corresponding to the carboxy termini of the proteins, and proceeds to the top. Several rules govern how a path corresponding to the sequence alignment moves through the matrix. Briefly, to set up the scoring matrix, at position *i*, *j* take the value of the cell plus the maximum score obtained from any of the following three values:

1. The score diagonally down (at position *i+1*, *j+1*), without including any gaps.
2. The highest score found from position *i+1*, *j+2* to the end of row *j*. Finding the highest score in this position causes the addition of a gap in the column. The number of gaps can be greater than 1.
3. The highest score found from position *i+2*, *j+1* to the end of column *i*. This finding corresponds to the addition of a gap in the row.

The third step is identifying the optimal alignment, i.e. the path through the matrix that maximizes the score. A trace-back strategy is used to find a path through as many positions of identity as possible while introducing as few gaps as possible. We begin at the upper left of the matrix (the amino termini of the proteins) with the highest value (in Fig. 7 this value is "+8", corresponding to an alignment of residues A to A). Then we find the path down and to the right with the highest numbers along the diagonal. Going off the diagonal automatically implies the insertion of a gap in one of the sequences and incurs a penalty. There may be more than one optimal alignment, all with an equally high score (Fig. 7). Multiple optimal alignments arise in such cases, which use a unitary scoring scheme; with a more sophisticated scoring matrix, such as those of the BLOSUM and PAM series, multiple optimal alignments are unlikely. To evaluate the resulting global alignment, the percent identity and similarity shared by the two proteins, the length of the alignment, and the number of gaps introduced into the alignment are calculated (Needleman & Wunsch, 1970).

Fig. 7. Global pair-wise alignment of two amino acid sequences using a dynamic programming algorithm. The generation of the scoring matrix and the trace-back procedure for obtaining the optimal alignment path are shown; the alignments corresponding to the two equally optimal paths are shown in sections d (the upper path) and e (the lower path).

#### **2.3.2 Local sequence alignment**

Local alignment, a modified dynamic programming algorithm, seeks the highest scoring local match between two sequences. This algorithm, proposed by Smith and Waterman (1981), is a very powerful method for finding the high scoring subsequences of two protein or DNA sequences. It is very useful in a variety of applications such as database searching. In general, the algorithm generates a matrix from two protein sequences and then finds the optimal path along a diagonal, like the global algorithm, but the alignment does not necessarily extend to the ends of the two sequences, and there is no penalty for starting the alignment at some internal position.

The Smith-Waterman algorithm constructs a matrix with an extra row along the top and an extra column on the left side. Thus, for two sequences of lengths "m" and "n", the matrix dimension is m+1 by n+1. The score of each cell is selected as the maximum of the score from the preceding diagonal and the score obtained from the introduction of a gap, but the score cannot be negative: if a negative value would be generated in a cell, a zero is inserted in the cell instead (Fig. 8). The score of each cell *i, j*, or *H(i, j)*, is given as the maximum of four possible values:

1. The score located at position *i-1, j-1* (diagonally up to the left), added to the new score *s(i, j)*, which is either a match (1) or a mismatch (-0.3).
2. The score at position *i, j-1*, located one cell to the left, minus a gap penalty.
3. The score at position *i-1, j*, immediately above the new cell, minus a gap penalty.
4. Zero, which ensures that there is no negative value in the matrix.


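For two sequences, $a = a_1 a_2 \dots a_n$ and $b = b_1 b_2 \dots b_m$, where $H_{i,j} = H(a_1 a_2 \dots a_i,\ b_1 b_2 \dots b_j)$, then:

$$H_{x,0} = H_{0,y} = 0 \quad \text{for } 0 \le x \le n \text{ and } 0 \le y \le m$$

$$H_{i,j} = \max\{H_{i-1,j-1} + s(a_i, b_j),\ \max_{x \ge 1}(H_{i-x,j} - w_x),\ \max_{y \ge 1}(H_{i,j-y} - w_y),\ 0\}, \quad 1 \le i \le n \text{ and } 1 \le j \le m \tag{2}$$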


$$w_x = 1 + \frac{1}{3}x \qquad \text{and} \qquad w_y = 1 + \frac{1}{3}y \tag{3}$$

In equation (2), $H_{i,j}$ is the score at position *i* in sequence *a* and position *j* in sequence *b*, and $s(a_i, b_j)$ is the score for aligning the characters at positions *i* and *j*. In equation (3), $w_x$ is the penalty for a gap of length *x* in sequence *a*, and $w_y$ is the penalty for a gap of length *y* in sequence *b*.

The maximal alignment can begin and end anywhere in the matrix, provided that the linear order of the two amino acid sequences is not violated. The trace-back procedure finds the highest value in the matrix and begins the alignment from that position. It proceeds diagonally up to the left until a cell with a value of zero is reached. The zero value defines the start of the alignment, which is not necessarily at the extreme top left of the matrix (Smith & Waterman, 1981).
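The local scoring rule and the trace-back described above can be put together in a short sketch. It is an illustration under stated assumptions rather than a reference implementation: the function name is hypothetical, the defaults use a match score of 1, a mismatch score of -1/3, and a gap penalty that grows as 1 + k/3 with gap length k, and the trace-back follows stored moves, so it also steps through gaps, not only along the diagonal.

```python
def smith_waterman(a, b, match=1.0, mismatch=-1 / 3, gap=lambda k: 1 + k / 3):
    """Return (best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    # Extra first row and column, initialized to zero.
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    # move[i][j] = (di, dj): step back to the predecessor cell, None = stop.
    move = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [(0.0, None), (H[i - 1][j - 1] + s, (1, 1))]
            # A gap of length k that ends at this cell, in either sequence.
            candidates += [(H[i - k][j] - gap(k), (k, 0)) for k in range(1, i + 1)]
            candidates += [(H[i][j - k] - gap(k), (0, k)) for k in range(1, j + 1)]
            H[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest cell until a zero-valued cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while move[i][j] is not None:
        di, dj = move[i][j]
        if di and dj:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
        elif di:
            for k in range(di):
                top.append(a[i - 1 - k])
                bottom.append("-")
        else:
            for k in range(dj):
                top.append("-")
                bottom.append(b[j - 1 - k])
        i, j = i - di, j - dj
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, row_a, row_b = smith_waterman("CCCCGGTTTT", "CCCCTTTT")
print(round(score, 3))
print(row_a)
print(row_b)
```

In this toy example the two matching blocks are joined across a gap of length two, because paying one gap penalty costs less than giving up four matches.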


Fig. 8. A typical example of pair-wise local sequence alignment using the Smith-Waterman algorithm.

#### **3. Rapid and heuristic versions of Smith-Waterman: FASTA and BLAST**

Theoretically, sequence alignment techniques are based upon two different backgrounds (Pearson, 1996, 1988): dot matrix analysis (Gibbs & McIntyre, 1970) and dynamic programming analysis such as Needleman-Wunsch and Smith-Waterman. Dot matrix analysis is used when the sequences are known to be very much alike, and this similarity is clearly observed by displaying any possible alignments as diagonals on the matrix. This analysis readily reveals insertions, deletions, and direct and inverted repeats that are found with difficulty by the other methods. However, a major limitation is that most dot matrix programs do not show an actual alignment. To compare sequences with this analysis, one sequence (A) is listed across the top of a page and the other sequence (B) is listed down the left side. Starting with the first character in sequence B, one moves across the page to the end of the first row, placing a dot in any column where the character in sequence A is the same. This continues until the page is filled with dots representing all the possible matches of A characters with B characters. Any region of similar residues is identified by a string of dots located on a diagonal. Other dots, located anywhere off the diagonal, represent random matches that are probably not related to any significant alignment.
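The dot-placing procedure described above can be sketched as follows. The function names and the plain-text rendering are assumptions made for illustration; real dot matrix programs usually add a sliding window and a score threshold to suppress random matches.

```python
def dot_matrix(seq_a, seq_b):
    """One row per character of B, one column per character of A.

    A cell is True where the two characters are identical.
    """
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the matrix as text: A across the top, B down the left side."""
    lines = ["  " + " ".join(seq_a)]
    for b, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        lines.append(b + " " + " ".join("." if hit else " " for hit in row))
    return "\n".join(lines)


print(render("ACGTACGT", "CGTAC"))
```

A shared region shows up as a run of dots along a diagonal, and a repeated region in sequence A shows up as a second, parallel diagonal.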

There are three types of variation for the analysis of two protein sequences by the dot matrix method. First, one can use the chemical similarity of the amino acid R groups, or some other feature, to determine a similarity score. Second, one can apply specific scoring matrices such as PAM and BLOSUM. These matrices provide scores for matches based on their occurrence in alignments of protein families (these matrices will be described in section 4) (States & Boguski, 1991). Finally, the sequences can be analyzed by producing several different matrices, each with a different scoring system, and averaging the different scores. This method is suitable for more distantly related proteins.

Although alignment algorithms based on dynamic programming, such as Smith-Waterman, are guaranteed to find the optimal alignment(s) between two sequences, they are relatively slow. For pairwise alignment, speed is not a problem, but in database searching, that is, comparing one query sequence with an entire database, the speed of the algorithm becomes an important factor. For most algorithms there is a parameter N that refers to the number of data items to be processed, and the time the algorithm needs to perform a task is greatly affected by this parameter. If the running time is proportional to N, then doubling N doubles the running time. For both dynamic programming algorithms, Needleman-Wunsch and Smith-Waterman, the memory space and the time required for aligning two sequences are proportional to the product of the lengths of the two sequences, m×n, and for the search of a database of size N, to m×n×N. Rapid alternatives to the Smith-Waterman algorithm were therefore developed, such as FASTA (Pearson and Lipman, 1988) and BLAST (Basic Local Alignment Search Tool) (Altschul et al., 1990). These algorithms are heuristic: they restrict the search by scanning a database for likely matches before performing the actual alignment. They therefore require less time, but they are not guaranteed to find optimal alignments.

#### **3.1 FASTA heuristic algorithm**

This algorithm divides the query sequence and every database sequence into short subsequences of a chosen length (two or three amino acids for protein sequences), called "words". The positions of the words in the query and in the database sequences are then recorded. The ktup value, i.e. the word length, determines how many consecutive identities are required for a match to be declared. The smaller the ktup value, the more sensitive the search. Often ktup = 2 is used for proteins and ktup = 6 for nucleotides. The same word can appear more than once in a sequence without affecting the algorithm (Pearson, 2000). After the sequences have been divided into consecutive words according to the ktup value, the relative position (offset) of each shared word is calculated by subtracting its position in the query from its position in the database sequence. Words that have the same offset can be part of the same alignment without insertions or deletions. Therefore, by constructing a look-up table, all dense regions of identities between two sequences are identified. Next, each of these regions is scored using the PAM250 matrix, and the 10 highest-scoring regions are kept for each database sequence. The score of the best of these initial regions is called init1 and is used to rank the matches for further analysis. Longer regions of identity are generated by joining initial regions with scores greater than a certain threshold. The initn score is the sum of the scores of these joined regions after subtracting a penalty for the gaps between them. In later versions of FASTA, an optimization step was added. When the initn score reaches a certain threshold value, the score of the region is recalculated to produce an OPT score by performing a full local alignment of the region with the Smith-Waterman dynamic programming algorithm. This optimization increases the sensitivity but decreases the selectivity of the search (Pearson, 1990, 1991, 1998; Tramontano, 2006; Mahdavi, 2010). These scores (initn and OPT) are the basis for ranking database matches.
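The word look-up and offset calculation described above can be sketched as follows. This is a minimal illustration rather than the FASTA implementation: the function names are hypothetical, positions are 0-based, and the sequences are simply the two short strings used later in the BLAST example. Words sharing an offset lie on the same diagonal, so offsets with many hits mark dense regions of identity.

```python
from collections import defaultdict

def word_positions(seq, ktup):
    """Look-up table: each word of length ktup -> list of start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        table[seq[i:i + ktup]].append(i)
    return table

def diagonal_hits(query, subject, ktup=2):
    """Count identical words per offset (subject position minus query position)."""
    query_table = word_positions(query, ktup)
    hits = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in query_table.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return dict(hits)

# Illustrative sequences only; ktup = 2 as is common for proteins.
hits = diagonal_hits("CINCINNATI", "PRECINCTS", ktup=2)
print(hits)
```

Here the offset with the most shared words would be the first candidate for an initial region; scoring with PAM250 and joining regions are not shown.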

#### **3.2 BLAST heuristic algorithm**

The BLAST algorithm was introduced as a tool for sequence similarity searching that is faster than FASTA while remaining as sensitive. The BLAST web server (http://www.ncbi.nlm.nih.gov) is the most widely used service for sequence database searches and is backed by a powerful computer system. The original version of BLAST looks for contiguous regions of similarity between the query and database sequences, without gaps. As in FASTA, speed is gained by first searching for common words (k-tuples) in the query sequence and each database sequence. While FASTA searches for all possible words of the same length, BLAST restricts the search to the words that are most significant. The word length is fixed at 3 for proteins and 11 for nucleic acids. This is the minimum length needed to achieve a word score high enough to be significant, yet not so long that short but significant patterns are missed. Searching a protein sequence database with a query protein sequence by BLAST involves three steps (Altschul et al., 1990, 1994, 1997). First, the program compiles a preliminary list of pairwise alignments called "word pairs". Then the algorithm scans the database for word pairs that meet a threshold score T. Finally, it extends the word pairs to find those alignments that score better than the cutoff score S. Scores are calculated from scoring matrices (such as BLOSUM62) along with gap penalties.

In the preprocessing stage, the query string is divided into words of length 3. The goal of this stage is to build a hash table called the query index. The keys of the hash table are the 20×20×20=8000 possible three-letter words. The value associated with each key is the list of positions of the query words that score highly when aligned against the key word. The default threshold for a high score with the BLOSUM62 scoring matrix is 11. This threshold, the neighborhood word score threshold (T), is chosen to reduce the number of possible matches. For example, if the three-letter word PQG occurs in the query sequence, the score of this word matched to itself is calculated from the log-odds BLOSUM62 matrix as the score for a P-P match plus that for a Q-Q match plus that for a G-G match, which equals 7+5+6=18. Similarly, the match of PQG to PEG scores 15, to PRG 14, to PSG 13, and to PQA 12. For DNA words, the score is +5 for a match and -4 for a mismatch. Applying the threshold shortens the list of possible matching words from 8000 (for word length w = 3) to the highest-scoring words that satisfy the threshold. The preprocessing stage is repeated for each three-letter word in the query sequence. The remaining high-scoring words, which cover the possible matches to each three-letter position in the query sequence, are listed in the query index so that the database sequences can be searched rapidly.

In the second step, each database sequence is scanned for an exact match to one of the words listed in the query index. If a match is found, it is used to seed a possible ungapped alignment between the query and database sequences.

In the last step, an attempt is made to extend the alignment from the matching words in each direction along the sequences. The extension continues as long as the score increases and stops once the accumulated score no longer increases and has begun to fall a small amount below the best score found for shorter extensions (Dawid, 2001; Pevsner, 2003b). At this point, a longer stretch of sequence (called the HSP, or high-scoring segment pair) with a greater score than the original word has been found. To determine a suitable value for S, the range of scores found by comparing random sequences is examined and significant values are selected. In the later version of BLAST, called BLAST2 or gapped BLAST (Altschul et al., 1997; Brenner et al., 1998), a list of high-scoring matching words is made as in the original method, except that a lower value of T, the word cutoff score, is used. The lower cutoff score produces a longer word list and matches to lower-scoring words in the database sequences.
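The word scores quoted for PQG can be reproduced with a few BLOSUM62 entries. The sketch below is only an illustration: the dictionary holds just the residue pairs needed here (not the full matrix), the names `BLOSUM62_SUBSET`, `pair_score` and `word_score` are hypothetical, and the candidate list is restricted to the words mentioned in the text, with PRG standing for the word in which Q is aligned with R.

```python
# Only the BLOSUM62 entries needed for this example; keys are sorted residue pairs.
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("E", "Q"): 2, ("Q", "R"): 1, ("Q", "S"): 0, ("A", "G"): 0,
}

def pair_score(a, b):
    """Symmetric look-up of a substitution score."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]

def word_score(word1, word2):
    """Sum of position-wise substitution scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(word1, word2))

T = 11  # neighborhood word score threshold
candidates = ["PQG", "PEG", "PRG", "PSG", "PQA"]
scores = {word: word_score("PQG", word) for word in candidates}
neighbors = [word for word in candidates if scores[word] >= T]
print(scores)
```

In a full implementation the same scoring would be applied to all 8000 possible words for every query position, and only the words reaching T would be stored in the query index.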

Filtering programs are used to remove low-complexity regions, which are not useful for producing meaningful sequence alignments. Filtering masks those portions of the query sequence that consist of commonly found stretches of amino acids or nucleotides with limited information content. The SEG program is used for protein sequence queries and the DUST program for nucleic acid sequences. With these programs, low-complexity residues are replaced by a run of the letter X (for protein sequences) or N (for nucleic acid sequences). In general, filtering is useful for avoiding spurious database matches, but in some cases authentic matches may be missed.
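The idea of masking regions with limited information content can be illustrated with a simple sliding-window entropy filter. This is not the SEG or DUST algorithm; it is a rough sketch under assumed settings, and the function name, the window length of 12 and the entropy threshold of 2.2 bits are arbitrary choices made for the example.

```python
import math
from collections import Counter

def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace residues in low-entropy windows with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        counts = Counter(seq[start:start + window])
        # Shannon entropy (bits) of the residue composition of this window.
        entropy = -sum((c / window) * math.log2(c / window)
                       for c in counts.values())
        if entropy < min_entropy:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)

# A made-up protein fragment containing a poly-alanine stretch.
example = "MKT" + "A" * 15 + "LVDERWQ"
print(mask_low_complexity(example))
```

Residues next to the repetitive stretch may also be masked, because any window dominated by the repeat falls below the threshold; this mirrors the remark above that filtering can occasionally hide authentic matches.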

#### **3.2.1 An example**


Consider the following query sequence:

C I N C I N N A T I (w=3, n=10, T=11, BLOSUM62 matrix)

where n is the length of the query sequence. The number of words of length 3 (w=3) is calculated as follows:

$$N = n - w + 1\tag{4}$$

For the given query sequence, N=8. The three-letter words of the query sequence, with their positions, are:

CIN(1), INC(2), NCI(3), CIN(4), INN(5), NNA(6), NAT(7), ATI(8)
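Equation (4) and the word list can be checked with a few lines of code. The helper name `words` is hypothetical, and positions are reported 1-based to match the text.

```python
def words(seq, w=3):
    """Return (1-based position, word) for every word of length w in seq."""
    n = len(seq)
    return [(i + 1, seq[i:i + w]) for i in range(n - w + 1)]  # N = n - w + 1 words

query_words = words("CINCINNATI")
print(len(query_words), query_words)
```

The same function applied to a target string of length 9 yields 7 words.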

Using the BLOSUM62 matrix, 54 of the 8000 key words in the hash table score 11 or greater when aligned with the word CIN, which occurs at positions 1 and 4:

CAN CCN CDN CEN CFN CGN CHN CIA CID CIE CIG CIH CIK CIM CIN CIP CIQ CIR CIS CIT CIY CKN CLD CLE CLG CLH CLK CLN CLQ CLR CLS CLT CMD CMH CMN CMS CNN CPN CQN CRN CSN CTN CVD CVE CVG CVH CVK CVN CVQ CVR CVS CVT CWN CYN

Similarly, only three words score 11 or greater when aligned with ATI at position 8. Overall, preprocessing of the query sequence fills 204 entries of the 8000 possible keys. After the preprocessing stage, the next step is to scan the target string (reference sequence) for exact matches to any of the words in the query index. Suppose the target string is the following sequence:

P R E C I N C T S


For this sequence N=7, and the three-letter words with their positions are:

PRE(1), REC(2), ECI(3), CIN(4), INC(5), NCT(6), CTS(7)

Looking up NCT at position 6 of the target string, the search generates the hits (3,6) and (7,6). This means that words similar to the word at position 6 of the target sequence occur at positions 3 and 7 of the query sequence. After the matches have been located, each hit is extended to the right and to the left to increase the alignment score. The extension continues until the overall alignment score reaches its maximum. In this example, the corresponding alignment for the hit at query position 3 and target position 6 is:


- - - c i N C I n n a t i

p r e c i N C T S - - - -

Hence, the final local alignment is:

C I N C I N

C I N C T S

The score of this local alignment is calculated as follows:

$$S_{CC} + S_{II} + S_{NN} + S_{CC} + S_{IT} + S_{NS} = 9 + 4 + 6 + 9 + (-1) + 1 = 28$$

The other hit, at query position 7 and target position 6, is:

```
c i n c i n N A T I
- p r e c i N C T s
```

The score of this alignment cannot be increased by extending it further to either the left or the right (Dwyer, 2003).
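Both extensions can be reproduced with a small ungapped-extension routine. The sketch below rests on stated assumptions: it is not the BLAST source code, it walks to the end of the shorter sequence and keeps the best-scoring extension instead of applying a score drop-off cutoff, and `BLOSUM62_SUBSET` lists only the BLOSUM62 entries for residue pairs that occur in this example. All function names are hypothetical.

```python
# BLOSUM62 entries for the residue pairs met in this example (keys sorted).
BLOSUM62_SUBSET = {
    "CC": 9, "II": 4, "NN": 6, "TT": 5,
    "IT": -1, "NS": 1, "AC": 0, "IS": -2,
    "IN": -3, "CI": -1, "CE": -4, "NR": 0, "IP": -3,
}

def pair_score(a, b):
    """Symmetric look-up: the key is the two residues in alphabetical order."""
    return BLOSUM62_SUBSET["".join(sorted(a + b))]

def best_extension(query, target, i, j, step):
    """Walk from (i, j) in direction step; return (best gain, residues added)."""
    best = run = 0
    best_len = length = 0
    while 0 <= i < len(query) and 0 <= j < len(target):
        run += pair_score(query[i], target[j])
        length += 1
        if run > best:
            best, best_len = run, length
        i += step
        j += step
    return best, best_len

def extend_hit(query, target, q_pos, t_pos, w=3):
    """Ungapped extension of a hit given by 1-based word positions."""
    q0, t0 = q_pos - 1, t_pos - 1
    seed = sum(pair_score(query[q0 + k], target[t0 + k]) for k in range(w))
    left, n_left = best_extension(query, target, q0 - 1, t0 - 1, -1)
    right, n_right = best_extension(query, target, q0 + w, t0 + w, +1)
    q_seg = query[q0 - n_left:q0 + w + n_right]
    t_seg = target[t0 - n_left:t0 + w + n_right]
    return seed + left + right, q_seg, t_seg

print(extend_hit("CINCINNATI", "PRECINCTS", 3, 6))
print(extend_hit("CINCINNATI", "PRECINCTS", 7, 6))
```

For the hit (3,6) the seed NCI/NCT gains residues on both sides, whereas for the hit (7,6) every neighbouring pair lowers the score, so the seed NAT/NCT is kept unchanged.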

#### **4. Representation of different substitution matrices**

#### **4.1 Amino acid substitution matrices**

The sequences of proteins and nucleic acids change through mutations that occur over the course of evolution. Amino acids are substituted by other amino acids, and these substitutions cause variations in the phenotype of the related species. Some regions of a sequence undergo extensive mutation, while other regions remain conserved over long periods of evolutionary time. An alignment reveals the conserved regions in related protein sequences, which reflect the functions of the proteins (Campanella et al., 2003). It also shows that certain amino acid substitutions occur commonly in related proteins from different species. Substituted amino acids are compatible with protein structure and function and are chemically similar to the amino acids they replace. Some substitutions are common and others are rare. Sequence alignment is therefore a useful tool for understanding the types of change that have occurred in related protein sequences. Different substitution matrices, such as PAM and BLOSUM, have been built from the observed types of substitution. Substitution matrices are used in sequence alignment, and they are themselves built by aligning carefully selected sequences. The PAM and BLOSUM substitution matrices are described in detail below.

#### **4.1.1 PAM (point accepted mutation) matrices**


Margaret Dayhoff (1978) developed a method for determining the most likely amino acid changes that occurred during evolution by assessing the ancestral relationships among groups of proteins (Kim & Kececioglu, 2008). The analysis was based on multiple sequence alignments of 34 superfamilies of closely related proteins, grouped into 71 phylogenetic trees (including cytochrome c, hemoglobin, myoglobin, virus coat proteins, chymotrypsinogen, glyceraldehyde 3-phosphate dehydrogenase, clupeine, insulin and ferredoxin). The protein groups studied ranged from very well conserved (such as histones and glutamate dehydrogenase) to proteins with a high rate of point mutations (such as immunoglobulin chains and carrier proteins). To create the mutation data matrix (MDM), the sequences of all nodal common ancestors in each tree were inferred from the multiple sequence alignment of each protein family by taking the most frequent amino acids. The matrix of accepted point mutations was calculated for each protein family separately from the phylogenetic tree inferred for that family. In this matrix it was assumed that the likelihood of amino acid X replacing Y is the same as that of Y replacing X, and hence 1 was entered in cell YX as well as in cell XY (Dayhoff, 1972). Under this symmetry assumption, the frequency of occurrence of an amino acid in any large group of proteins remains relatively constant with time. The accumulated accepted point mutation matrix for closely related sequences was generated by summing the corresponding elements of the separate matrices computed for the individual protein families.

Next, the relative mutability of the 20 amino acids was calculated for each protein family. Relative mutability was calculated as the number of observed changes of an amino acid divided by its frequency of occurrence in the aligned sequences. Mutability was normalized with respect to the basic unit of evolutionary distance, a single accepted point mutation in a sequence of length 100. Consequently, the average relative mutability of an amino acid is the total number of changes observed for this amino acid in all the protein families studied, divided by the sum, over all branches of all the family trees, of the local frequency of occurrence of the amino acid multiplied by the number of mutations per 100 residues in that branch. The mutation probability matrix was then constructed (Fig. 9). An element of this matrix, $M_{ij}$, gives the probability that the amino acid in column *j* is replaced by the amino acid in row *i* after a given evolutionary interval. The values of the non-diagonal elements of this matrix were computed by the following equation (Dayhoff, 1972):

$$M_{ij} = \frac{\lambda m_j A_{ij}}{\sum_i A_{ij}} \tag{5}$$

where $A_{ij}$ is an element of the accepted point mutation matrix, $\lambda$ is a proportionality constant, and $m_j$ is the mutability of the *j*th amino acid. The values of the diagonal elements are calculated as follows:

$$M_{jj} = 1 - \lambda m_j \tag{6}$$

In the mutation probability matrix, the non-diagonal terms within each column stand in the same ratio to one another as the observed mutations in the mutation data matrix. The proportionality constant $\lambda$ is the same for all columns of the matrix and is calculated by the following equation for an evolutionary interval of 1 PAM, in which 1% of the amino acids have changed (Higgs & Attwood, 2005):

$$\lambda = 0.01 \frac{N_{\text{tot}}}{A_{\text{tot}}} \tag{7}$$

where $N_{\text{tot}}$ is the total number of amino acids in the data set and $A_{\text{tot}}$ is the sum of all elements of the $A$ matrix.

In the mutation probability matrix, the diagonal elements are all slightly less than one and the off-diagonal elements are very small. The number of amino acids that remain unchanged when a 100-residue protein sequence of average composition is exposed to this evolutionary change is computed as follows:

$$100 \times \sum_{i} f_i M_{ii} \tag{8}$$

where $f_i$ is the normalized frequency of amino acid *i*.


Fig. 9. Mutation probability matrix for the evolutionary distance of 2PAMs. Each element of this matrix gives the probability that the amino acid in column *j* will be replaced by the amino acid in row *i* after a given evolutionary interval, in this case 2 accepted point mutations per 100 amino acids. Values are multiplied by 10000 for convenience.
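Equations (5) to (8) can be checked numerically on a toy example. Everything in the sketch below is an assumption made for illustration: the alphabet has only three residue types, the counts in `A` and `N` are invented, $A_{\text{tot}}$ is taken as the sum of all entries of `A`, and the mutability is simplified to the number of observed changes of a residue divided by its number of occurrences.

```python
# Toy data: three residue types, all counts invented for illustration.
A = [[0, 6, 2],
     [6, 0, 4],
     [2, 4, 0]]        # A[i][j]: accepted point mutations between i and j (symmetric)
N = [500, 300, 200]    # occurrences of each residue type in the data set

n = len(N)
N_tot = sum(N)
A_tot = sum(sum(row) for row in A)
col_sum = [sum(A[i][j] for i in range(n)) for j in range(n)]

m = [col_sum[j] / N[j] for j in range(n)]   # simplified relative mutability
lam = 0.01 * N_tot / A_tot                  # Eq. (7)

M = [[0.0] * n for _ in range(n)]
for j in range(n):
    for i in range(n):
        if i != j:
            M[i][j] = lam * m[j] * A[i][j] / col_sum[j]   # Eq. (5)
    M[j][j] = 1.0 - lam * m[j]                            # Eq. (6)

f = [N[i] / N_tot for i in range(n)]                      # normalized frequencies
unchanged = 100 * sum(f[i] * M[i][i] for i in range(n))   # Eq. (8)
print(round(unchanged, 6))
```

With this choice of $\lambda$, each column of `M` sums to one and 99 of 100 residues remain unchanged, which is the defining property of a 1 PAM interval.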


Other PAM matrices are calculated by multiplying the PAM1 matrix by itself the number of times indicated by the PAM number (e.g. the PAM250 matrix is obtained by raising the PAM1 matrix to the 250th power). A scoring system has been developed for converting the elements of a PAM mutation probability matrix into a scoring matrix, or log-odds matrix, as follows (Pevsner, 2003a):

$$S(a,b) = 10\log_{10}\left(\frac{M_{ab}}{P_b}\right) \tag{9}$$

where $M_{ab}$ is the probability that the aligned pair of amino acid residues *a* and *b* represents an authentic alignment and $P_b$ is the normalized frequency representing the probability that residue *b* was aligned by chance.
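The construction of higher PAM matrices and the log-odds conversion of Eq. (9) can be sketched on a two-letter toy alphabet. The matrix values, the frequencies `P` and all function names are invented for illustration. The code stores the matrix in the column-to-row convention of Eq. (5), and it assumes that $M_{ab}$ in Eq. (9) denotes the probability that residue *a* is replaced by residue *b*, which corresponds to the entry in row *b* and column *a*.

```python
import math

def mat_mul(X, Y):
    """Product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, power):
    """M raised to an integer power by repeated multiplication."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, M)
    return result

# Toy "PAM1": entry [i][j] is the probability that residue j is replaced by i.
PAM1_TOY = [[0.99, 0.02],
            [0.01, 0.98]]
P = [2 / 3, 1 / 3]            # background frequencies consistent with PAM1_TOY
PAM50_TOY = mat_pow(PAM1_TOY, 50)

def log_odds(M, P, a, b):
    """Eq. (9), reading M_ab as the entry in row b, column a (assumption)."""
    return 10 * math.log10(M[b][a] / P[b])

print(round(log_odds(PAM50_TOY, P, 0, 0), 2), round(log_odds(PAM50_TOY, P, 0, 1), 2))
```

In this toy case conserved pairs receive positive scores and substitutions receive negative scores, and the matrix columns still sum to one after exponentiation.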

The PAM1 matrix has been recalculated (Jones et al., 1992) using an updated data set called PET91. This data set was generated from Release 15.0 of the SWISS-PROT protein sequence database (Bairoch, 1996), which contains 16941 sequences. A mutation data matrix has also been calculated for transmembrane proteins (Jones et al., 1993). This matrix was found to be very different from matrices calculated from general sequence sets, which are biased towards water-soluble globular proteins. The differences are discussed in the context of the specific structural requirements of membrane-spanning segments. Calculating the mutation data matrix for each protein family, and hence creating a specific PAM scoring matrix for each family, will help to improve the accuracy of protein sequence alignment results.

#### **4.1.2 BLOSUM matrices (Blocks Amino Acid Substitution Matrices)**

BLOSUM scoring matrices are a widely used alternative to the PAM matrices for scoring protein sequence alignments. They are derived from the BLOCKS database, which stores sequence alignments of the most conserved regions of protein families (S. Henikoff & J.G. Henikoff, 1996). This database consists of over 500 groups of local multiple alignments of distantly related proteins. The matrix values are computed from the amino acid substitutions observed in a large set of about 2000 conserved amino acid patterns; the blocks constructed from these patterns in each protein family are ungapped multiple alignments.

When BLOSUM matrices are derived from the blocks, the sequences are not all counted separately: sequences whose percentage of identity exceeds a chosen threshold are merged and counted as one (S. Henikoff & J.G. Henikoff, 1992). A different BLOSUM matrix is therefore produced for each threshold. For example, the BLOSUM62 matrix (Fig. 10) is derived by merging sequences that share at least 62% identity, and the BLOSUM80 matrix by merging sequences that share at least 80% identity. The BLOSUM62 matrix is useful for scoring proteins that share less than 62% identity. Increasing the clustering percentage was found to increase the ability of the resulting matrix to distinguish actual from random alignments.

The numbers associated with BLOSUM matrices do not have the same interpretation as those for PAM matrices. BLOSUM matrices with smaller numbers represent greater evolutionary distances, while BLOSUM matrices with higher numbers represent smaller evolutionary distances. BLOSUM matrices are thus based on an entirely different type of sequence analysis, and on a much larger data set, than Dayhoff's PAM matrices, although the scores themselves are obtained by a similar log-odds procedure. BLOSUM scores, however, are calculated as 2 times the log base 2 of the odds ratio, as follows:


$$S(a,b) = 2\log\_2\left(\frac{M\_{ab}}{P\_b}\right) \tag{10}$$
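The two log-odds formulas, Eqs. (9) and (10), differ only in the base of the logarithm and the scaling factor. The following minimal Python sketch evaluates both for the same odds ratio. The function names and the input probabilities are hypothetical and chosen purely for illustration; rounding to the nearest integer is an assumption made here because published matrices list integer scores.

```python
import math


def pam_score(m_ab: float, p_b: float) -> int:
    """Eq. (9): 10 * log10(M_ab / P_b), rounded to an integer (assumed)."""
    return round(10 * math.log10(m_ab / p_b))


def blosum_score(m_ab: float, p_b: float) -> int:
    """Eq. (10): 2 * log2(M_ab / P_b), rounded to an integer (assumed)."""
    return round(2 * math.log2(m_ab / p_b))


# Hypothetical values: the pair is observed four times more often than
# expected by chance (odds ratio 4), or five times less often (odds ratio 0.2).
for m_ab, p_b in [(0.20, 0.05), (0.05, 0.05), (0.01, 0.05)]:
    print(m_ab / p_b, pam_score(m_ab, p_b), blosum_score(m_ab, p_b))
```

In both scales an odds ratio above 1 gives a positive score, a ratio of exactly 1 gives zero, and a ratio below 1 gives a negative score.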

#### **4.1.3 Comparison of the PAM and BLOSUM amino acid substitution matrices**

The methods used to compute the PAM and BLOSUM scoring matrices differ in several important respects. The PAM matrices are based on a mutational model of evolution which assumes that each amino acid change at a given position is independent of the previous changes at that position (a Markov model).

Fig. 10. BLOSUM62 scoring matrix.

By predicting the phylogenetic tree of the sequences studied in each protein family, the early changes that occur as proteins diverge from a common ancestor during evolution are identified. Matrices for more distantly related proteins are then derived by extrapolating these short-term changes. In contrast, BLOSUM matrices are derived from all the changes observed in an aligned region of a related family of proteins, without considering the overall similarity between the protein sequences. Since the proteins in a family are known to be related biochemically, they are assumed to derive from a common ancestor. In general, the PAM model is designed to track the evolutionary origins of proteins, whereas the BLOSUM model is designed to find their conserved domains. Both use log-odds values in their scoring systems (Vogt et al., 1995).

#### **4.2 Nucleic acid scoring matrices**


Scoring matrices exist for DNA sequence alignments just as they do for amino acid alignments. A series of nucleic acid PAM matrices is calculated in a similar way to the amino acid PAM scoring matrices. To derive DNA PAM matrices, a PAM1 mutation matrix representing 99% sequence conservation, i.e. 1% mutation, over the evolutionary distance is calculated first. The calculation assumes that the four nucleotides occur with equal frequencies in the sequences studied and that all mutations from any nucleotide to any other are equally likely. Thus, the four diagonal elements of the PAM1 matrix, which represent no change, are equal to 0.99, and all other elements, which represent changes, are equal to 0.00333. As with the amino acid matrices, the values of this substitution matrix are converted into log-odds scoring matrices that represent the frequency of substitutions expected at increasing evolutionary distances. When DNA alignments are scored, the lower-numbered DNA PAM matrices are used for more similar DNA sequences and the higher-numbered DNA PAM matrices for more divergent DNA sequences (States et al., 1991).
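The construction described above is small enough to sketch directly. The following Python example builds the uniform DNA PAM1 matrix (0.99 on the diagonal, 0.01/3 ≈ 0.00333 elsewhere), raises it to the *n*-th power to obtain a PAM*n* matrix, and converts the result to log-odds values against a background frequency of 0.25. It is only an illustration under stated assumptions: the function names are hypothetical, and the base-2 logarithm is a choice made here, since the text above does not fix the units of the DNA scores.

```python
import math

NUCLEOTIDES = "ACGT"


def pam1_dna(conserved: float = 0.99):
    """Uniform PAM1 matrix: 'conserved' on the diagonal, the rest split evenly."""
    off = (1.0 - conserved) / 3.0  # 0.00333... for each possible change
    return [[conserved if i == j else off for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(n: int):
    """PAMn mutation probability matrix: PAM1 multiplied by itself n times."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    base = pam1_dna()
    for _ in range(n):
        result = mat_mul(result, base)
    return result


def log_odds(matrix, background: float = 0.25):
    """Log-odds scores (base 2, an assumption) against equal base frequencies."""
    return [[math.log2(m / background) for m in row] for row in matrix]


for n in (1, 50, 100):
    scores = log_odds(pam_n(n))
    print(f"PAM{n}: match {scores[0][0]:.2f}, mismatch {scores[0][1]:.2f}")
```

As the PAM number grows, the match score shrinks and the mismatch penalty becomes less severe, which is why higher-numbered matrices suit more divergent sequences.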

#### **5. Statistical analysis of alignments**

One of the main challenges in sequence similarity searches is to determine whether a similarity identified between DNA or protein sequences is statistically significant. For two proteins that are very similar and clearly belong to the same family, assessing the significance is not necessary. However, when two sequences show no clear similarity, statistical analysis of the alignment becomes important: biologists would like to know whether the observed similarity is authentic or was obtained by chance. A statistical test helps to distinguish distantly related protein or DNA sequences from unrelated ones. Such tests were originally based on the assumption that alignment scores follow a normal distribution. To evaluate the distribution of alignment scores, random sequences are generated by sequence shuffling. Analysis of the alignment scores of these random sequences reveals that the scores do not follow a normal distribution but the Gumbel extreme value distribution (Altschul & Gish, 1996; Altschul et al., 1994). In general, the statistics of local alignment scores are better understood than those of global alignments. Because the Smith-Waterman algorithm reports conserved or closely matching regions with a positive score, such regions are rarely found in alignments of random or unrelated sequences. Their presence in real sequences is therefore significant, since the probability that they occur by chance is close to zero.

The p-value is the probability that a score of S or greater is obtained by chance between two unrelated sequences of similar composition and length. A very low p-value therefore corresponds to a significant match: the score is unlikely to have occurred by chance and more probably reflects a real biological or evolutionary relationship. A more common statistical parameter, reported by most software to quantify the statistical significance of an identified similarity, is the E-value or expected value. The E-value is the number of alignments with a score of at least S that are expected to occur by chance. P- and E-values can be calculated for two matched sequences separately (the probability of obtaining a score at least as high between the two sequences by chance) or for database similarity searches. It must be noted that even when the p-value for two matched sequences is low, the E-value for a search of a large database can be quite large.

Assessing the statistical significance of a global alignment is very difficult, because a global alignment performed with the Needleman-Wunsch algorithm and a suitable scoring system produces many different alignments with quite similar scores. Even random or unrelated sequences can obtain very high scores when aligned globally, which shows that the global algorithm tends to match as many characters as possible. Despite this difficulty, a method has been developed for assessing the significance of a Needleman-Wunsch global alignment score. In this test, random or unrelated sequences are created by shuffling the reference sequence(s), and the query sequence(s) is aligned against these random sequences in pairwise fashion. The mean of the resulting alignment scores is then compared with the score of the real alignment by means of a Z-score, assuming that the overall distribution of the randomized scores is normal:

$$z = \frac{x - \mu}{\sigma} \tag{11}$$

where *x* is the score of the two aligned sequences, μ is the mean score of many randomized sequence comparisons, and σ is the standard deviation of the scores obtained with the random sequences. To evaluate the statistical significance of the alignment, the Z-score is related to a probability value. If, for example, all of 100 random alignments score lower than the authentic alignment, the p-value is estimated to be less than 0.01, i.e. the probability that the score occurred by chance is less than 0.01. The studied sequences are then considered significantly related.
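The shuffling test can be sketched in a few lines of Python. The example below is a minimal illustration rather than a reference implementation: it assumes a simple match/mismatch scheme with a linear gap penalty in place of a substitution matrix, and the function names, parameter values and test sequence are all hypothetical.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score (score only, linear gaps)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffled(seq, rng):
    """Return a random permutation of seq, preserving its composition."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def z_score(query, reference, n_shuffles=200, seed=0):
    """Eq. (11): (x - mu) / sigma, with mu and sigma taken from shuffled references."""
    rng = random.Random(seed)
    x = global_score(query, reference)
    random_scores = [global_score(query, shuffled(reference, rng))
                     for _ in range(n_shuffles)]
    mu = statistics.mean(random_scores)
    sigma = statistics.stdev(random_scores)
    if sigma == 0:
        raise ValueError("all shuffled scores are identical; Z-score undefined")
    return (x - mu) / sigma


# Hypothetical sequence aligned against itself: the real score should lie
# well above the scores obtained with shuffled versions.
seq = "ACGTTGCAAGCTTAGGCTA"
print(z_score(seq, seq))
```

Shuffling preserves the length and composition of the reference sequence, so the random scores reflect only the loss of residue order.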

The statistical significance of local alignment scores, whether for two sequences or for a sequence searched against a database (as in the BLAST and FASTA algorithms), is evaluated using the E-value. In a local alignment, the high-scoring segment pairs (HSPs) are identified. For the BLAST algorithm, the E-value is the most important statistic in the output; it describes the number of hits expected to occur by chance. Statistical evaluation of local alignments is somewhat similar to that of global alignments, but the scores of random sequence alignments follow an extreme value distribution, which resembles a normal distribution with a positively skewed tail in the higher score range. The goal is to evaluate the probability of obtaining, from a random alignment, a score equal to or higher than that of the real sequences of interest. The E-value is calculated as follows:

$$E = Kmne^{-\lambda S} \tag{12}$$

where K is a constant, *m* is the effective length of the query sequence, *n* is the effective length of the random sequences, λ is the scaling factor, and *S* is the score that reflects the similarity of each pairwise comparison. The parameters K and λ are described by Karlin and Altschul (1990). They were calculated by aligning 10000 random amino acid sequences of variable lengths using the Smith-Waterman method with a combination of a scoring matrix and a suitable set of gap penalties for that matrix. The values of K and λ were then estimated for each combination by fitting the data to the predicted extreme value distribution, as reported in Table 1.
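Eq. (12) is straightforward to evaluate once K and λ are known. The short Python sketch below shows how the E-value grows in proportion to the search space *m* × *n* and falls exponentially with the score. The function name and all numerical values are illustrative assumptions: the K and λ used here are placeholders and are not taken from Table 1.

```python
import math


def e_value(score: float, m: int, n: int, k: float, lam: float) -> float:
    """Eq. (12): E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


# Placeholder parameters, assumed for illustration only.
K, LAM = 0.13, 0.32
QUERY_LEN = 250

# The same score evaluated against one sequence and against a large database.
for db_len in (250, 100_000_000):
    print(db_len, e_value(score=45, m=QUERY_LEN, n=db_len, k=K, lam=LAM))
```

With these numbers, a score that is clearly unexpected in a single pairwise comparison is expected to occur by chance several times in the large search space, which is why a match with a low pairwise p-value can still have a large E-value in a database search.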


In database similarity searches, a threshold is set for the E-value or p-value, and sequence similarities whose values fall below the threshold are considered significant. Sequences with significant similarities are called "hits". Based on the results of the search, the database is divided into two subsets, hits (positives) and non-hits (negatives). Conceptually, these subsets are further divided into true and false positives and true and false negatives. A true positive is a hit that reflects a real biological relationship to the query sequence, while a false positive is a hit without such a relationship. A true negative is a non-hit with no biological relationship to the query sequence, and a false negative is a non-hit that does have a biological relationship to it in reality.
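The bookkeeping described above can be sketched as follows. The Python example classifies each database sequence as a true or false positive or negative at a given E-value threshold, and then computes the two ratios defined in Eqs. (13) and (14) of this section. The sequence names, E-values and the set of truly related sequences are hypothetical; note that the functions assume at least one hit and at least one related sequence, otherwise the ratios are undefined.

```python
def classify_hits(e_values, related, threshold):
    """Count tp, fp, tn and fn for a search at the given E-value threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for name, e in e_values.items():
        is_hit = e < threshold          # significant similarity -> "hit"
        is_related = name in related    # real biological relationship
        if is_hit and is_related:
            counts["tp"] += 1
        elif is_hit:
            counts["fp"] += 1
        elif is_related:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def sensitivity(c):
    """Eq. (13): n_tp / (n_tp + n_fn)."""
    return c["tp"] / (c["tp"] + c["fn"])


def specificity(c):
    """Eq. (14): n_tp / (n_tp + n_fp)."""
    return c["tp"] / (c["tp"] + c["fp"])


# Hypothetical search results and hypothetical ground truth.
E_VALUES = {"seqA": 1e-30, "seqB": 1e-5, "seqC": 0.5, "seqD": 3.0, "seqE": 0.7}
RELATED = {"seqA", "seqB", "seqC"}

for threshold in (0.01, 1.0):
    c = classify_hits(E_VALUES, RELATED, threshold)
    print(threshold, c, sensitivity(c), specificity(c))
```

In this toy data set, relaxing the threshold from 0.01 to 1.0 recovers the related sequence that was missed, but also admits one unrelated sequence as a hit.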


Table 1. Statistical parameters (K, λ) based on different scoring matrices and different suitable affine gap penalties.

The results of a database search are evaluated by two complementary measurements, known as sensitivity and specificity (Westhead et al., 2002). The sensitivity (*S<sub>n</sub>*) is the proportion of the real biological relationships in the database that are detected as hits, and is calculated as follows:

$$S\_n = \frac{n\_{tp}}{(n\_{tp} + n\_{fn})} \tag{13}$$

where *n<sub>tp</sub>* is the number of true positives and *n<sub>fn</sub>* is the number of false negatives. The specificity (*S<sub>p</sub>*) of the search is the proportion of hits that correspond to real biological relationships, and is obtained as follows:

$$S\_p = \frac{n\_{tp}}{(n\_{tp} + n\_{fp})} \tag{14}$$

where *n<sub>fp</sub>* is the number of false positives. Ideally, both sensitivity and specificity would be as close as possible to 1, but in practice this is not achievable. Raising the threshold is likely to increase the sensitivity (more true positives and fewer false negatives) but to decrease the specificity (more false positives). Hence, there is a trade-off between these two quantities. It should be noted that sensitivity and specificity can be analyzed only if the real biological relationships in the database are already known, so that the categories of true and false positives and negatives can be established. These categories are established from experimental determination of protein structure and function (Karlin et al., 1991; Pearson, 1998).

#### **6. Summary and conclusion**

This chapter has described sequence alignment methods and their basic concepts. Alignment is the tool for inferring homology. Two types of sequence alignment, global and local, which align a query sequence against a reference sequence, were studied. Both methods are guaranteed to find an alignment with the highest score for a given choice of scoring matrix. Scoring matrices such as PAM and BLOSUM are computed from substitution matrices, and phylogeny is the core data upon which substitution matrices are constructed. Evolutionary relationships between sequences arise because species undergo mutations over time. In mutated species, the amino acid sequences of proteins change so that some residues are substituted by other, biochemically similar residues. Substitution matrices are built upon the close examination and quantitative analysis of these mutations, as described extensively in this chapter. Aligning a query sequence against a database relies on the same basic concepts as aligning two sequences (query vs. reference), but faster algorithms are needed. The modified Smith-Waterman algorithms (BLAST and FASTA) were presented for this purpose and described in detail. Finally, to evaluate the statistical significance of the resulting alignments, these algorithms use parameters such as P- and E-values, which distinguish real relationships from random ones. The statistical analysis of alignment outputs was also discussed in detail.

#### **7. References**

Altschul, S.F. (1991). Amino acid substitution matrices from an information theoretic perspective. *J. Mol. Biol.*, Vol. 219, No. 3, (June 1991), pp. (555-565)

Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W. & Lipman, D.J. (1990). Basic Local Alignment Search Tool. *J. Mol. Biol.*, Vol. 215, (15 May 1990), pp. (403-410)

Altschul, S.F. & Gish, W. (1996). Local alignment statistics. *Methods Enzymol.*, Vol. 266, pp. (460-480)

Altschul, S.F.; Boguski, M.S.; Gish, W. & Wootton, J.C. (1994). Issues in searching molecular sequence databases. *Nat. Genet.*, Vol. 6, pp. (119-129)

Altschul, S.F.; Madden, T.L.; Schäffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W. & Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. *Nucleic Acids Res.*, Vol. 25, No. 17, (July 1997), pp. (3389-3402)

Bairoch, A. & Apweiler, R. (1996). The SWISS-PROT Protein Sequence Data Bank and its New Supplement TREMBL. *Nucleic Acids Research*, Vol. 24, No. 1, pp. (21-25)

Berezin, C.; Glaser, F.; Rosenberg, J.; Paz, I.; Pupko, T.; Fariselli, P.; Casadio, R. & Ben-Tal, N. (2003). ConSeq: the identification of functionally and structurally important residues in protein sequences. *Bioinformatics*, Vol. 20, No. 8, (September 2003), pp. (1322-1324)

Brenner, S.E. (1998). Practical database searching. *Trends Guide to Bioinformatics*, Vol. 16, No. 1, (November 1998), pp. (9-12)

Campanella, J.J.; Bitincka, L. & Smalley, J. (2003). MatGAT: An application that generates similarity/identity matrices using protein or DNA sequences. *BMC Bioinformatics*, Vol. 4, No. 29, (10 July 2003)

Chothia, C. & Lesk, A.M. (1986). The relation between the divergence of sequence and structure in proteins. *Embo J.*, Vol. 5, No. 4, (April 1986), pp. (823-826)

Dayhoff, M.O. (1972). *Atlas of Protein Sequence and Structure Vol. 5*, Silver Spring, USA

Dardel, F. & Kepes, F. (2006). Sequence comparison, In: *Bioinformatics: Genomics and Post-Genomics*, pp. (25-50), John Wiley & Sons, ISBN13: 978-0-470-02001-2, USA

Doolittle, R.F. (1981). Similar amino acid sequences: chance or common ancestry. *Science*, Vol. 214, No. 4517, (October 1981), pp. (149-159)

Edgar, R.C. (2009). Optimizing substitution matrix choice and gap parameters for sequence alignment. *BMC Bioinformatics*, Vol. 10, No. 396, (2 December 2009)

Edgar, R.C. & Sjolander, K. (2004). COACH: profile–profile alignment of protein families using hidden Markov models. *Bioinformatics*, Vol. 20, Issue 8, pp. (1309–1318)

Feng, D.F. & Doolittle, R.F. (1987). Progressive sequence alignment as prerequisite to correct phylogenetic trees. *J. Mol. Evol.*, Vol. 25, No. 4, pp. (351-360)

Feng, D.F.; Johnson, M.S. & Doolittle, R.F. (1985). Aligning amino acid sequences: Comparison of commonly used methods. *J. Mol. Evol.*, Vol. 21, No. 9, pp. (112-125)

Fitch, W.M. (1966). An improved method of testing for evolutionary homology. *J. Mol. Biol.*, Vol. 16, No. 1, (March 1966), pp. (9-16)

Fitch, W.M. (1970). Distinguishing homologous from analogous proteins. *Syst. Zool.*, Vol. 19, No. 2, (June 1970), pp. (99-113)

Fitch, W.M. (1970). An improved method for determining codon variability in a gene and its application to the rate of fixation of the mutations in evolution. *Biochem. Genet.*, Vol. 4, No. 5, (October 1970), pp. (579-593)

Fitch, W.M. & Smith, T.F. (1983). Optimal sequence alignments. *Proc. Natl. Acad. Sci.*, Vol. 80, No. 5, (March 1983), pp. (1382-1386)

Gibbs, A.J. & McIntyre, G.A. (1970). The Diagram, a method for comparing sequence. Its use with amino acid and nucleotide sequences. *Eur. J. Biochem.*, Vol. 16, No. 1, (September 1970), pp. (1-11)

Henikoff, S. & Henikoff, J.G. (1992). Amino acid substitution matrices from protein blocks. *Proc. Natl. Acad. Sci.*, Vol. 89, (November 1992), pp. (10915-10919)

Henikoff, S. & Henikoff, J.G. (1996). Blocks Database and its Applications. *Methods Enzymol.*, Vol. 266, pp. (88-105)

Pearson, W.R. & Lipman, D.J. (1988). Improved tools for biological sequence comparison. *Proc. Natl. Acad. Sci. USA*, Vol. 85, No. 8, (April 1988), pp. (2444-2448)

Pearson, W.R. & Miller, W. (1992). Dynamic programming algorithm for biological sequence

Pearson, W.R.; Wood, T.; Zhang, Z. & Miller, W. (1997). Comparison of DNA sequences with protein sequences. *Genomics*, Vol. 46, No. 1, (November 1997), pp. (24-36)

Pevsner, J. (2003a). Pairwise Sequence Alignment, In: *Bioinformatics and Functional Genomics*,

Pevsner, J. (2003b). Basic Local Alignment Search Tool, In: *Bioinformatics and Functional Genomics*, pp. (87-126), John Wiley & Sons, ISBN: 0-471-21004-8, USA

Reeck, G.R.; Haën, C.; Teller, D.C.; Doolittle, R.F.; Fitch, W.M.; Dickerson, R.E.; Chambon, P.; McLachlan, A.D.; Margoliash, E.; Jukes, T.H. & Zuckerkandl, E. (1987). Homology in Proteins and nucleic acids: A terminology muddle and a way out of it. *Cell*, Vol.

Reese, J.C. & Pearson, W.R. (2002). Empirical determination of effective gap penalties for sequence comparison. *Bioinformatics*, Vol. 18, No. 11, (April 2002), pp. (1500-1507)

Dwyer, R.A. (2003). Local Alignment and the BLAST Heuristic, In: *Genomic Perl from Bioinformatics Basics to Working Code*, pp. (93-108), Cambridge University Press,

Sankoff, D. (1972). Matching sequences under deletion/insertion constraints. *Proc. Natl.*

Smith, T.F. & Waterman, M.S. (1981). Identification of Common Molecular Subsequences. *J.*

Smith, T.F. & Waterman, M.S. (1981a). Comparison of bio-sequences. *Adv. Appl. Math.*, Vol.

Smith, T.F.; Waterman, M.S. & Fitch, W.M. (1981b). Comparative bio-sequence metrics. *J.*

Smoot, M.E.; Guerlain, S.A. & Pearson, W.R. (2003). Visualization of near-optimal sequence alignments. *Bioinformatics*, Vol. 20, No. 6, (July 2003), pp. (953-958)

States, D.J. & Boguski, M.S. (1991). Similarity and homology. In: *Sequence Analysis Primer*, (ed. Gribskov, M. & Devereux, J.), pp. (92-124), Stockton Press, New York

States, D.J.; Gish, W. & Altschul, S.F. (1991). Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. *Methods*, Vol. 3, pp. (66-70)

Tramontano, A. (Ed(s). Etheridge, A.M.; Gross, L.J.; Lenhart, S.; Miani, P.K.; Ranganathan, S.; Safer, H.M. & Voit, E.O.). (2006). *Introduction to Bioinformatics*, Chapman & Hall/

Vogt, G.; Etzold, T. & Argos, P. (1995). An Assessment of Amino Acid Exchange Matrices in Aligning Protein sequences: The Twilight Zone Revisited. *J. Mol. Biol.*, Vol. 249, pp.

Wen, Z.N.; Wang, K.; Li, M.; Nie, F. & Yangc, Y. (2005). Analyzing functional similarity of protein sequences with discrete wavelet transform. *Computational Biology and*

Westhead, D.R.; Parish, J.H. & Twyman, R.M. (2002). *Bioinformatics*. 2nd Edition, BIOS

comparison. *Method Enzymol.*, Vol. 210, PP. (575-601).

pp. (41-84), John Wiley & Sons, ISBN: 0-471-21004-8, USA

*Acad. Sci.*, Vol. 69, No. 1, (January 1972), pp. (4-6)

50, (August 1987), pp. (667)

ISBN: 0-521-80177, UK

*Mol. Biol.*, Vol. 147, pp. (195-197)

2, No. 4, (December 1981), pp. (482-489)

*Mol. Evol.*, Vol. 18, No. 1, pp. (38-46).

CRC, ISBN: 1-58488-569-6, UK

*Chemistry*, Vol. 29, pp. (220–228)

Scientific Publisher, ISBN: 1- 85996-272-6, UK

(816–831).


Higgs, P.G. & Attwood, T.K. (2005). Model of Sequence Evolution, In: *Bioinformatics and* 

Jones, D.T.; Taylor, W.R. & Thornton, J.M. (1992). The rapid generation of mutation data matrices from protein sequences. *CABIOS*, Vol. 8, No. 3, pp. (275-282). Jones, D.T.; Taylor W.R. & Thornton, J.M. (1993). A mutation data matrix for transmembrane

Karlin, S. & Altschul S.F. (1990). Methods for assessing the statistical significance of

Karlin, S.; Bucher, P. & Brendel, P. (1991). Statistical methods and insights for protein and DNA sequences. *Annu. Rev. Biophys. Chem.*, Vol. 20, (June 1991), pp. (175-203) Kim, E. & Kececioglu, J. (2008). Learning scoring schemes for sequence alignment from

Larkin, M.A.; Blackshields, G.; Brown, N.P.; Chenna, R.; McGettigan, P.A..; McWilliam H.;

Mahdavi, M.A. (2010). Medical informatics: transition from data acquisition to data analysis

Mount, D.W. (2001a) Alignment of Pairs of Sequences, In: *Bioinformatics: Sequence and* 

Mount, D.W. (2001b). Database Searching For Similar Sequences, In: *Bioinformatics: Sequence* 

Needleman, S.B. & Wunsch, C.D. (1970). A general method applicable to the search for

Pearson, W.R. (1990). Rapid and sensitive sequence comparison with FASTP and FASTA.

Pearson, W.R. (1991). Searching protein sequence libraries: Comparison of the sensitivity

Pearson, W.R. (1995). Comparison of methods for searching protein sequence databases.

Pearson, W.R. (1996). Effective protein sequence comparison. *Methods Enzymol.*, Vol. 266, pp.

Pearson, W.R. (1998). Empirical statistical estimates for sequence similarity searches. *J. Mol.* 

Pearson, W.R. (2000). Flexible sequence similarity searching with FASTA3 program package.

proteins. *FEBS Letters*, Vol. 339, pp. (269-275)

*Sci.*, Vol. 87, No. 6, (March 1990), pp. (2264-2268)

UK

2008), pp. (546-556)

597-8, USA

(443-453)

(227-258)

87969-597-8, USA

21, (November 2007), pp. (2947-8)

*Bioinformatics,* Vol. 4, No. 2, pp. (158-174)

*Method Enzymol.,* Vol. 183, pp. (63-98)

No. 3, (November 1991), pp. (635-650)

*Biol.*, Vol. 276, No. 1, (February 1998), pp. (71-84)

*Methods Mol. Biol.*, Vol. 132, No. 2, pp. (185-219)

*Protein Sci.*, Vol. 4, pp. (1145-1160)

*Molecular Evolution*, pp. (58-80), Blackwell Science Ltd, ISBN: 978-1-4051-0683-2,

molecular sequence features by using general scoring schemes. *Proc. Natl. Acad.* 

partial examples. *IEEE/ACM Trans Comput. Biol. Bioinform.*, Vol. 5, No. 4, (June

Valentin, F.; Wallace, I.M.; Wilm, A.; Lopez, R.; Thompson, J.D.; Gibson, T.J. & Higgins, D.G. (2007). ClustalW and ClustalX version 2.0. *Bioinformatics*, Vol. 23, No.

by means of bioinformatics tools and resources. *Int. J. Data Mining and* 

*Genome Analysis*, pp. (53-137), Cold Spring Harbor Laboratory Press, ISBN: 0-87969-

*and Genome Analysis*, pp. (282-335), Cold Spring Harbor Laboratory Press, ISBN: 0-

similarities in the amino acid sequence of two proteins. *J. Mol. Biol.,* Vol. 48, pp.

and selectivity of the Smith-Waterman and FASTA algorithms. *Genomics*, Vol. 11,



**13**

### Tom Burr

*Los Alamos National Laboratory USA* 

#### **1. Introduction**


Viruses are an important cause of human disease, often because they are highly transmissible from human to human. A key tool from population genetics that can be applied to the study of viruses is coalescent theory. Coalescent theory predicts genealogical tree shapes as a function of how the studied organisms are evolving. Therefore, under its model assumptions, coalescent theory can be used to infer aspects of the demographic history of evolving organisms. For example, there are characteristics of tree shapes that imply whether the organism population has been constant, growing, or shrinking in size over time.

This chapter reviews some of the successes of coalescent theory in the context of inferring aspects of virus evolution, using human immunodeficiency virus (HIV) and influenza viruses as case studies. Next, the chapter describes limitations of coalescent theory, even as extended to allow some forms of selection, population subdivision, and viral recombination. The relatively new goal to predict influenza virus evolution (rather than infer past evolution) is used to emphasize modeling needs beyond standard or extended coalescent theory models. A new small-scale simulation that combines viral fitness with demographic population structures such as family and work groups is then described as an example extension to coalescent theory models.

Prediction goals include early detection of highly lethal new strains and improved vaccine designs that anticipate future evolutionary directions. Regardless of which evolutionary model is used to predict virus evolution, there will be substantial model error, because real virus evolution is complex beyond current understanding. Model error, model parameter estimation error, and purely random effects can combine to make some forecast goals unattainable. In these cases the most appropriate prediction is similar to what is often said about stock markets: there will be change.

#### **2. Background**

Humans are susceptible to many viral pathogens, including the human immunodeficiency virus (HIV) and the influenza virus. Although it is a relatively new goal in population genetics, predicting virus evolution can help with vaccine design and with other mitigation strategies (Bush et al., 1999; Ferguson and Anderson, 2002; Plotkin et al., 2002; Rambaut et al., 2008). Using estimated phylogenetic (genealogical) tree shapes to infer aspects of evolution such as organism growth rates has received far more attention to date (Felsenstein et al., 1999; Innan and Stephan, 2000; Pybus et al., 2000; Stephens and Donnelly, 2000; Ewing et al., 2004).


This chapter focuses on HIV and the influenza virus in the context of what might be predicted about virus evolution. Influenza is a highly transmissible disease that infects millions each year, resulting in many deaths. HIV is also transmissible, through risky behaviors, and it too results in many deaths each year.

In population genetics, coalescent theory (Kingman, 1982; Stephens and Donnelly, 2000; Burr et al., 2001; Ewing et al., 2004) is a key tool that predicts genealogical tree shapes as a function of how the studied organisms (taxa) are evolving. Therefore, under its model assumptions, coalescent theory can be used to infer aspects of the demographic history of evolving organisms. For example, there are characteristics of tree shapes that imply whether the organism population has been constant, growing, or shrinking in size over time (Pybus et al., 2000).

This chapter will first review some of the successes of coalescent theory in the context of inferring aspects of virus evolution, using HIV (Rodrigo et al., 1999; Burr et al., 2001; Rambaut et al., 2001) and influenza viruses (Ferguson and Anderson, 2002; Plotkin et al., 2002; Burr et al., 2002) as case studies. Next, the chapter describes limitations of coalescent theory, even as extended to allow some forms of serial sampling, selection, population subdivision, and viral recombination (Excoffier and Foll, 2011). The relatively new goal to predict influenza virus evolution (rather than infer past evolution) is used to emphasize modeling needs beyond standard or extended coalescent theory models. A new small-scale simulation that combines viral fitness with demographic population structures such as family and work groups is then described as an example extension to coalescent theory models.

Most genetic data analyses rely on a forward model that specifies evolutionary forces and associated probabilities describing how offspring are generated. Evolutionary forces include drift, mutation, recombination, migration, and selection. Drift refers to random change over successive generations due to finite population sizes. In the absence of mutation and selection, the fraction of a population of size *N* having a given trait drifts randomly, somewhat like the number of heads in a set of *N* coin tosses. Mutations are changes in the DNA sequence that occur for many reasons. Recombination ("reassortment" in the case of influenza) refers to sections of genome that are broken and then recombined, resulting in large genetic differences between offspring and parents, and complicating phylogenetic analyses because different genome sections can have different genealogies. Migration refers to exchange of genetic material among partly isolated subpopulations. There are many other evolutionary forces, too many to describe here.
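The coin-toss picture of drift can be made concrete with a short simulation. The sketch below is only an illustration: it assumes the standard Wright-Fisher resampling scheme (which this chapter does not name), and the function name, parameter values, and seed are hypothetical.

```python
import random


def wright_fisher_drift(pop_size, init_freq, generations, seed=0):
    """Track the fraction of a population carrying a trait under pure drift.

    Assumes no mutation, no selection, and a constant population size.
    """
    rng = random.Random(seed)
    carriers = round(pop_size * init_freq)
    trajectory = [carriers / pop_size]
    for _ in range(generations):
        p = carriers / pop_size
        # Each of the N offspring draws its parent at random, so the number
        # of carriers in the next generation is like N biased coin tosses.
        carriers = sum(1 for _ in range(pop_size) if rng.random() < p)
        trajectory.append(carriers / pop_size)
    return trajectory


# Hypothetical values, chosen only for illustration.
traj = wright_fisher_drift(pop_size=100, init_freq=0.5, generations=50)
print(traj[:5])
```

Once the fraction reaches 0 or 1 it stays there, which is why drift alone eventually removes variation from a finite population.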

For simulating DNA sequences from a population, the state of the art invokes coalescent theory, which uses simplified models of the forward evolutionary process. These simplifications allow inverse analytical solutions and corresponding simulation software, but with questionable assumptions. This is done in order to avoid having to simulate directly from the forward model and track the evolutionary histories. Sample genealogies can instead be simulated by running time from the present toward the past and tracking probabilistically when lineages coalesce to share a common ancestor. An example genealogy of a sample taken at a single time from a population that is maintaining a constant population size is given in Figure 1. These coalescent-based simulated sample units are then used to infer how a population is evolving using features of the associated phylogenetic tree (Pybus et al., 2000). In addition, analytical approximations used in inference invoke the same model assumptions used in coalescent theory.
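The backward-in-time idea can be sketched in a few lines. The code below is a minimal illustration, not the software referred to in the text. It assumes the basic constant-size coalescent with a haploid scaling in which *k* lineages coalesce at rate *k(k-1)/(2N)* per generation; the function names and parameter values are hypothetical.

```python
import random


def simulate_coalescent_times(n_samples, pop_size, seed=0):
    """Waiting times (in generations) between successive coalescent events.

    Time runs from the present toward the past; the population size is
    assumed constant, and recombination and selection are ignored.
    """
    rng = random.Random(seed)
    times = []
    k = n_samples
    while k > 1:
        # With k lineages there are k(k-1)/2 pairs, each coalescing at
        # rate 1/N per generation under the assumed scaling.
        rate = k * (k - 1) / (2.0 * pop_size)
        times.append(rng.expovariate(rate))
        k -= 1  # one coalescent event merges two lineages into one
    return times


def expected_tmrca(n_samples, pop_size):
    """Expected time to the most recent common ancestor under the same model."""
    return 2.0 * pop_size * (1.0 - 1.0 / n_samples)


times = simulate_coalescent_times(n_samples=10, pop_size=100, seed=1)
print(sum(times), expected_tmrca(10, 100))
```

Summing the waiting times gives the time back to the most recent common ancestor of the sample. Only the sampled lineages are tracked, which is what makes the approach cheaper than simulating the whole forward process.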


Agent-based models (Eubank et al., 2004) provide a richer framework than classical epidemiology models of disease spread, such as the susceptible-infected-recovered (SIR) model. Because agent-based models track individual rather than aggregate behavior, they are believed to more reliably predict, for example, the impact of candidate mitigation strategies such as vaccinations and isolation. In an analogous way, we describe predictions for virus evolution that probably will require a higher-fidelity modeling framework than coalescent theory and its extensions.
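For contrast with agent-based approaches, the aggregate nature of the SIR model can be seen in a minimal sketch. It assumes the textbook SIR equations integrated with a simple Euler step; the function name and all parameter values are hypothetical and chosen only for illustration.

```python
def simulate_sir(n_total, n_infected, beta, gamma, days, dt=0.1):
    """Integrate the aggregate SIR equations with a simple Euler step.

    Only three population totals are tracked (susceptible, infected,
    recovered); no individual behavior or contact structure is represented.
    """
    s, i, r = float(n_total - n_infected), float(n_infected), 0.0
    history = [(s, i, r)]
    for _ in range(int(round(days / dt))):
        new_infections = beta * s * i / n_total * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


# Hypothetical transmission (beta) and recovery (gamma) rates per day.
history = simulate_sir(n_total=1000, n_infected=1, beta=0.5, gamma=0.2, days=100)
s_end, i_end, r_end = history[-1]
print(round(s_end), round(i_end), round(r_end))
```

Because the state is just three numbers, a strategy such as isolating particular households or workplaces cannot be expressed directly in this model, which is the kind of gap agent-based models aim to fill.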

The following sections present HIV and influenza examples of using coalescent theory to infer aspects of prior evolution, describe limitations of coalescent theory for inferring future HIV and influenza evolution, introduce the new small-scale simulation that combines viral fitness with demographic features, and discuss limitations of our current ability to predict viral evolution.

Fig. 1. The most recent common ancestor and sample genealogy from an evolving population. At time *t* a sample is collected, and at time 0 in the past, all individuals in the sample coalesce to share a common ancestor. For simplicity here, the population size is assumed to be constant over time, with each individual at time 0 represented by a dot (not all individuals are shown).

#### **3. HIV and influenza examples**

#### **3.1 Example 1: HIV**

**Within host.** A coalescent model within individuals has been applied (Rodrigo et al., 1999) to analyze HIV-1 viral load data from infected individuals after the administration of an HIV-1 inhibitor, in order to estimate the HIV generation time in vivo. The estimate was 1-2 days, which agreed well with an estimate based on a different approach (Perelson et al., 1996), although it assumed nonrecombining DNA sequences from a population of constant effective size *N*. Consider samples from two sample times, separated by *d* days. The number of days per generation is estimated as *dn(n-c)/(2Nc)*, where *c* is the number of coalescent events that have occurred and *d* is the number of days between samples. The method assumed: (a) the population size *N* is constant; (b) the estimated phylogeny is the same as the true genealogy of the sampled individuals; and (c) the exchangeability assumption that each individual virus has the same propensity to reproduce. Implicit in (b) is the further assumption that recombination, migration, and selection do not interfere with the ability to estimate the true phylogeny. Further, an approximate technique for accommodating serial samples is required, which has recently become available (Excoffier and Foll, 2011).
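The generation-time estimator in the within-host example can be written as a small function. This is a sketch under stated assumptions: the text does not define *n*, so it is read here as the number of lineages in the later sample. Under the constant-size coalescent, the expected number of generations for *c* coalescent events among *n* lineages is then *2Nc/(n(n-c))*. The function name and example values are hypothetical.

```python
def days_per_generation(d, n, c, N):
    """Estimate days per generation as d * n * (n - c) / (2 * N * c).

    d: days between the two sample times
    n: number of lineages in the later sample (an assumed reading of n)
    c: number of coalescent events observed between the sample times
    N: constant effective population size
    """
    if not 0 < c < n:
        raise ValueError("need 0 < c < n coalescent events")
    # Expected generations needed for c coalescent events among n lineages.
    generations_elapsed = 2.0 * N * c / (n * (n - c))
    return d / generations_elapsed


# Hypothetical inputs, for illustration only (not data from the cited study).
print(days_per_generation(d=30, n=10, c=5, N=100))  # 1.5
```

The estimate scales inversely with *N*, so any error in the assumed effective population size carries straight through to the estimated generation time.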

**Between host.** An example (Burr et al., 2001) involves whether the 8 to 10 approximately equidistant subtypes of HIV-1 (type M) could have arisen under available models of how HIV is evolving (Fig. 1). To examine this, coalescent theory was used to simulate DNA data from a very simplified forward model of how HIV is evolving at both the macro and micro levels (see Section 4.1). This provided a reference distribution against which to compare the data. If features of the observed data (such as the ratio of the between-subtype to within-subtype genetic distance) are in the tail of the coalescent-theory-based reference distribution of those same features, then the forward model used to simulate the data is not credible. Examples of phylogenetic trees estimated from coalescent-based simulated data are given in the right box of four subplots in Figure 2. Notice that subtypes are expected to arise in examples (b), (c), and (d), but not in (a). Subplot (a) is the classic "star phylogeny" that arises when the underlying population size is growing rapidly, forcing most coalescent events to occur early in the growth period, and all at nearly the same time.
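The comparison against a reference distribution can be sketched as follows. This is an illustrative outline only, not the procedure of Burr et al. (2001): the choice of the upper tail, the add-one correction, the function names, and the toy numbers are all assumptions.

```python
from itertools import combinations


def between_within_ratio(dist, labels):
    """Ratio of mean between-subtype to mean within-subtype distance."""
    between, within = [], []
    for i, j in combinations(range(len(labels)), 2):
        (within if labels[i] == labels[j] else between).append(dist[i][j])
    if not between or not within:
        raise ValueError("need both within- and between-subtype pairs")
    return (sum(between) / len(between)) / (sum(within) / len(within))


def upper_tail_probability(observed, reference):
    """Fraction of simulated statistics at least as large as the observed one.

    The add-one correction keeps the estimate away from exactly zero.
    """
    exceed = sum(1 for s in reference if s >= observed)
    return (exceed + 1) / (len(reference) + 1)


# Toy distance matrix for four sequences in two hypothetical subtypes.
dist = [[0, 1, 3, 3],
        [1, 0, 3, 3],
        [3, 3, 0, 1],
        [3, 3, 1, 0]]
observed = between_within_ratio(dist, ["A", "A", "B", "B"])

# Stand-in for ratios computed from coalescent-simulated data sets.
reference = [1.0, 1.2, 0.9, 1.5, 2.0, 1.1, 1.3, 0.8, 1.4]
print(observed, upper_tail_probability(observed, reference))  # 3.0 0.1
```

In practice the reference list would hold the same statistic computed on many simulated data sets; a small tail probability then suggests that the forward model behind the simulations is not credible for the observed data.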

Fig. 2. **HIV, env region.** Consensus trees (of 100 bootstrap samples) using maximum likelihood for real HIV (env gene) sequences (the left plot) and for coalescent-based simulated sequences (the four right plots) under different assumptions about the time behavior of the number of infecteds *N*. The simulated data use four macro growth rates: (a) $N = N_0 e^{rt}$; (b) $N = N_0$; (c) $N = N_0$, then $N = N_0 e^{rt}$; (d) *N* is quadratic from 1970 to 1990.

#### **3.2 Example 2: Influenza**

Figure 3 shows a principal coordinate (PC) plot (Venables and Ripley, 1999) of a 129-by-129 distance matrix based on the nucleoprotein (NP) region of 129 influenza viruses isolated from humans, swine, and avian hosts. PCs provide a low-dimensional way to represent a distance matrix. For these data, all pairs of distances can be quite accurately reproduced using only the first two PCs, as in Figure 3. It is known that the NP region maintains a type of "species signature" such as depicted in Figure 3 (Burr et al., 1999, 2002; Chen et al., 2006). A key aspect of influenza evolution is that avian and swine hosts occasionally act as "reassortment vessels" for human influenza, resulting in dramatically different strains that evade effective human immune response. As an aside, the term "reassortment" seems to be applied only to influenza, presumably because its genome consists of eight distinct segments. For our purposes, "reassortment" is the same as recombination, in which sections of the genome get recombined (Forrest and Webster, 2010).
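How a distance matrix is reduced to two principal coordinates can be sketched with classical multidimensional scaling. The code below is a minimal pure-Python illustration, not the software used for Figure 3. It assumes the leading eigenvalues of the double-centered matrix are positive (true for near-Euclidean distances) and finds them by power iteration; the function names and the four toy points are hypothetical.

```python
import math
import random


def double_center(dist):
    """Return B = -1/2 * J D^2 J for a symmetric distance matrix D."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]  # row means (equal to column means)
    total = sum(row) / n
    return [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
            for i in range(n)]


def top_eigenpair(mat, rng, iters=500):
    """Largest-magnitude eigenvalue and unit eigenvector by power iteration."""
    n = len(mat)
    v = [rng.random() - 0.5 for _ in range(n)]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # nothing left to extract along this direction
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    w = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, w)), v  # Rayleigh quotient, vector


def pcoa(dist, n_axes=2, seed=0):
    """Principal coordinates: eigenvectors scaled by sqrt(eigenvalue)."""
    rng = random.Random(seed)
    b = double_center(dist)
    n = len(b)
    coords = [[0.0] * n_axes for _ in range(n)]
    for axis in range(n_axes):
        lam, v = top_eigenpair(b, rng)
        if lam <= 1e-12:
            break
        for i in range(n):
            coords[i][axis] = math.sqrt(lam) * v[i]
        # Deflate so the next pass finds the next eigenpair.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


# Toy example: four points on a rectangle, so two axes reproduce all distances.
pts = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
dist = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in pts] for p in pts]
coords = pcoa(dist)
print(coords)
```

For real sequence data the input would be the matrix of pairwise evolutionary distances, and how well the first two coordinates reproduce those distances can be checked in the same way as for the toy rectangle.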

Figure 4 shows a PC plot of a distance matrix based on the hemagglutinin (HA) region of influenza viruses. Figure 5 is a phylogenetic tree of the same HA sequences, built using neighbor joining (Swofford et al., 2000).

Figures 4 and 5 illustrate (Nelson and Holmes, 2007) that the HA region appears to display the effects of positive selection, as suggested by the cactus-like tree structure in which most lineages die out. This cactus shape is unlike the classic "star-like" shape of HIV type M trees, as in Figure 2. Such a cactus shape can also arise without positive selection from a combination of serially sampled taxa and sequential random population bottlenecks (which can occur in influenza due to its strong seasonality). Therefore, the cactus shape by itself can indicate but does not prove that positive selection is in effect. More formally, the statistical notion of identifiability probably does not hold in this context. Model identifiability implies that as sample size increases toward infinity, model parameters can be uniquely estimated (see Section 6).

Fig. 3. Principal coordinate plot of the evolutionary distances among 129 influenza viruses extracted from human (H), avian (A), and swine (S) hosts. Distances are computed for the Nucleoprotein region of the virus, which exhibits species signatures. Among the 129, there are 14 "misidentified" taxa. However, the human (H) that is clustered with the avian group was known to have been infected by poultry. There are 44 avian, 57 human, and 28 swine viruses, all available from www.flu.lanl.gov.

Fig. 4. Principal coordinate plot of the influenza viruses (HA region) found in humans. Digit = year, Black = 1960s, Red = 1980s, Green = 1990s. Genetic drift and strain extinctions are known to occur (cactus shape of typical tree).

Fig. 5. Neighbor joining tree of the same HA sequences in humans that were used in Fig. 4.

Fig. 6. Genetic distance versus difference in time (yrs, top) and versus difference in space (arbitrary units, bottom) for the HA region.

#### **4. Limitations of coalescent theory for predicting HIV and influenza evolution**

Coalescent theory leads to tremendous insights and powerful simulation and inference tools. However, limitations of coalescent techniques include (a)-(d) as follows:

a. Little is known concerning accuracy and robustness of coalescent theory's restrictive assumptions in many settings, although some forward models are known not to be well approximated by any coalescent model (depending on the relative time scales of various evolutionary effects such as drift, migration, and selection) (Sjodin et al., 2005);

b. Inference methods (Pybus et al., 2000; Stephens and Donnelly, 2000) invoke coalescent approximations to estimate the probability of candidate branching orders as part of the inference process. This leads to the undesirable situation of forcing a zero mismatch between the inference method's assumptions and the assumptions regarding how the population is evolving;

c. Coalescent theory is expanding along with associated software for implementation, but no current coalescent-based software includes all extensions to the original coalescent theory. However, one new option (Excoffier and Foll, 2011) for coalescent-based software includes many of the standard evolutionary features such as serial sampling, recombination, and geographic isolation;

d. Building trees supports inferences regarding, for example, whether a virus strain appears to be a natural branch from historical strains, or whether the strain seems to have made an unnatural leap indicating bioengineering. However, key coalescent assumptions that are violated by both HIV and influenza viruses are that all subtypes are equally transmissible and that there is no recombination. Therefore, although extensions to coalescent theory have been made (to a limited extent and under restrictive assumptions) to accommodate recombination, selection, overlapping generations, and population subdivision, there are cases where the theory is either inadequate or the sensitivity of its conclusions to its assumptions is unknown. The corresponding inference quality using estimated trees is also unknown; the state of the art is therefore to quantify precision, but not accuracy.

The forward model is a key component of total uncertainty associated with population genetics inferences. The current approach is as follows. First, specify an amenable-to-coalescent-theory forward model for how a population is evolving that includes, for example, population size, structure, and selection effects. Next, identify the coalescent effective population size *Ne* (Sjodin et al., 2005) in the nearest available coalescent model, which is often a complicated task. Then, use the closest coalescent model to simulate sample genealogies under restrictive assumptions about the population and the sampling process. The Neffective notion arose from coalescent theory by mapping the actual population size *N* in a population that violates some coalescent assumptions (such as nonoverlapping generations) to a different size Neffective such that, in some aspects, the actual and model populations evolve probabilistically in approximately the same manner. Coalescent theory was originally applied to macroscopic populations such as plants and animals (for example, Innan and Stephan, 2000); it has also been applied to microscopic populations such as DNA sequences from virus populations (for example, Rodrigo et al., 1999).

Coalescent theory will continue to provide insight into evolutionary processes; however, it is currently unknown how robust associated inferences are with respect to model violations. For example, Innan and Stephan (2000) assumed that the wild plant Arabidopsis thaliana (useful for genetic studies because of its well-known demographic history and genome) consists of many isolated colonies, each having negligible genetic variation within the colony. They applied a coalescent model (correcting for the growing population size of A. thaliana) to simulate the probability distribution of Tajima's *D* statistic, against which to compare the observed *D* in real samples as a test for selection. Tajima's *D* statistic is based on the difference between two estimates of the amount of variation (one using the number of sites having genetic variation and the other based on pairwise differences between individuals). They concluded that there was evidence for selection (distinguishing the type of selection, such as balancing or purifying, is a separate challenge). Because of the simplifications inherent in the coalescent approach, it is currently unknown how robust the evidence for selection is in this case for A. thaliana.
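To make the two estimates behind Tajima's *D* concrete, the following is a minimal Python sketch (not the code used in any cited study) that computes *D* from a small aligned sample using the standard constants from Tajima's 1989 formulation. The function name `tajimas_d` and the convention of returning 0.0 when there is no variation are assumptions made here for illustration.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Return Tajima's D for a list of equal-length aligned sequences."""
    n = len(seqs)
    if n < 4:
        raise ValueError("need at least 4 sequences for a defined variance")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    # S: number of segregating (variable) sites.
    seg_sites = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    if seg_sites == 0:
        return 0.0  # assumption: report 0.0 when D is undefined (no variation)

    # pi: mean number of pairwise differences between sequences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs) / len(pairs)

    # Normalizing constants that depend only on the sample size n.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Difference between the two diversity estimates, scaled by its
    # estimated standard deviation.
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)
```

A sample dominated by rare variants gives a negative *D*, and a sample dominated by intermediate-frequency variants gives a positive *D*. Interpreting either as evidence of selection still requires a reference distribution, which is where the coalescent simulation enters.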

#### **4.1 Example 1: HIV**

Coalescent models of HIV reproduction within an individual might be adequate (Rodrigo et al., 1999); however, these would become prohibitively unwieldy if all HIV-infected humans were modeled. For example, all models must specify the macro components such as the reproductive rate in susceptible populations and/or subpopulations.

As mentioned in Section 3.1, an investigation into the development of the HIV subtypes led to an application of coalescent theory to model the population dynamics of HIV (Burr et al., 2001). Figure 2 illustrates the approach taken. Various features (such as the ratio of the between-subtype to within-subtype genetic distance) involving the subtypes of real HIV sequences (env gene) were compared to the same features in corresponding coalescent-based simulated data. However, it became apparent that it would be necessary to implement a model that made less restrictive assumptions than coalescent theory (Burr et al., 2001).


One possible new way to simulate sequences is to track each HIV case by geographic region including all known transmission routes such as sex, needles, blood transfusions, and mother-to-child, and track the genealogy of each case. One would then sample ~100 simulated sequences from around the world or in specified regions at a snapshot in time, or distributed in time, and distributed spatially in either case. With careful bookkeeping one could deduce the sample genealogy (which two samples coalesced first to their most recent common ancestor (MRCA), which samples coalesced next, etc.) back in time until all 100 sequences coalesced to the single MRCA. This would produce 99 coalescent times and sample identities, which define the genealogy of the sample. This genealogy could also be thought of as the true evolutionary tree for the sample to be compared to coalescent-based genealogies.
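The bookkeeping idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the proposed HIV simulation: it assumes a neutral, constant-size population with non-overlapping generations and uniform parent choice, and it ignores geography and transmission routes. The function name and all default values are hypothetical.

```python
import random


def simulate_sample_genealogy(pop_size=50, generations=2000,
                              sample_size=10, seed=1):
    """Forward-simulate parent pointers, then trace a sample back in time.

    Returns coalescence times (generations before sampling), one entry per
    lineage lost, so a fully coalesced sample of n yields n - 1 entries.
    """
    rng = random.Random(seed)

    # Forward pass: parents[g][i] is the parent (in generation g) of
    # individual i in generation g + 1. Uniform choice is a neutral assumption.
    parents = [[rng.randrange(pop_size) for _ in range(pop_size)]
               for _ in range(generations)]

    # Sample individuals from the final generation.
    lineages = set(rng.sample(range(pop_size), sample_size))

    # Backward pass over the recorded bookkeeping.
    times = []
    for back, gen in enumerate(reversed(parents), start=1):
        ancestors = {gen[i] for i in lineages}
        lost = len(lineages) - len(ancestors)
        times.extend([back] * lost)  # several lineages can merge at once
        lineages = ancestors
        if len(lineages) == 1:       # reached the MRCA of the sample
            break
    return times
```

With a sample of 10 the function returns 9 coalescence times once the MRCA is reached, mirroring the 99 times for a sample of 100 described above. A fuller version would also record which sample identities merge at each time.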

Related to the origin of the HIV subtypes is the goal of predicting the stability of the subtypes, because current vaccine design approaches rely on "mosaic" pseudo-HIV viruses that exploit the known characteristic or representative sequence of each subtype (Barouch et al., 2010). Although a few new subtypes have been defined since the original classification, the M clade trees with 8-10 subtypes have been remarkably stable over time (Korber and Myers, 1992; Burr et al., 2001). Both within-host and between-host modeling efforts should allow for multiple viral sequence types within hosts, because contemporaneous HIV sequences isolated within a host from various regions of the genome exhibit substantial variation, easily up to 10% differences.

#### **4.2 Example 2: Influenza**

Imagine a particularly bad flu season. Not only does it appear that more people are infected than normal by early November, but there are also anomalous deaths. Could this be a bio attack, or perhaps another human-to-human transmissible version of the swine-origin influenza A (H1N1) virus?

As Figure 6 suggests, there is empirical evidence that a time gap of three or more years is sufficient for a temporal signature. For example, strains isolated in 1993 should be genetically distinct from strains in 1996 or later (Burr et al., 1999). Therefore, we might be suspicious in 2012 if the strain looks like a 2008 strain. However, the empirical evidence assumes a constant population size because the genetic distance between two samples depends on the coalescent time since they evolved from the same ancestral sequence. And the time since the two samples shared a common ancestor depends on several factors, including how the population is structured and the size and growth rate of the population. If some of these factors change dramatically, then the three-year rule would become either shorter or longer. Currently, coalescent methods either hold these factors constant over time, or extensions to the approximations have not been implemented. Therefore, empirical reconstructions of phylogenetic trees such as those in Burr et al. (1999) are incomplete for assessing the robustness of candidate signatures. The corresponding inferences thus have unknown reliability.
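As a rough illustration of why the temporal signature depends on population size, consider a minimal worked sketch under assumptions that are ours rather than those of the cited studies: a neutral, constant-size haploid population of effective size $N_e$, a per-site substitution rate $\mu$ per generation, and two sequences sampled $\Delta t$ generations apart. The two lineages cannot coalesce before the older sampling time; after that, the waiting time $T$ to their common ancestor has mean $N_e$ generations. Ignoring multiple substitutions at the same site, the expected per-site distance is approximately

$$E[d] \approx \mu\,(\Delta t + 2\,E[T]) = \mu\,\Delta t + 2\mu N_e .$$

The first term is the temporal signal and the second is background diversity that scales with $N_e$. If $N_e$ changes, the time gap needed for the signal to stand out from the background changes with it.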

The Neffective concept is part of the success of coalescent theory, including in the influenza context. Bedford et al. (2010) estimate Neffective for influenza A using a coalescent model that includes subdivision and migration. Bayesian Evolutionary Analysis Sampling Trees (BEAST, Pybus et al., 2007) was used for the estimation, assuming that both *N* and the mutation rate are constant over time. In this example, it could be important that observed and reported influenza mutation rates need not be stable over time for several reasons, including the fact that Neffective changes with time.


In influenza, genome reassortment, selection, the presence of multiple strains and multiple hosts, and host immunity all complicate matters. Given what is known about influenza evolution, what might we predict today about influenza evolution? Two related prediction goals to consider for influenza are: (1) in a given year, predict which new strains are most likely to be in the surviving lineage, and (2) predict the prevalent strains in the next year, so that vaccine design can be most effective.

Concerning prediction goal (1), Bush et al. (1999) proposed a prediction method that tested whether influenza isolates on lineages having the most changes in positively selected codons were "more fit" than other isolates. At least 18 of the 329 AA codons in H3 HA1 are thought to experience positive selection, with mutations favoring new variants that can escape host immunity. An AA sequence was defined as "more fit" if it is more closely related to surviving lineages than another contemporary strain.

Concerning prediction goal (2), the World Health Organization (WHO) recommends three strains to target in the vaccine for each flu season. Plotkin et al. (2002) use non-hierarchical clustering to evaluate the number of HA1 sequences within each cluster over time. This leads to a sequence-based algorithm to choose vaccine strains; the recommended strains differed from the WHO recommendation in 9 of 16 years in the study period from 1985 to 2000. A limitation of the Plotkin et al. (2002) study is the biased sampling used by WHO, in which novel strains are deliberately overrepresented in the database.

The new small-scale agent-based simulation described in Section 5 addresses both prediction goals 1 and 2.

#### **5. New small-scale agent-based simulation for influenza**

In choosing/developing an evolutionary model it is of course important to consider the modeling goals. What are the prediction goals? How should the dynamic host/pathogen system be modeled? Which hosts should be included? Human, swine, avian, other? Is it sufficient to use a detailed model of a region such as New York state and a less detailed model of the outside region?

The basic susceptible-infected-recovered (SIR) model in classical epidemiology mathematically describes average population behavior using differential equations to move from *S* to *I* to *R*. This SIR model has been extended in various ways including structures such as contact groups and stochastic effects such as varying contact rates (Burr and Chowell, 2008, 2009). Figure 7 gives examples of different simulated outbreak shapes in a small population of 1000 individuals. The number of newly infected is plotted each day for the simulated data. The small population is either (a) an unstructured population with all individuals equally in contact with all other individuals; (b) a randomly generated network model in which individuals are only exposed to members of their own clique, but some individuals belong to multiple cliques; (c) a network model with cliques assigned to nodes in a lattice; or (d) a more realistic spatial network in which cliques belong to small geographic regions. A clique is a small group of individuals for which the contact probability is relatively high and assumed to be the same between each pair of members of the clique. Various signatures of non-homogeneity using the shape of typical outbreak curves were developed for such network models, which were then shown to be detectably different from outbreak curves from the basic SIR model with homogeneous individuals all equally mutually exposed (Burr and Chowell, 2008). One concept that arose in Burr and Chowell (2008) is that predictions of the total number of infected based on the basic SIR model were often quite wrong in the structured population. This is an example of using failed predictions to determine that more fidelity is needed than the SIR model provides.
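For reference, the basic homogeneous SIR model can be sketched as follows. This is a generic textbook formulation integrated with simple Euler steps, not the code behind Figure 7; the parameter names `beta` (transmission rate) and `gamma` (recovery rate) and their default values are assumptions chosen only for illustration.

```python
def simulate_sir(population=1000, initial_infected=1, beta=0.5, gamma=0.2,
                 days=160, steps_per_day=10):
    """Integrate the basic SIR equations with Euler steps.

    dS/dt = -beta * S * I / N
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I

    Returns a list of (S, I, R) tuples, one per day including day 0.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    dt = 1.0 / steps_per_day
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history
```

Because every individual is assumed to be in equal contact with every other, this model produces a single smooth outbreak curve. The structured populations (b)-(d) depart from that shape, which is what the signatures of non-homogeneity detect.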

There has been progress toward merging SIR models with viral fitness. Minayev and Ferguson (2009a, 2009b) extended SIR models by including two key notions: cross immunity to similar strains in a host that has been previously infected by a similar strain, and transient strain-transcending immunity.

At least two related empirical studies have been published, both previously mentioned (Bush et al., 1999; Plotkin et al., 2002). In Bush et al. (1999), the number of AA changes in a lineage appeared to convey a selective advantage in the following sense: the lineages in which the most AA changes occurred were more likely to be represented in the surviving lineages. That is, mutation conveys selective advantage, which is believed to be the case simply because the human host has some immunity to prior strains. At this point, "strain" will be defined following Plotkin et al. (2002) as arising from cluster analyses based on the Manhattan metric that counts the number of AA differences between pairs of sequences. With that definition of a strain, and using 2 AA changes as the threshold above which a new strain is defined, Plotkin et al. (2002) reported an empirical assessment of the number of strains by calendar year in influenza samples.
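The strain definition can be illustrated with a short Python sketch. It counts AA differences between aligned sequences and groups sequences that are within the threshold of one another. Single-linkage grouping is an assumption made here for simplicity; it is not necessarily the clustering procedure of Plotkin et al. (2002), and the function names are hypothetical.

```python
def aa_distance(seq_a, seq_b):
    """Count amino-acid positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def cluster_strains(seqs, threshold=2):
    """Group sequences into strains by single linkage (an assumption).

    Two sequences are linked when they differ by at most `threshold` AA
    changes; each connected group of linked sequences is one strain.
    Returns a sorted list of clusters, each a list of sequence indices.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if aa_distance(seqs[i], seqs[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

For example, `cluster_strains(["KTNGS", "KTNGT", "KTDGT", "RSDEA"])` places the first three toy sequences in one strain and the fourth in its own strain, because the fourth differs from each of the others by more than 2 AA changes.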

#### **5.1 Evidence of departures from standard models**

Graves and Picard (1999) report evidence of violations of the classic SIR model for influenza. Signatures of departure from SIR (Burr et al., 2006) characterize departures from the "one season fits all" assumption (which assumes each flu season occurs like clockwork, peaking in the winter during the same weeks, etc.) using a hierarchical model that captures year-to-year variation in baseline, peak onset, and duration. Burr and Chowell (2008) use a reference distribution of simulated outbreak curve shapes to assess whether a collection of simulated and real outbreak curves follow SIR-type models. On that basis, many real outbreaks do not follow SIR-type models.

#### **5.2 Description of the new small-scale simulation**

In the context of predicting viral evolution as considered here, the SIR-type model must be extended to include population demographics and characteristics of the virus. Minayev and Ferguson (2009a, b) develop one approach to include viral characteristics. The approach described in this section is similar, but is entirely stochastic and allows for demographic structure. With the present implementation, simulations with population sizes of approximately 10,000 complete in reasonable run times (tens of minutes), so the simulation is "small-scale."

Here is pseudo code to describe the new simulation. In some cases, parameter names such as "average.duration.of.infection" are used to clarify.

Pseudo-simulation code (Example R (R, 2004) code named flu1( ))

1. Initialize the population matrix and the matrix of AA sequence.

pop.matrix is *N* rows (individuals) and 30 columns with:

Column 1 is current age.

Column 2 is infection status (0 = susceptible, 1 = infected, 2 = recovered and not susceptible).

Column 3 is the number of times the individual has been infected.

Column 4 is the family group. Column 5 is the work group. Column 6 is the "other group." Column 7 is the time of first infection. Column 8 is an integer denoting the AA sequence of infection 1. Column 9 is the donor ID for infection 1.


Fig. 7. Example time series of newly infected individuals (number of newly infected individuals each day) in simulated SIR models of populations of 1000 individuals. (a) Basic SIR model in an unstructured population; (b) a randomly generated network model in which individuals are only exposed to members of their own clique, but some individuals belong to multiple cliques; (c) a network model with cliques assigned to nodes in a lattice; (d) a more realistic spatial network in which cliques belong to small geographic regions.

Columns 10-12 are the same as columns 7-9, but for the second infection by the individual. A maximum of 30 infections is allowed and then each new infection is recorded in the last 3 columns by writing over the previous data.

AA.seq.matrix begins with 2 rows (by default) and 329 AA sites (columns).

The default is to begin with two random but distinct AA sequences.

At each time step (the default time step is 1 day), any of several events can occur:

2. There is a probability for each individual to change status from *S* (0) to *I* (1), or from *I* (1) to *R* (2), or from *R* back to *S*.

Any susceptible (*S*) individual that an infected (*I*) individual contacts in the respective family, work, and other groups has a probability of infection determined by two parameters. First, there is a force of infection parameter for each of the three group types that characterizes how strongly individuals in that group type interact. Second, the similarity of the infected individual's current strain to the closest strain previously experienced by a given susceptible is computed and the cross-immunity function is calculated.

The value of the cross-immunity function in Minayev and Ferguson (2009b) alters the transmission probability accordingly. Cross-immunity modeled by this function decreases to zero as a smooth function of time (examples given below), with an average of 10 years of total immunity from identical strains. In addition, the values of .a and .b in the function can be altered to decrease or increase the degree of cross-immunity as a function of the Manhattan distance between strains. The cross-immunity concept is that *S* individuals who have had the strain of the potential donor *I* are less susceptible to infection. As Plotkin et al. (2002) describe, ideally a distance measure between two sequences should somehow reflect immunological properties of the corresponding viral proteins. Although steps have been taken in that direction (Lapedes and Farber, 2001), more research is required before similar metrics can be defensibly applied in modeling contexts such as our new small-scale simulation.

Any newly infected host will have the donor strain, but the simulation allows for mutation to a new strain. There are 329 H3 HA1 (Bush et al., 1999) amino acid (AA) sites, with one estimate of the effective mutation rate *Nμ* (where *μ* denotes the mutation rate) being 0.0057 nucleotide substitutions per site per year. Of the 329 AA sites, at least 18 have exhibited positive selection effects (Plotkin et al., 2002). Here we will not consider estimation error in *Nμ*, so the simulation default value is *Nμ* = 0.0057 × 3 = 0.0171 per AA site per year. A technical issue arises here because we use the actual population size *N* rather than the effective size *N*effective. It would be more appropriate to use *N*effective, but that value is currently unknown in the context of this model population. In future work, *N*effective could be defined and estimated on the basis of the number of distinct sequences observed during an outbreak.

If a newly infected host incurs any mutations, the new strain is added to AA.seq.matrix, increasing the number of rows by one. Columns in pop.matrix identify which strains each host has had in the sequence matrix of all strains ever experienced in the model population.

3. Any infected can recover.

The per-step recovery probability is 1/(average.duration.of.infection). The time to recovery is therefore a geometric random variable with average duration average.duration.of.infection.

4. Any recovered can lose immunity.

The time *t* from recovery to loss of immunity is random, with *t* ~ Normal(avg.time.to.immunity, *σ*). The default simulation values are avg.time.to.immunity = 10 years and *σ* = 1 year.

If at any time step (day) the number of infected individuals is 0, the infection would die out. Therefore, new infected individuals are reintroduced at a random time whose average is user-specified, with a default value of 1 year, representing the typical time gap between outbreaks.
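The transmission and recovery probabilities in steps 2 and 3 can be sketched in Python (used here instead of R, so dotted parameter names become underscored). This is a minimal sketch under stated assumptions, not the flu1 code: the functional form of the cross-immunity function is an assumption, chosen only because it reproduces the function values quoted in this section for .a = 0.1, .b = 0.99 and for .a = 0.5, .b = 0.8. The decay of cross-immunity over time is omitted, and the function names are hypothetical.

```python
def cross_immunity(distance, a, b):
    """Degree of cross-immunity for a given AA distance between strains.

    Assumed form (not taken from flu1): b * max(0, 1 - a * (distance - 1)).
    A distance of 0 (identical strain) is treated as total immunity.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if distance == 0:
        return 1.0
    return b * max(0.0, 1.0 - a * (distance - 1))


def transmission_probability(base_probability, distance, a, b):
    """Scale a baseline contact transmission probability by 1 - cross-immunity."""
    return base_probability * (1.0 - cross_immunity(distance, a, b))


def recovery_probability(average_duration_of_infection):
    """Per-step recovery probability; the time to recovery is then geometric."""
    return 1.0 / average_duration_of_infection


if __name__ == "__main__":
    # Cross-immunity by distance for one illustrative parameter pair.
    for d in range(1, 12):
        print(d, round(cross_immunity(d, a=0.1, b=0.99), 3))
    # Transmission probability for a hypothetical baseline of 0.2.
    print(transmission_probability(0.2, distance=1, a=0.1, b=0.99))
```

With .a = 0.1 and .b = 0.99, a strain one AA change away from a previously experienced strain has its transmission probability multiplied by roughly 0.01, which illustrates why these two parameters strongly affect outbreak size.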

Figure 8 plots the percent currently infected at each time step for one 7-year realization of 10,000 individuals. A key output of flu1 is the current strain of each infected individual at each time step. This allows us to consider the strategies in Bush et al. (1999) and in Plotkin et al. (2002) for prediction goals (1) and (2) described earlier in this section. Figure 9 plots four of the eight strains that emerged during the 7 simulated years. Following Plotkin et al. (2002), sequences were regarded as being the same strain if the number of AA differences among the 18 positively selected AA sites is 2 or fewer. Equivalently, sequences were regarded as being distinct strains if the number of AA differences is 3 or more. Figures 8 and 9 used the values .a = 0.4, .b = 0.95.
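The strain rule just described can be illustrated with a short Python sketch. The function names, the toy 10-residue sequences, and the site indices are hypothetical; the actual 18 positively selected sites are not reproduced here. The sketch only compares a pair of sequences, whereas Plotkin et al. (2002) define strains through a cluster analysis.

```python
def aa_differences(seq_a, seq_b, sites=None):
    """Count positions at which two aligned AA sequences differ.

    If `sites` is given, only those (0-based) positions are compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if sites is None:
        sites = range(len(seq_a))
    return sum(1 for i in sites if seq_a[i] != seq_b[i])


def same_strain(seq_a, seq_b, sites=None, threshold=2):
    """Two sequences are the same strain if they differ at `threshold` or fewer sites."""
    return aa_differences(seq_a, seq_b, sites) <= threshold


if __name__ == "__main__":
    reference = "ACDEFGHIKL"       # toy sequence, not real HA1 data
    variant_close = "ACDEFGHIKV"   # differs at position 9 only
    variant_far = "TCDQFGHWKL"     # differs at positions 0, 3 and 7
    selected_sites = [0, 3, 7, 9]  # stand-in for the positively selected sites
    print(same_strain(reference, variant_close, selected_sites))
    print(same_strain(reference, variant_far, selected_sites))
```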

Experimentation with flu1 to generate multiple realizations of outbreaks having identical parameter values allows us to examine the role of chance in our models. Experimentation with flu1 with different parameter values allows us to examine the effects of parameter changes. Small numerical experiments with flu1 to date have led to the following conclusions:



1. The values of .a and .b in the cross-immunity function are as critical to the size of each outbreak as the overall transmission probabilities within the family, work, and other groups. For example, the function takes values 0.990, 0.891, 0.792, 0.693, 0.594, 0.495, 0.396, 0.297, 0.198, 0.099, 0.000, … for distances of 1, 2, …, 11, …, respectively, for .a = 0.1, .b = 0.99, and takes values 0.8, 0.4, and 0.0 for distances of 1, 2, 3, …, respectively, for .a = 0.5, .b = 0.8. The modeled transmission probability is multiplied by 1 minus the function value, so for .a = 0.1, .b = 0.99 there is very little chance of a susceptible individual acquiring influenza from a host having an AA sequence that differs in only 1 position among 18 positions from a strain the susceptible has had within the duration of immunity (10 years on average, for example). This means that a single AA mutation can have dramatically different effects on transmission probability depending on .a and .b. Qualitatively, this is anticipated because if immunity to a new strain is very high in individuals with previous infection by a similar older strain, then the average outbreak size will be small if previous outbreaks due to the older strain were large.

2. The values of .a and .b are also critical to the typical number of strains maintained in the population and to whether the chance occurrence of a large number of mutations in a newly infected host will have a strong selective advantage by avoiding the collective immune experiences of available human hosts. It is currently unknown whether the observed number of strains could adequately provide a model such as flu1 with estimates of .a and .b (see Section 6).

3. As expected, the group structures can lead to outbreak shapes that differ from classic SIR outbreak shapes (Burr and Chowell, 2008). This is evident in comparing the outbreak shapes in Figure 8 to the SIR-model outbreak in Figure 7a, for example. The shapes in Figure 8 fall off very sharply, more like the spatial network in Figure 7d.

4. In population genetics, the composite parameter 2*N*effective*μ*, where *μ* is the mutation rate, determines the rate of genetic changes and the expected amount of diversity in a random sample (see Section 6). The *N*effective concept for influenza sequences was addressed in Bedford et al. (2010), but as mentioned in Section 4, observed genetic diversity is interpreted in the context of idealized evolutionary models that are amenable to coalescent theory. More experiments with flu1 are planned, and if possible, an approximate coalescent model as implemented in available software will be applied so that the adequacy of coalescent-based approximations can be evaluated in the context of simulated flu outbreaks. We caution that many genetic effects of influenza evolution are omitted from flu1 and from any available coalescent-based simulation.

Fig. 8. Example simulated outbreaks from flu1. The number currently infected is plotted by day for 7 years.

#### **6. Model Identifiability and Inference**

Model identifiability is a key statistical concept. A model is identifiable if its parameters can be accurately and precisely estimated as the sample size increases toward infinity. In population genetics, a key parameter that arises from coalescent theory considerations is the composite parameter 2*N*effective*μ*, where *μ* is the mutation rate, which determines the rate of genetic changes. Many studies address methods to estimate this composite parameter, but because *N*effective and *μ* enter as a product, they are confounded, leading to a lack of identifiability unless auxiliary data are used to separately estimate *N*effective or *μ* and a strict evolutionary clock is assumed, meaning that *μ* is constant over time and lineages.
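The confounding of *N*effective and the mutation rate can be made concrete with a small Python sketch. It relies on Watterson's standard coalescent result that the expected number of segregating sites in a sample of n sequences equals the composite parameter times the harmonic sum 1 + 1/2 + … + 1/(n − 1). The name `theta` for the composite parameter, the function names, and the numerical values are illustrative assumptions and are not quantities from flu1.

```python
def composite_parameter(n_effective, mutation_rate):
    """Composite parameter 2 * N_effective * mu (called theta here)."""
    return 2.0 * n_effective * mutation_rate


def expected_segregating_sites(theta, sample_size):
    """Watterson's expectation: theta * sum_{i=1}^{n-1} 1/i."""
    harmonic = sum(1.0 / i for i in range(1, sample_size))
    return theta * harmonic


if __name__ == "__main__":
    # Two very different (N_effective, mu) pairs with the same product.
    theta_large_pop = composite_parameter(n_effective=1000, mutation_rate=1e-4)
    theta_small_pop = composite_parameter(n_effective=100, mutation_rate=1e-3)
    for theta in (theta_large_pop, theta_small_pop):
        print(theta, expected_segregating_sites(theta, sample_size=20))
```

Both parameter pairs give the same expected diversity, so a summary of the data that depends only on the product cannot separate population size from mutation rate without auxiliary information.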

For a particular evolutionary model, inference is possible using Bayesian evolutionary analysis, for example, by using BEAST (Pybus et al., 2000) that relies on Markov Chain Monte Carlo, resulting in a posterior distribution on model parameters. The Bayesian approach allows one to repeatedly sample from the posterior probability of model parameters, and then generate hypothetical future genetic data for each set of model parameter values. This approach provides an envelope of possible future multivariate time series of genetic data from each sampled subject. It is computationally challenging even for a given model of evolution.

In Burr and Chowell (2008), using simulated data from models with demographic structure but no host immunity or viral strain information, predictions from SIR models with parameters estimated from the early portion of an outbreak were often badly wrong. Such large prediction errors can indicate model violations, perhaps eventually leading to more appropriate models. To our knowledge, using prediction quality to assess model adequacy in this context is new. However, it is possible that multiple wrong models provide adequate predictions, so model identifiability remains a research topic in this area.

Fig. 9. Percent infected by day with strain 1, 2, 3 or 4 for the same simulated 7 years as in Figure 8.

#### **7. Conclusions/summary**

Coalescent theory and its success in some contexts at inferring aspects of past virus evolution were described. Then the argument was made that relatively new goals to predict aspects of virus evolution will require higher fidelity modeling than is anticipated to be available via coalescent theory or its extensions.

As a step toward high fidelity modeling, a small-scale agent-based simulation was described and example results presented. For influenza, two prediction goals were considered: (1) in a given year, predict which new strains are most likely to be in the surviving lineage, and (2) predict the prevalent strains in the next year, so that vaccine design can be most effective. The new small-scale simulation code flu1 in R can provide insight into the feasibility of meeting these goals, but it too makes restrictive modeling assumptions with unknown accuracy. A related concept was described that involves using prediction quality on simulated data that follows an assumed model to assess whether prediction performance on corresponding real data indicates model violations. Model violations that are evident from poor prediction quality can help prioritize future upgrades to the models.

#### **8. References**

Barouch, D. et al. (2010). Mosaic HIV-1 Vaccines Expand the Breadth and Depth of Cellular Immune Responses in Rhesus Monkeys, *Nature Medicine* 16, 319-323.

Bedford, T.; Cobey, S.; Beerli, P., & Pascual, M. (2010). Global Migration Dynamics Underlie Evolution and Persistence of Human Influenza A (H3N2), *PloS Pathogens* 6(5), 1-9, 2010.

Burr, T.; Skourikhine, A.; Bruno, W., & Macken, C. (1999). Confidence Measures for Evolutionary Trees: Applications to Molecular Epidemiology. *Proc. IEEE Inter. Conf. on Information, Intelligence and Systems, Genetics and Evolution Section*; 107-114.

Burr, T. (2000). Quasi-Equilibrium Theory for the Distribution of Rare Alleles in a Subdivided Population: Justification and Implications. *Theoretical Population Biology*, 57(3): 297-306.

Burr, T.; Myers, G., & Hyman, J. (2001). The Origin of AIDS – Darwinian or Lamarckian? *Phil. Trans. R. Soc. Lond. B* 356:877-887.

Burr, T.; Gattiker, J., & LaBerge, G. (2002). Genetic Subtyping using Cluster Analysis. *Special Interest Group on Knowledge Discovery and Data Mining Explorations* 3:33-42.

Burr, T.; Gattiker, J., & Gerrish, P. (2003). An Investigation of Error Sources and Their Impact in Estimating the Time to the Most Recent Ancestor of Spatially and Temporally Distributed HIV Sequences. *Statistics in Medicine* 22(9):1495-1516.

Burr, T.; Graves, T.; Klaman, R.; Michalek, S.; Picard, R., & Hengartner, N. (2006). Accounting for Seasonal Patterns in Syndromic Surveillance Data for Outbreak Detection, *BioMedCentral, Medical Informatics and Decision Making*, 6:40.

Burr, T., & Chowell, G. (2008). Signatures of non-homogeneous mixing in disease outbreaks. *Mathematical and Computer Modelling* 48:122-140, 2008.

Burr, T., & Chowell, G. (2009). The Reproduction Number R(t) in Structured and Nonstructured Populations. *Mathematical Biosciences and Engineering* 6(2) 239-259.

Bush, R.; Bender, C.; Subbarao, K.; Cox, N., & Fitch, W. (1999). Predicting the Evolution of Human Influenza A, *Science* 286: 1921-1925.

Chen, G. et al. (2006). Genomic Signatures of Human versus Avian Influenza A Viruses, *Emerging Infectious Diseases* 12(9): 1353-1360.

Eubank, S.; Guclu, H.; Kumar, A.; Marathe, M.; Srinivasan, A.; Toroczkai, Z., & Wang, N. (2004). Modelling Disease Outbreaks in Realistic Urban Social Networks. *Nature*, 429:180-184.

Ewing, G.; Nicholls, G., & Rodrigo, A. (2004). Using Temporally Spaced Sequences to Simultaneously Estimate Migration Rates, Mutation Rate and Population Sizes in Measurably Evolving Populations. *Genetics* 168:2407-2420.

Excoffier, L., & Foll, M. (2011). fastsimcoal: a Continuous-time Coalescent Simulator of Genomic Diversity under Arbitrarily Complex Evolutionary Scenarios, *Bioinformatics* Advance Access, March 2011.

Felsenstein, J.; Kuhner, M.; Yamato, J., & Beerli, P. (1999). Likelihoods on Coalescents: a Monte Carlo Sampling Approach to Inferring Parameters from Population Samples of Molecular Data. pp. 163-185 in *Statistics in Molecular Biology and Genetics*, ed. Francoise Seillier-Moiseiwitsch. IMS Lecture Notes-Monograph Series, volume 33. Inst. of Math. Statistics and American Mathematical Society, Hayward, California.

Ferguson, N., & Anderson, R. (2002). Predicting Evolutionary Change in the Influenza A Virus, *Nature Medicine* 8(6): 562-563.

Forrest, H., & Webster, R. (2010). Perspectives on Influenza Evolution and the Role of Research, *Animal Health Research Reviews* 11(1): 3-18.

Grassley, N.; Harvey, P., & Holmes, E. (1999). Population Dynamics of HIV-1 Inferred from Gene Sequences. *Genetics* 151: 427-438.

Graves, T., & Picard, R. (2003). Predicting the Evolution of P&I Mortality During A Flu Season, Los Alamos National Laboratory Unrestricted Release Report, LAUR02-4717.

Innan, H., & Stephan, W. (2000). The Coalescent in an Exponentially Growing Metapopulation and Its Application to Arabidopsis thaliana, *Genetics* 155, 2015-2109.

Kingman, J. (1982). On the Genealogy of Large Populations. *J. Appl. Probability* 19: 27-43.

Korber, B., & Myers, G. (1992). Signature Pattern Analysis: a Method for Assessing Viral Sequence Relatedness. *AIDS Res. Hum. Retro.*, 8, 1549–1560.

Lapedes, A., & Farber, R. (2001). The Geometry of Shape Space: Application to Influenza, *Journal of Theoretical Biology* 212(1), 57-69.

Minayev, P., & Ferguson, N. (2009a). Improving the Realism of Deterministic Multi-strain Models: Implications for Modelling Influenza A, *J. R. Society Interface* 6, 509-518, 2009.

Minayev, P., & Ferguson, N. (2009b). Incorporating Demographic Stochasticity into Multi-strain Epidemic Models: Application to Influenza A, *J. R. Society Interface* 6, 989-996.

Nelson, M., & Holmes, E. (2007). The Evolution of Epidemic Influenza, *Nature Reviews Genetics* 8, 196-205.

Perelson, A. et al. (1996). HIV-1 Dynamics in Vivo: Virion Clearance Rate, Infected Cell Life-Span, and Viral Generation Time. *Science* 271, 1582-1586.

Plotkin, J.; Dushoff, J., & Levin, S. (2002). Hemagglutinin Sequence Clusters and the Antigenic Evolution of Influenza A, *Proc. Nat. Acad. Sci. USA* 2002; 99: 6263-6268.

Pybus, O.; Rambaut, A., & Harvey, P. (2000). An Integrated Framework for the Inference of Viral Population History From Reconstructed Genealogies. *Genetics* 155: 1429-1437.

R: a Language and Environment for Statistical Computing, R Development Core Team, www.R-project.org.

Rambaut, A.; Pybus, O.; Nelson, M.; Viboud, C.; Taubenberger, J., & Holmes, E. (2008). The Genomic and Epidemiological Dynamics of Human Influenza A Virus. *Nature* 453:615-9.

Rambaut, A.; Robertson, D.; Pybus, O.; Peeters, M., & Holmes, E. (2001). Phylogeny and the Origin of HIV-1. *Nature*, 410:1047-8.

Rodrigo et al. (1999). Coalescent Estimates of HIV-1 Generation Time in Vivo. *Proc. Nat. Acad. Sci. USA* 96:2187-2191.

Sjodin, P.; Kaj, K.; Krone, S.; Lascoux, M., & Nordborg, M. (2005). On the Meaning and Existence of an Effective Population Size. *Genetics* 169: 1061-1070.

Stephens, M., & Donnelly, P. (2000). Inference in Molecular Population Genetics. *J. Royal Statistical Soc. B* 62(4); 605-655.

Swofford, D.; Olsen, G.; Waddell, P., & Hillis, D. Phylogenetic Inference, Chapter 11 in Molecular Systematics, edited by Hillis, D.; Moritz, C., & Mable, B., Sinauer, Sunderland, Mass. 1996.

Venables, W., & Ripley, B. (1999). Modern Applied Statistics with Splus, Springer: New York.

**Part 5**

**Protein Structure Analysis**

**14**

### **A Bioinformatical Approach to Study the Endosomal Sorting Complex Required for Transport (ESCRT) Machinery in Protozoan Parasites: The** *Entamoeba histolytica* **Case**

Israel López-Reyes1, Cecilia Bañuelos1, Abigail Betanzos2 and Esther Orozco2,3 *1Instituto de Ciencia y Tecnología del Distrito Federal, 2Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional, 3Universidad Autónoma de la Ciudad de México, México* 

#### **1. Introduction**

#### **1.1 The potential of bioinformatics for the study of protein structure and function**

Proteins are macromolecules formed by amino acid polymers that regulate cellular functions. Each protein is composed by the repetition and combination of 20 different amino acids, whose order is determined by the genetic code. To perform their biological functions, proteins fold into one or more specific spatial conformations, determined by non-covalent interactions such as hydrogen bonding, ionic interactions, Van der Waals forces and hydrophobic packing, and covalent interactions, such as disulfide bonds (Chiang et al., 2007).

Determining the structure and function of a protein is a key step, in many areas of modern biology, toward understanding its role in cell physiology. Bioinformatics is the research, development or application of computational approaches for expanding the use of biological, medical, behavioral or health-related data. It also includes the tools used to acquire, store, organize, archive, analyze or visualize such information. In recent years, bioinformatical tools have been widely used for the prediction and study of protein biology. Moreover, these tools have revealed the existence of protein "interactomes", the networks of interactions among distinct biomolecules (protein-protein, protein-lipid, protein-carbohydrate, etc.) that carry out cellular processes (Kuchaiev & Przulj, 2011).

During the last decades, genome sequencing projects, together with bioinformatics programs and algorithms, have contributed enormously to our understanding of protein structure, protein interactions and protein functions. At present, over six million unique protein sequences have been deposited in public databases, and this number is increasing rapidly. Meanwhile, despite the progress of high-throughput structural genomics initiatives, just over 50,000 protein structures have been experimentally determined (Kelley & Sternberg, 2009). The greatest challenge the molecular biology community faces today is to analyze the wealth of data produced by the genome sequencing projects, a task in which bioinformatics has been fundamental. Traditionally, molecular biology research has been carried out entirely at the laboratory bench, but the huge increase in the amount of data has made it necessary to incorporate computers and sophisticated software into research.

Additionally, the availability of genome databases for distinct organisms has improved our knowledge on the way to elucidating the last universal common ancestor. Thus, analyzing and comparing the genetic material of different species is an increasingly important approach for studying the numbers, locations, biochemical functions and evolution of genes and proteins.

In this review, we selected a particular scientific case to emphasize the usefulness and potential of bioinformatics in addressing a biological problem.

Most cellular processes use scaffold proteins to recruit other proteins and to facilitate their correct interaction and functioning. We therefore focused on the little-studied scaffold proteins that form the Endosomal Sorting Complexes Required for Transport (ESCRT) machinery during protozoan endocytosis, a fundamental process for cell survival. Here, as a case study, we aim to highlight the possible identity, function and interactions of ESCRT complexes in *Entamoeba histolytica*, as determined by the use of bioinformatical tools.

#### **1.2 Role of the ESCRT in endocytosis**

Endocytosis is a crucial process in multiple cellular and physiological events, including nutrient uptake, virus budding, cell surface receptor downregulation and cell signaling. It involves the internalization of molecules or particles of different sizes from the external environment, through membrane remodeling and vesicle formation events (de Souza et al., 2009). A huge number of interactomes are involved in endocytosis, and bioinformatics databases and computational tools have been of enormous value in the study of this highly complex process.

Several plasma membrane proteins interact with target molecules (cargo) to internalize and transport them along the endocytic pathway. Depending on their function, membrane proteins are recycled back to the cell surface or degraded at lysosomal compartments together with cargo. Delivery of endocytosed cargo for degradation occurs through the fusion of intracellular vesicles called early and late endosomes that finally reach lysosomes.

In the majority of cell types, late endosomes fuse with one another to form multivesicular bodies (MVB), which are essential intermediates for nutrient, ligand and receptor trafficking (Williams & Urbé, 2007). The best characterized signal for directing cargo molecules into the degradative MVB pathway is ubiquitination, a conjugation event in which ubiquitin, a highly conserved 76-amino-acid protein, is covalently attached to the cargo as a label. Most of the cargo proteins that accumulate in MVB are marked by a single ubiquitin, which is recognized by a specific and conserved protein machinery termed "Endosomal Sorting Complex Required for Transport (ESCRT)", whose function is fundamental during endocytosis (Williams & Urbé, 2007).

The ESCRT machinery was first characterized in yeast. It consists of a group of vacuolar protein sorting factors (some of them called Vps), which form different multimeric complexes (ESCRT-0, -I, -II and -III). These complexes bind to one another and also associate with accessory proteins and endosomal membrane lipids to carry out the whole endocytic process (Fig. 1) (Hurley & Emr, 2006).

Fig. 1. The ESCRT machinery involved in the endosomal MVB pathway.

(A) Eukaryotic cells internalize cargo molecules from the external environment by endocytic processes. These molecules transit through several compartments for recycling to the cell surface or for degradation. The degradation pathway involves an endomembrane system made up of membrane-bound organelles called endosomes, which mature from early to late endosomes/MVB and finally deliver cargo into lysosomes. (B) At the late endosome level, molecules to be incorporated into the degradation pathway must be tagged with ubiquitin. In yeast, ubiquitination of cargo proteins is mediated by the ubiquitin ligase Rsp5 and by Bul1. Then, the ESCRT-0 complex initiates the MVB sorting process by binding to the endosomal membrane through the FYVE domain of Vps27 and by recognizing ubiquitinated cargo through the UIM domains present in the Vps27 and Hse1 proteins. Subsequently, Vps27 activates the ESCRT-I complex through its interaction with Vps23. Ubiquitinated cargo is recognized by ESCRT-I (via the UEV domain of Vps23) and by ESCRT-II (via the NZF domain of Vps36). Vps36 has an extensive positively charged region with high affinity for phosphoinositides, allowing ESCRT-II to attach to endosomal membranes. Then, ESCRT-III concentrates cargo proteins into MVB. ESCRT-III also associates with accessory proteins such as Bro1 and Doa4. Importantly, the Vps4 ATPase catalyzes the dissociation of ESCRT complexes to initiate new cycles of cargo sorting and transport. Together, ESCRT-0 to -III and accessory proteins direct cargo sorting, vesicle fusion and MVB biogenesis (modified from Hurley & Emr, 2006).

Cargo ubiquitination is mediated by the Rsp5 ubiquitin ligase and the Bul1 protein. Cargo sorting through the MVB pathway then begins with the association of the Vps27 and Hse1 proteins to form the ESCRT-0 complex. Vps27 has a FYVE (Fab1, YOTB, Vac1, EEA1) domain, which binds to membrane lipids, and a UIM (Ubiquitin Interaction Motif) domain, which gives ESCRT-0 an important role in the initial selection of ubiquitinated cargo at the endosomal membrane (Hurley & Emr, 2006; Williams & Urbé, 2007). Then, ESCRT-0 recruits ESCRT-I, formed by the Vps23, Vps28, Vps37 and Mvb12 proteins (Curtiss et al., 2007; Katzmann et al., 2001). Vps23 also recognizes and binds ubiquitinated proteins through its terminal UEV (Ubiquitin E2 Variant) domain. ESCRT-I binds to ESCRT-II, formed by the Vps22, Vps25 and Vps36 proteins (Babst et al., 2002a). The latter protein also displays a ubiquitin-interacting domain and a region that recognizes and binds phosphoinositides. Next, ESCRT-II binds to the ESCRT-III complex, composed of the Vps2, Vps20, Vps24 and Vps32 proteins (Babst et al., 2002b). Vps32 associates with Bro1, which recruits Doa4, a ubiquitin hydrolase that removes ubiquitin from cargo proteins prior to their incorporation into MVB (Kim et al., 2005; Odorizzi et al., 2003). One of the main functions of ESCRT-III is to concentrate the MVB cargo in the endosomal inward membrane and to recruit Vps4, an ATPase that catalyzes the disassembly of ESCRT complexes from the endosomal membrane to initiate new rounds of cargo sorting, trafficking and vesicle formation (Hurley & Emr, 2006; Hurley & Hanson, 2010; Williams & Urbé, 2007).

Accessory proteins such as Vta1 and Ist1 regulate Vps4 function (Dimaano et al., 2008; Shiflett et al., 2004), whereas Vps46 and Vps60 have also been suggested to bind ESCRT-III, although their precise functions have yet to be determined (Babst et al., 2002b).
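For readers who want to work with this information computationally, the composition of the yeast complexes described above can be captured in a small lookup structure. The following Python sketch is only an illustration: the variable and function names are our own assumptions, and grouping Vps4 together with the accessory factors under a single "associated" label is a simplification made for convenience, not a classification taken from the cited literature.

```python
# Composition of the yeast ESCRT complexes, as listed in the text above.
ESCRT_COMPLEXES = {
    "ESCRT-0": ["Vps27", "Hse1"],
    "ESCRT-I": ["Vps23", "Vps28", "Vps37", "Mvb12"],
    "ESCRT-II": ["Vps22", "Vps25", "Vps36"],
    "ESCRT-III": ["Vps2", "Vps20", "Vps24", "Vps32"],
}

# Factors the text describes as accessory to, or associated with, the
# complexes. Placing Vps4 in this list is a simplification (assumption).
ASSOCIATED = ["Bro1", "Doa4", "Vps4", "Vta1", "Ist1", "Vps46", "Vps60"]

# Order in which the text describes the complexes being recruited.
RECRUITMENT_ORDER = ["ESCRT-0", "ESCRT-I", "ESCRT-II", "ESCRT-III"]


def complex_of(protein):
    """Return the complex containing `protein`, 'associated', or None."""
    for name, members in ESCRT_COMPLEXES.items():
        if protein in members:
            return name
    if protein in ASSOCIATED:
        return "associated"
    return None


def downstream_of(complex_name):
    """Return the complexes recruited after `complex_name`, in order."""
    index = RECRUITMENT_ORDER.index(complex_name)
    return RECRUITMENT_ORDER[index + 1:]


print(complex_of("Vps23"))       # ESCRT-I
print(downstream_of("ESCRT-I"))  # ['ESCRT-II', 'ESCRT-III']
```

A structure of this kind can serve as the list of query proteins when candidate homologues are searched for in another genome.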

#### **1.3 Evolution of the ESCRT machinery**

During the evolution from prokaryotic to eukaryotic organisms, some properties were lost while others were acquired. Among the latter is the ability of eukaryotic cells to incorporate macromolecules, complexes and other cells through endocytosis (de Souza et al., 2009). Comparative genomics and phylogenetic studies have determined that the basic features of intracellular trafficking systems arose very early in eukaryotic evolution (Dacks & Field, 2007). Similarly, evidence for the existence of MVB-like organelles in diverse primitive eukaryotes has also been reported (Allen et al., 2007; Tse et al., 2004; Yang et al., 2004). Lysosomal targeting of ubiquitinated cargo by ESCRT complexes is conserved in animals and fungi (Leung et al., 2008). Extensive experimental and bioinformatical comparative analyses of genomic data indicate that ESCRT factors are well conserved across the eukaryotic lineage (Williams & Urbé, 2007). ESCRT-I, -II and -III, as well as accessory proteins, are almost completely retained in all studied taxa, indicating an early evolutionary origin and a near-universal system for cargo trafficking through the MVB pathway. In particular, all eukaryotic organisms studied to date have at least one ESCRT-III protein, suggesting that the minimal ESCRT machinery necessary for MVB formation might be ESCRT-III (Williams & Urbé, 2007). In addition, the number of ESCRT-III components is greatly expanded in mammals in comparison to yeast, with Vps46 being the most frequent ESCRT-III multicopy gene product (Dacks et al., 2008).

A common ancestry, within the same ESCRT complex or between complexes, has been reported for the Vps20, Vps32 and Vps60 proteins (which share a Snf7 domain) and for the Vps2, Vps24 and Vps46 proteins (which share a Vps24 domain). All these proteins are highly similar at the sequence level and are encoded by multicopy genes, probably due to gene amplification events (Leung et al., 2008). In terms of biological conservation, it seems that several ESCRT components were expanded to provide functional redundancy. This redundancy would preserve ESCRT functions in the endocytic MVB pathway even if components were lost during evolution.

Significantly, the Vps4 ATPase, responsible for recycling ESCRT components, is present in all taxa, indicating a highly conserved mechanism for delivering energy to the system. This is consistent with recent evidence for an archaeal origin of Vps4 (Obita et al., 2007).

The most prominent evolutionary variation in the MVB pathway is the restriction of ESCRT-0 to animals and fungi, suggesting that a distinct mechanism for ubiquitin labeling, signal recognition and endosomal membrane binding likely operates in the rest of eukaryotic organisms (Leung et al., 2008).
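When planning a homology search in a new genome, the conservation pattern summarized in this section can be written down as a simple rule of thumb. The Python sketch below is only an illustration under stated assumptions: the lineage labels and the function name are hypothetical, and "expected" reflects the general trend reported in the text (ESCRT-0 restricted to animals and fungi; ESCRT-I, -II, -III and Vps4 retained in almost all taxa), not a guarantee for any particular organism.

```python
# Lineages in which the text reports ESCRT-0 (labels are illustrative).
ESCRT0_LINEAGES = {"animals", "fungi"}

# Components the text describes as retained in (almost) all studied taxa.
NEAR_UNIVERSAL = ["ESCRT-I", "ESCRT-II", "ESCRT-III", "Vps4"]


def expected_components(lineage):
    """Return the ESCRT components worth searching for in a lineage.

    This encodes only the broad trend described in the text; individual
    genomes may lack some of the listed components.
    """
    expected = list(NEAR_UNIVERSAL)
    if lineage.lower() in ESCRT0_LINEAGES:
        expected.insert(0, "ESCRT-0")
    return expected


print(expected_components("fungi"))
print(expected_components("protozoa"))
```

A missing ESCRT-0 hit in a protozoan genome would therefore be consistent with the reported pattern, whereas a missing Vps4 hit would deserve a closer look at the search parameters or the genome assembly.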

#### **1.4 Endocytosis and the MVB pathway in parasitic protozoa**

Protozoa are a diverse group of single-celled eukaryotic organisms, some of which are pathogens. Parasitic infections due to protozoa affect millions of people worldwide, causing a wide range of diseases, high rates of morbidity and mortality each year and an immense economic burden for public health (Geoff, 1997).

In pathogenic protozoa, endocytosis is a basic mechanism for ingesting host macromolecules and has thus been associated with parasite virulence. Previous work based on ultrastructural, cytochemical, biochemical and molecular studies has shown that protozoan parasites possess the structural compartments and proteins necessary to perform endocytosis (de Souza et al., 2009). The extent of endocytic activity varies among different protozoa and even across developmental stages. In addition, in trypanosomatids, the endocytic process is highly active in a well-defined region of the parasite cell surface called the flagellar pocket (Ghedin et al., 2001). However, only very few studies have been published that characterize the endocytic MVB pathway in protozoan parasites; some of them are summarized below.

*Giardia lamblia* is a protozoan parasite that causes diarrheal infections. It is also one of the most primitive organisms, with a substantially different endomembrane morphology as compared to higher eukaryotes. Although the morphology of membrane-bound vesicles in *Giardia* has been previously described, little information exists about vesicle budding and fusion (Lanfredi-Rangel et al., 1998). Recently, it was reported that a putative gene encoding a FYVE domain-containing protein homologous to yeast Vps27 is expressed in *G. lamblia*. This protein binds to endosomal membrane phospholipids, suggesting the presence of an MVB pathway in this parasite (Sinha et al., 2010). However, very little is known about the ESCRT machinery in *Giardia* (Leung et al., 2008).

*Leishmania major*, a flagellated parasite that causes leishmaniasis, has a plasma membrane invagination (the flagellar pocket) from which the flagellum emerges. This site contains a complex and highly polarized MVB-like network where endocytosis and exocytosis occur for crucial exchanges such as nutrient uptake. In this parasite, a Vps4 homologue (LmVps4) has been characterized using a dominant negative Vps4 mutant in which the highly conserved glutamate (E) residue required for ATP hydrolysis, at position 235, was replaced by glutamine (Q). The mutant LmVps4 protein accumulated around endocytic vesicular structures, causing a defect in cargo protein transport to the MVB-lysosomes, as has been reported for yeast and mammalian Vps4 mutants (Babst et al., 1998; Fujita et al., 2003). Additionally, LmVps4 is probably involved in *Leishmania* pathogenicity, since the Vps4 mutant protein also impaired parasite differentiation and virulence (Besteiro et al., 2006).
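A substitution of this kind is commonly abbreviated as wild-type residue, position, new residue (here, E235Q). As a minimal sketch of how such a mutant sequence can be generated in silico before further analysis, the following Python function applies a substitution written in that notation and checks that the expected wild-type residue is actually present. The function name is our own, and the short sequence used in the example is a toy string for illustration only; it is not the LmVps4 sequence.

```python
import re


def apply_substitution(sequence, mutation):
    """Apply a point substitution such as 'E235Q' to a protein sequence.

    Positions are 1-based, as in the usual mutation notation.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation format: {mutation!r}")
    wild_type = match.group(1)
    position = int(match.group(2))
    replacement = match.group(3)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    found = sequence[position - 1]
    if found != wild_type:
        # Guards against off-by-one errors or a wrong reference sequence.
        raise ValueError(
            f"expected {wild_type} at position {position}, found {found}"
        )
    return sequence[:position - 1] + replacement + sequence[position:]


# Toy sequence for illustration only; NOT a real Vps4 sequence.
toy_sequence = "MKTEAL"
print(apply_substitution(toy_sequence, "E4Q"))  # MKTQAL
```

Checking the wild-type residue before substituting is a deliberate design choice: numbering often differs between database entries, and a mismatch is an early warning that the wrong reference sequence is being used.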

Trypanosomes infect a variety of hosts and cause several diseases, including the fatal human diseases known as sleeping sickness and Chagas disease. In this group of flagellate protozoa, the trafficking system has been previously characterized (Field et al., 2007). Trypanosomes contain glycosyl-phosphatidylinositol-anchored proteins and morphologically related MVB structures, and also exhibit ubiquitin-dependent internalization of transmembrane proteins for degradation (Allen et al., 2007; Chung et al., 2004). The functional conservation of the ESCRT system has been confirmed in *Trypanosoma brucei*. Despite extreme sequence divergence, the epitope-tagged *T. brucei* proteins TbVps23 and TbVps28 localize to the endosomal pathway. Knockdown of TbVps23 partially prevents degradation of ubiquitinated proteins. Therefore, despite the absence of an ESCRT-0 complex, the MVB pathway seems to function in this parasite similarly to the yeast and human systems (Leung et al., 2008).

Members of the phylum Apicomplexa, a group of intracellular parasites that includes *Plasmodium falciparum* and *Toxoplasma gondii*, responsible for malaria and toxoplasmosis, respectively, contain morphologically unique secretory organelles termed rhoptries that are essential for host cell invasion, and also display internal membranes resembling MVB structures (Coppens & Joiner, 2003; Hoppe et al., 2000). In *T. gondii*, it has been hypothesized that the MVB pathway could intersect with rhoptry biogenesis. To explore this, wild type (PfVps4) and mutant (PfVps4E214Q) *P. falciparum* Vps4 proteins were independently overexpressed in *T. gondii*. As expected, PfVps4 was located in *T. gondii* vesicular structures, whereas PfVps4E214Q was found in aberrant organelles where rhoptry proteins were also present, indicating that the secretion pathway could be disrupted by the altered Vps4 protein. These findings suggest that MVB formation may occur in *T. gondii* and *P. falciparum* and that it could also affect the secretory route (Yang et al., 2004).

During host cell infection, *P. falciparum* lives within a special compartment known as the parasitophorous vacuole. For the parasite to survive and multiply, molecules from the host cell cytoplasm cross the parasitophorous vacuole membrane and trigger signals for the endocytic process. Despite the scarce information being available for supporting a feasible relationship between the MVB pathway and the mechanism of nutrient uptake and intracellular

compared to higher eukaryotes. Although the morphology of membrane-bound vesicles in *Giardia* has been previously described, there exists few information about vesicle budding and fusion (Lanfredi-Rangel et al., 1998). Recently, it was reported that a putative gene encoding a FYVE domain-containing protein homologous to yeast Vps27 is expressed in *G. lamblia.* This protein binds to endosomal membrane phospholipids suggesting the presence of a MVB pathway in this parasite (Sinha et al., 2010). However, very little is known about

*Leishmania major*, a flagellated parasite provoking leishmaniasis disease, presents a plasma membrane invagination (flagellar pocket) where the flagellum emerges. This site contains a complex and highly polarized MVB-like network where endocytosis and exocytosis occur for crucial exchanges such as nutrient uptake. In this parasite, a Vps4 homologue (LmVps4) has been characterized using a Vps4 dominant negative mutant in which the highly conserved E residue required for ATP hydrolysis was substituted by a Q amino acid at position 235. The LmVps4 mutant protein was accumulated around endocytic vesicular structures and this provoked a defect in cargo protein transport to the MVB-lysosomes, as it has been reported for yeast and mammalian Vps4 mutants (Babst et al, 1998; Fujita et al., 2003). Additionally, LmVps4 is probably involved in *Leishmania* pathogenicity, since the Vps4 mutant protein also

Trypanosomes infect a variety of hosts and cause several diseases, including the fatal human diseases known as sleeping sickness and Chagas disease. In this group of flagellate protozoa, the trafficking system has been previously characterized (Field et al., 2007). Trypanosomes contain glycosil-phosphatidylinositol-anchored proteins and morphologically-related MVB structures, and also exhibit ubiquitin-dependent internalization of transmembrane proteins for degradation (Allen et al., 2007; Chung et al., 2004). The functional conservation of the ESCRT system has been confirmed in *Trypanosome brucei*. Despite extreme sequence divergence, epitope-tagged *Trypanosome* TbVps23 and TbVps28 proteins localize to the endosomal pathway. Knockdown of TbVps23 partially prevents degradation of ubiquitinated proteins. Therefore, despite the absence of an ESCRT-0 complex, the MVB pathway seems to function in this parasite, similarly to the yeast and

Members of the *Apicomplexan* phylum of intracellular parasites, such as *Plasmodium falciparum* and *Toxoplasma gondii,* responsible for malaria and toxoplasmosis, respectively, contain morphologically unique secretory organelles termed rhoptries that are essential for host cell invasion, and also display internal membrane-resembling MVB structures (Coppens & Joiner, 2003; Hoppe et al., 2000). In *T*. *gondii*, it has been hypothesized that the MVB pathway could intersect with the rhoptry biogenesis one. To explore this, wild type (PfVps4) and mutant (PfVps4E214Q) *P. falciparum* Vps4 proteins were independently overexpressed in *T. gondii*. As expected, PfVps4 was located in *T. gondii* vesicular structures, whereas PfVps4E214Q was found in aberrant organelles where rhoptries proteins were also present, indicating that the secretion pathway could be disrupted by the altered Vps4 protein. These findings suggest that MVB formation may occur in *T. gondii* and *P. falciparum*

During host cell infection*, P. falciparum* lives within a special compartment known as the parasitophorous vacuole. For the parasite to survive and multiply, molecules from the host cell cytoplasm cross the parasitophorous vacuole membrane and trigger signals for the endocytic process. Despite the scarce information being available for supporting a feasible relationship between the MVB pathway and the mechanism of nutrient uptake and intracellular

the ESCRT machinery in *Giardia* (Leung et al., 2008).

human systems (Leung et al., 2008).

impaired parasite differentiation and virulence (Besteiro et al., 2006).

and that it could be affecting the secretory route too (Yang et al., 2004).

phagotrophy (the ability to ingest portions of host cytoplasm) through the parasitophorous vacuole, it may be possible that these two processes are related (de Souza et al., 2009).

*E. histolytica*, which causes amoebiasis, destroys almost all human tissues through macromolecules that participate in adherence, contact-dependent cytolysis, and proteolytic and phagocytic activities. A well-characterized protein involved in these key events is EhADH112 (García-Rivera et al., 1999). Interestingly, this protein is located at MVB-like structures in *E. histolytica* trophozoites and is structurally related to Bro1 (Bañuelos et al., 2005), an accessory protein that interacts with the ESCRT-III complex in yeast. Recently, our research group reported the presence of a set of 19 putative ESCRT proteins in this parasite and characterized a yeast Vps4 homologue by analyzing its ATPase function and its relationship to parasite virulence in wild-type and mutant cells (López-Reyes et al., 2010). Results derived from these studies strongly suggest that *E. histolytica* possesses a well-conserved ESCRT machinery.

#### **2. Experimental approaches for the identification and characterization of ESCRT proteins**

The ESCRT components involved in mediating endosomal MVB sorting of ubiquitinated proteins have been identified and characterized by several methodologies.

Initially, over 70 *vps* genes required for the vacuolar transport of proteins were identified by genetic screening in yeast (Bonangelino et al., 2002; Bowers et al., 2004). To date, only 20 of these genes are known to be functionally involved in yeast MVB formation.

In addition, the structure and function of putative binding domains present in ESCRT components have been characterized using recombinant proteins and site-directed mutagenesis. In particular, ubiquitin recognition and binding by ESCRT proteins such as Hse1, Vps27, Vps23 and Vps36 were elucidated from crystallographic structures of recombinant proteins, either bound or not bound to ubiquitin. The same methodologies have been used to characterize lipid-binding domains such as the FYVE motif, present in Vps27, and positively charged regions with affinity for phosphoinositides, such as those exhibited by Vps36 and Vps24 (Misra & Hurley, 1999; Pornillos et al., 2002; Stahelin et al., 2002; Sundquist et al., 2004).

The yeast two-hybrid system is an assay for examining protein interactions. It relies on a bait protein fused to a DNA-binding domain and a prey protein fused to an activation domain. Expression of the reporter gene indicates that the two proteins of interest interact, since the interaction brings the activation domain to the promoter, where it drives transcription of the reporter gene (Gietz et al., 1997). Pull-down assays, on the other hand, are performed either to confirm a suspected interaction between two proteins or to search for unknown proteins or molecules that may bind to a protein of interest (Kaltenbach et al., 2007). Alternatively, histidine- or glutathione S-transferase (GST)-tagged bait proteins can be purified by affinity chromatography on an immobilized support. The bait protein (or ligand) is captured on a solid support (beads) either by covalent attachment to an activated beaded support or through an affinity tag that binds to a receptor molecule on the support (Pandeya & Thakkar, 2005).

In yeast, Bro1 binding to Vps32 was discovered by two-hybrid experiments, whereas Bro1 association with Vps4 was revealed by GST pull-down experiments. Additionally, using both methodologies, interactions between Vps20 and Vps28, Vps20 and Vps22, and Vps22 and Vps28 were identified. Moreover, protein-protein interactions involved in ESCRT assembly have been demonstrated by yeast two-hybrid assays, affinity purification or both methods (Vps20 with Vps25 and Vps36; Vps27 with Hse1; Vps4 with Vps32; Vps22 with Vps25; and Vps22 with Vps36) (Bowers et al., 2004).

Another strategy for studying protein function is the use of dominant negative (DN) mutants. Mutations are changes in a genomic sequence, and the product of a mutant allele sometimes dominates over the wild-type protein synthesized in the same cell. DN mutants can usually still interact with the normal partner proteins, thereby blocking the function of the wild-type protein. To improve our knowledge of the ESCRT model, several DN mutants of Vps proteins have been generated, including Hrs, Vps27, Vps23, Vps20 and Vps4 (Kanazawa et al., 2003; Li et al., 1999; Fujita et al., 2003).

Research using such strategies has increased our knowledge on the identity, structure, function and biological relationships of several molecules participating in the protein sorting through the endosomal MVB pathway. However, complementary experimental efforts need to be performed to better understand this cellular process.

#### **3. Computational research on protein biology**

One of the most familiar applications of bioinformatics is the comparison of the amino acid sequence from a query protein against the amino acid sequence of a protein previously characterized in structure and function, to theoretically elucidate whether they are related. This approach gives insights into functional similarities and evolutionary relationships deduced from the presence of common structural features (Söding, 2005).

Similarity and homology are two important concepts in the bioinformatical analysis of protein sequences. Similarity is a quantitative measure between two or more related amino acid sequences. By contrast, homology is a qualitative measure which indicates if two or more proteins are evolutionarily related or derived from a common ancestor (Claverie & Notredame, 2006). Protein sequences are usually submitted, annotated and stored in databases that allow their comparison and analysis by certain software.
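To make the idea of similarity as a quantitative measure concrete, the following minimal sketch computes percent identity between two sequences that have already been aligned. The function name `percent_identity`, the choice to skip any column containing a gap, and the toy sequences are assumptions made for illustration only; alignment programs may define identity over a different denominator.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of gap-free alignment columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # columns with a gap are not counted (an assumed convention)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical aligned fragments: 4 identical residues in 5 gap-free columns.
print(percent_identity("MKV-LA", "MRVALA"))
```

A high identity value is only evidence for homology, not proof of it; the decision that two proteins share a common ancestor remains a qualitative inference.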

In general, a database is a digital system that organizes and stores large amounts of data and allows them to be retrieved easily. Currently, several genome and proteome databases are freely available for studying protein biology. However, the sheer amount of data makes manual interpretation very difficult. Therefore, databases require supplementary and incisive computational tools in order to make sense of the information.

One of the most recognized databases is the UniProt Knowledgebase (UniProtKB, http://www.uniprot.org/). The UniProtKB is the central hub for the collection of functional information on annotated proteins. It consists of a section containing manually annotated records with information extracted from the literature and curator-evaluated computational analysis (UniProtKB/Swiss-Prot), and a section with computationally analyzed records that await full manual annotation (UniProtKB/TrEMBL). Manual annotation consists of a critical and continuously updated review of experimentally proven or computer-predicted data about each protein by an expert team of biologists.

The UniProtKB captures the mandatory core data for each entry (amino acid sequence, protein name, description, taxonomic data and citation information) and supplementary information derived from experimental evidence or computational data.


More than 99% of the protein sequences provided by UniProtKB come from the translation of coding sequences and related data submitted to the public nucleic acid databases, including the European Molecular Biology Laboratory (EMBL) Bank, GenBank (USA) and the DNA DataBank of Japan (DDBJ). To take full advantage of this information, a number of computational tools are available for interpreting databases, some of which are briefly described below.

The Expert Protein Analysis System (ExPASy) is a proteomics server from the Swiss Institute of Bioinformatics that analyzes protein sequences and structures and contains genome databases for several organisms ranging from Archaea to human (http://expasy.org/tools/proteome). It offers several tools for depicting primary, secondary and tertiary protein structures and for determining putative post-translational modifications, among others.

The Basic Local Alignment Search Tool (BLAST) is an algorithm for comparing primary biological sequence information, such as the amino acid sequences of different proteins or the nucleotides of distinct DNA sequences. A BLAST search enables a researcher to compare a query sequence with the sequences held in libraries or databases, and to identify those that resemble the query above a certain threshold. The main idea of BLAST is that a statistically significant alignment often contains high-scoring segment pairs (HSPs). BLAST searches for high-scoring alignments between the query sequence and sequences from genome databases, using a heuristic approach that approximates the Smith-Waterman algorithm (Altschul et al., 1990). The BLASTP program, which compares protein queries to protein databases, is a heuristic method that attempts to optimize a specific similarity measure. The goal of this tool is to find regions of sequence similarity, which can yield clues about the structure and function of a novel sequence and about its evolutionary history and homology to other sequences in databases (Henikoff & Henikoff, 2000). To produce a multiple sequence alignment from the BLASTP output, the program simply collects all database sequence segments that have been aligned to the query with an expectation value (*E*-value) below a threshold, set by default to 0.001. The *E*-value estimates the number of alignments with an equal or better score that would be expected by chance in a database of the same size; thus, the lower the *E*-value, the more significant the match between the input and the database sequence. An *E*-value < 1e-3 therefore indicates that the alignment is unlikely to have arisen by chance (http://bips.u-strasbg.fr/fr/Tutorials/Comparison/Blast/blastall.html). As an alternative for more sensitive searches, the Position-Specific Iterative (PSI)-BLAST program iteratively searches one or more protein databases to find sequences similar to one or more protein query sequences.
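The *E*-value cut-off described above can be applied to search results with a few lines of code. The sketch below assumes that the hits are available in the 12-column tab-separated layout produced by the BLAST+ `-outfmt 6` option; the function name `filter_hits`, the identifiers `query1`, `hitA`, `hitB` and `hitC`, and all numbers in `EXAMPLE` are hypothetical and are not results from this chapter.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(tabular_text, max_evalue=1e-3):
    """Keep hits whose E-value is below max_evalue, most significant first."""
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    hits = []
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(FIELDS, row))
        record["evalue"] = float(record["evalue"])
        record["bitscore"] = float(record["bitscore"])
        if record["evalue"] < max_evalue:
            hits.append(record)
    hits.sort(key=lambda r: r["evalue"])
    return hits


# Made-up hits for illustration only.
EXAMPLE = (
    "query1\thitA\t45.2\t210\t100\t5\t1\t205\t3\t212\t2e-40\t150\n"
    "query1\thitB\t28.0\t90\t60\t4\t10\t95\t20\t108\t0.5\t31.2\n"
    "query1\thitC\t33.1\t180\t110\t6\t5\t180\t1\t178\t4e-12\t66.0\n"
)

for hit in filter_hits(EXAMPLE):
    print(hit["sseqid"], hit["evalue"])
```

With the default threshold of 0.001, `hitB` (*E*-value 0.5) is discarded while the two more significant hits are retained.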

ClustalW is a widely used multiple sequence alignment program (http://align.genome.jp/). In many cases, the input set of query sequences is assumed to have an evolutionary relationship, that is, to share a lineage and descend from a common ancestor. ClustalW is usually complemented by the BOXSHADE application (http://www.ch.embnet.org/software/BOX\_form.html), a program for creating well-formatted printouts from multiple-aligned protein or DNA sequences. BOXSHADE does not produce alignments by itself; it takes as input a file preprocessed by a multiple alignment program such as ClustalW, or by a multiple file editor. In the standard BOXSHADE output, identical and similar residues in the multiple-alignment chart are represented by different colors or shadings.
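The way identical residues are picked out of a finished alignment can be illustrated with a short sketch. It is not the BOXSHADE algorithm: it only marks columns in which every sequence carries the same residue, and it ignores the notion of similar residues altogether. The function name `conservation_line` and the three toy sequences are assumptions made for this example.

```python
def conservation_line(aligned_seqs, gap="-"):
    """Return a string with '*' under fully identical, gap-free columns."""
    if not aligned_seqs:
        return ""
    width = len(aligned_seqs[0])
    if any(len(seq) != width for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    for column in zip(*aligned_seqs):
        residues = set(column)
        if len(residues) == 1 and gap not in residues:
            marks.append("*")  # every sequence has the same residue here
        else:
            marks.append(" ")
    return "".join(marks)


# Hypothetical pre-aligned fragments.
alignment = ["MKVLA", "MRVLA", "MKV-A"]
for seq in alignment:
    print(seq)
print(conservation_line(alignment))
```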

#### **3.1 Computational tools for predicting protein domains**

Protein domains, defined as the independent folding units within a polypeptide, are also understood as the functional and evolutionarily conserved modules of protein families.


The Pfam protein family database is a large collection of protein families, each represented by multiple sequence alignments and by probabilistic models known as hidden Markov models (HMMs) (http://www.sanger.ac.uk/resources/databases/pfam.html). The Pfam database contains information about protein domains and families. For each family in Pfam, one can look at multiple alignments, view protein domain architectures, examine species distribution, follow links to other databases and view known protein structures.

Although the biochemical and molecular literature on proteins keeps growing, Pfam gathers the essential information about major protein domains that is needed to understand an increasingly complex biological landscape.

Because the ClustalW and BOXSHADE programs help to identify conserved residues and similar regions among amino acid sequences, they also support the prediction of putative domains in a protein or group of proteins of interest.

#### **3.2 Computational approaches for predicting secondary protein structure**

Secondary structure refers to highly regular local sub-structures within a molecule. The secondary structure of a protein is defined by patterns of hydrogen bonds between the main-chain peptide groups, which give rise to recognizable structural elements such as alpha (α) helices and beta (β) sheets (Offer et al., 2002).

Several algorithms have been described for predicting protein secondary structure, one of them being Jpred (Cole et al., 2008). Jpred uses a three-iteration PSI-BLAST search to retrieve sequences from existing databases for predicting secondary structures. Jpred now incorporates Jnet, a neural network method for secondary structure prediction. The Jnet algorithm works on multiple sequence alignments, together with PSI-BLAST and HMM profiles (Cuff & Barton, 1999). The updated Jnet algorithm provides α-helix and β-sheet predictions at an accuracy of 81.5% (Cole et al., 2008).
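An accuracy figure of this kind can be read as the percentage of residues assigned to the correct state. The sketch below assumes a three-state, per-residue measure (helix `H`, strand `E`, coil `C`), often referred to as Q3; whether the cited value was obtained in exactly this way is not stated here. The function name `q3_accuracy` and the two state strings are hypothetical.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    if not observed:
        raise ValueError("state strings must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical 10-residue example: one residue is mispredicted (H instead of C).
predicted = "HHHHEEEECC"
observed = "HHHCEEEECC"
print(q3_accuracy(predicted, observed))
```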

#### **3.3 Computational algorithms for predicting tertiary protein structure**

The tertiary structure of a protein refers to the three-dimensional arrangement of a single protein molecule. The α-helices and β-sheets are folded into a compact structure due to nonspecific hydrophobic interactions. However, this structure is stable only when the parts of a protein domain are locked into place by specific tertiary interactions, such as salt bridges, hydrogen bonds, and the tight packing of side chains and disulfide bonds (Peng & Kim, 1994).

The Protein Data Bank (PDB) contains information about experimentally determined structures of proteins, nucleic acids and complex assemblies (http://www.pdb.org/pdb/home/home.do). The Resource for Studying Biological Macromolecules curates and annotates the PDB data according to agreed-upon standards and also provides a variety of tools and resources. The PDB is a repository for three-dimensional structural data of proteins (typically obtained by X-ray crystallography or nuclear magnetic resonance spectroscopy) submitted by biologists and biochemists from around the world.

The PDB is a key resource in areas of structural biology, such as structural genomics. The contents of the PDB are regarded as primary data, and there are currently hundreds of derived databases that categorize these data in different ways.
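Derived resources typically start by reading the atomic coordinates out of PDB entries. As a minimal sketch, assuming the legacy fixed-column PDB text format in which `ATOM` records hold the atom name in columns 13-16, the residue name in columns 18-20, the chain in column 22, the residue number in columns 23-26 and the x, y, z coordinates in columns 31-54, the code below extracts Cα positions. The helper `format_atom_line` exists only to build a toy input with correct column widths, and the coordinates are invented.

```python
def parse_atom_line(line):
    """Read selected fixed-width fields from one ATOM record."""
    return {
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


def ca_coordinates(pdb_text):
    """Return (x, y, z) tuples for every C-alpha atom found in ATOM records."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            atom = parse_atom_line(line)
            coords.append((atom["x"], atom["y"], atom["z"]))
    return coords


def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Hypothetical helper: build an ATOM record with the assumed column layout."""
    return "{:<6}{:>5} {:<4} {:>3} {}{:>4}    {:8.3f}{:8.3f}{:8.3f}".format(
        "ATOM", serial, name, res_name, chain, res_seq, x, y, z)


# Invented coordinates for a two-residue toy fragment.
TOY_PDB = "\n".join([
    format_atom_line(1, " N  ", "MET", "A", 1, 27.340, 24.430, 2.614),
    format_atom_line(2, " CA ", "MET", "A", 1, 26.266, 25.413, 2.842),
    format_atom_line(3, " CA ", "LYS", "A", 2, 23.101, 26.005, 4.910),
])

print(ca_coordinates(TOY_PDB))
```

For real entries a dedicated parser is usually preferable, since actual files contain alternate locations, insertion codes and other records that this sketch does not handle.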

The Phyre (Protein homology/analogy recognition engine) web server is a powerful computational tool that uses profile-profile matching algorithms to considerably improve protein structure predictions (Kelley & Sternberg, 2009). The Phyre platform follows the most successful general approaches for predicting the structure of proteins, which involve detecting homologues of known three-dimensional structure: so-called template-based homology modeling and fold recognition. Practical applications of three-dimensional protein structure prediction include guidance on functional hypotheses, the selection of mutagenesis sites and the rational design of drugs, among others.

The Phyre server uses a library of known protein structures taken from the SCOP (Structural Classification of Proteins) database and augmented with newer depositions in the PDB. The sequence of each of these structures is scanned against a non-redundant sequence database, and a profile is constructed and deposited in a "fold" library. The known and predicted secondary structures of these proteins are also stored in the fold library. A user-submitted sequence follows the same process. Five iterations of PSI-BLAST are used to gather both close and remote sequence homologues. The pairwise alignments generated by PSI-BLAST are combined into a single alignment with the query sequence as the master. Following the profile construction, the secondary structure of the query is predicted using three distinct programs (Psi-Pred, SSPro and Jnet). Subsequently, both the profile and the secondary structure are scanned against the fold library using a profile-profile algorithm that returns a score. Scores are fitted to an extreme value distribution to generate an *E*-value. The ten highest-scoring alignments are then used to construct full three-dimensional models for the query. Where possible, missing or inserted regions caused by deletions or insertions in the alignment are repaired using a loop library and reconstruction procedures.
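The step that turns raw alignment scores into an *E*-value can be illustrated with a small sketch. It is not Phyre's actual code: it assumes a Gumbel (type I extreme value) distribution fitted by the method of moments to a set of background scores, and the function names `fit_gumbel` and `evalue`, as well as the `library_size` parameter, are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(scores):
    """Method-of-moments fit of a Gumbel distribution (an assumed fitting choice).

    Returns the location (mu) and scale (beta) parameters.
    """
    beta = statistics.pstdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.fmean(scores) - EULER_GAMMA * beta
    return mu, beta


def evalue(score, mu, beta, library_size):
    """Expected number of library entries scoring at least `score` by chance."""
    z = (score - mu) / beta
    # P(S >= score) = 1 - exp(-exp(-z)), written with expm1 for numerical stability
    p_value = -math.expm1(-math.exp(-z))
    return library_size * p_value


# Hypothetical background scores from unrelated query/template comparisons
background = [12.0, 15.5, 9.8, 14.1, 11.3, 13.7, 10.2, 16.0]
mu, beta = fit_gumbel(background)
print(evalue(30.0, mu, beta, library_size=10000))
```

Under these assumptions a higher score always yields a lower *E*-value, which is why the best-scoring templates are the ones retained for model building.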

An alternative program widely used to model tertiary protein structures is SWISS-MODEL. SWISS-MODEL is a fully automated protein structure homology-modeling server accessible via the ExPASy web server or from the DeepView program (http://swissmodel.expasy.org/). The purpose of this server is to make protein modeling accessible to all biochemists and molecular biologists worldwide by providing tools for accurate protein structure prediction.

Once a tertiary structure has been modeled, it usually has to be inspected in a model viewer. Jmol is a free open-source viewer for three-dimensional chemical structures that is written in Java (so it runs on Windows, Mac OS X, Linux and UNIX systems). Jmol returns a representation of a molecule that may be used as a teaching tool or for research, e.g. in chemistry and biochemistry. Its most notable feature is an applet that can be integrated into web pages to display molecules in a variety of models: "ball and stick", "space filling", "ribbon", etc. (http://jmol.sourceforge.net/download/).

### **4. ESCRT protein survey in protozoan parasites with bioinformatical tools**

By using bioinformatical screening and comparative genomic analysis, we confirmed in this work the presence of ESCRT representatives in unrelated groups of unicellular parasites of medical importance belonging to the following taxa: *Entamoebidae (Entamoeba), Diplomonadida (Giardia), Alveolata* of the phylum *Apicomplexa (Toxoplasma* and *Plasmodium),* and *Kinetoplastida (Trypanosoma* and *Leishmania).*

First, we obtained yeast or mammalian amino acid sequences for ESCRT-0 to -III and associated proteins from the UniProtKB database. Then, the retrieved sequences were used as probes to screen the Eukaryotic Pathogen database (EuPathDB version 2.9, http://eupathdb.org/eupathdb/). The EuPathDB has been developed as a Bioinformatics Resource Center and constitutes an integrated genome database covering eukaryotic pathogens of the genera *Cryptosporidium, Giardia, Entamoeba, Leishmania, Plasmodium, Toxoplasma, Trichomonas* and *Trypanosoma,* among others. This portal offers an entry point to all these resources, and the opportunity to leverage orthology (structural correspondence or similarity of genes or proteins in different species due to a common ancestor origin) for searches across genera in an interface that is functional, user-friendly and sophisticated.

Using yeast ESCRT protein sequences as queries in the EuPathDB resource for each parasite genome, the BLASTP program reported several amino acid sequences for each pathogen. When no matches were found, the corresponding human ESCRT protein sequences were used as queries. Putative parasite ESCRT homologues were selected with the following criteria: *i)* at least 20% identity and 35% similarity to the query sequence, *ii)* an *E*-value lower than 0.002, and *iii)* absence of stop codons in the coding sequence. Furthermore, all recovered sequences were subjected to reverse BLAST analysis on the ExPASy server to identify related proteins from genome databases. A candidate was taken into consideration if reverse BLAST recovered the original query within the top five hits. Failure to pass these tests resulted in a "not determined" assignment.
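The selection criteria above can be summarized as a simple filter. The following is only an illustrative sketch: the `Hit` record, its field names and the `classify` function are hypothetical and do not correspond to any BLAST output format, and a stop codon is assumed to appear as an internal `*` in the translated sequence.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """Hypothetical summary of one BLASTP hit (field names are assumptions)."""
    subject_id: str
    identity_pct: float
    similarity_pct: float
    evalue: float
    protein_seq: str         # translated candidate; '*' marks a stop codon
    reverse_hits: List[str]  # IDs returned by reverse BLAST, best hit first


def classify(hit, query_id, min_identity=20.0, min_similarity=35.0,
             max_evalue=0.002, top_n=5):
    """Apply criteria i) to iii) and the reverse BLAST test described in the text."""
    # A single trailing '*' is treated as the natural terminator (assumption)
    body = hit.protein_seq[:-1] if hit.protein_seq.endswith("*") else hit.protein_seq
    passes_forward = (
        hit.identity_pct >= min_identity
        and hit.similarity_pct >= min_similarity
        and hit.evalue < max_evalue
        and "*" not in body
    )
    passes_reverse = query_id in hit.reverse_hits[:top_n]
    return "candidate" if passes_forward and passes_reverse else "not determined"


# Hypothetical example: identifiers and values are made up for illustration
hit = Hit("parasite_protein_1", 27.0, 45.0, 1e-20, "MSKLT*", ["other", "yeast_query"])
print(classify(hit, "yeast_query"))
```

Keeping the thresholds as parameters makes it easy to see how sensitive the resulting candidate list is to each cut-off.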

BLAST results showed that all parasites studied here contain putative protein sequences representing the ESCRT-0 to -III and -accessory proteins involved in the endocytic MVB pathway. In Table 1, we summarized the results derived from our parasite ESCRT genomic survey in comparison to ESCRT members previously reported in yeast or human. The major noticeable feature was the high conservation of ESCRT components in all taxa, as previously reported (Leung et al., 2008). As noticed, *Entamoeba histolytica* and *Leishmania major* contain the most represented and conserved ESCRT machinery among parasites, with 19 ESCRT components. Meanwhile, *Trypanosoma cruzi* and *Plasmodium falciparum* displayed 15 and 14 ESCRT putative proteins, respectively. By contrast, we only found 9 out of the 20 ESCRT proteins in *Toxoplasma gondii* and *Giardia lamblia.* 

Ubiquitin-label recognition is the signal for cargo protein entrance towards degradation through the endosomal pathway (Bowers et al., 2004). Rsp5 and Bul1 proteins mediate ubiquitin attachment to cargo proteins in yeast. Here, bioinformatical approaches revealed that ubiquitination seems to be mediated by Rsp5 rather than Bul1 homologues, since Rsp5-like proteins were present in all protozoan genomes.

Unlike preceding work, we found at least one ESCRT-0 representative for each parasite, indicating that proteins recognizing ubiquitin signals could be participating in cargo sorting in these protozoa. ESCRT-I and -II were the least represented complexes among all parasites, suggesting that some taxa could have lost specific components during ESCRT evolution. However, we cannot exclude that the lack of individual ESCRT components reflects failures in gene or protein detection rather than a real absence of the protein. In particular, detection failures have been frequently reported for *Giardia*, owing to the difficulty of recovering candidate orthologues from its extremely divergent genome.

To the best of our knowledge, there is no sequenced eukaryotic genome without an ESCRT-III-related gene. Moreover, the size of the subset of ESCRT-III-related genes is greatly expanded in higher eukaryotes such as mammals, compared to yeast. As a consequence, it has been hypothesized that the ESCRT-III complex might be the minimal ESCRT unit for MVB formation (Williams & Urbé, 2007). Consistently, our results revealed at least two ESCRT-III representatives in each parasite genome analyzed.

Regarding the ESCRT-accessory proteins, the most conserved sequences among all parasites were the Rsp5, Vps4, Vps46, Doa4, Vta1 and Bro1 homologues, in contrast to Ist1, which was only present in trypanosomatids.


Taken together, our *in silico* results support the existence of a seemingly conserved ESCRT machinery for endosomal protein trafficking through the MVB pathway in protozoan parasites.

Table 1. Comparison of ESCRT machineries from parasitic protozoa.

The presence (+) of homologous proteins is based on data obtained by BLAST searches of protein sequence databases at NCBI, UniProtKB and EuPathDB, as described in the text. Proteins apparently absent (-) from complete genome sequencing projects are indicated.

#### **4.1 Characterization of the ESCRT machinery in** *E. histolytica*

Our previous work, using comparative genomics for predicting ESCRT proteins in *E. histolytica*, provided valuable insights into the existence of a highly conserved ESCRT machinery in this parasite. López-Reyes et al. (2010) reported a set of 19 putative ESCRT proteins representing ESCRT-0 to -III and associated proteins (Table 2). Moreover, the earlier characterization of ubiquitin genes and transcripts and the demonstration of a ubiquitin-conjugating system, together with our finding of a putative Rsp5 ubiquitin ligase (EhRsp5) in *E. histolytica*, provided additional support for the presence of at least one candidate that possibly mediates ubiquitin attachment to cargo molecules prior to their internalization into endosomes (Wöstmann et al., 1996).

Previous work has provided insight into the architecture, membrane recruitment and functional interactions of the ESCRT machinery through multiple domains that have been shaped during evolution. These scaffolds serve as gripping tools for recognizing cargo proteins, membrane lipids, ESCRT components and accessory proteins along the MVB route (Hurley & Emr, 2006).


To dissect the presence of putative ubiquitin and phosphoinositide binding domains in *E. histolytica* ESCRT-like components, we selected ESCRT-0 to -III representatives (EhVps27, EhVps23, EhVps36 and EhVps24, respectively) presumably containing these structural features according to their yeast and human homologues and performed multiple sequence alignments with the ClustalW program.


Table 2. Comparison of *E. histolytica, H. sapiens* and *S. cerevisiae* ESCRT machineries. Data of conserved ESCRT proteins from yeast and human were obtained from the NCBI and UniProtKB databases. Putative *E. histolytica* ESCRT proteins were retrieved by BLAST searches at EuPathDB and the corresponding UniProtKB accession numbers were obtained. Putative ESCRT proteins of *E. histolytica* exhibited significant *E*-values (1.1*e*-114 to 0.00032) and high similarity (20 to 62%) to yeast and human ESCRT orthologues. nd, not determined; ----, nonsignificant similarity or identity and *E*-values; S, similarity; I, identity (Modified from López-Reyes et al., 2010).

Our computational comparative analysis showed that the ESCRT-0 complex lacks the characteristic VHS (Vps27, Hrs and STAM) domain of yeast Vps27, required for the protein interaction with ubiquitin (Williams & Urbé, 2007). However, EhVps27 displayed a (R/K)(R/K)HHCR motif usually found within conserved FYVE domains and necessary for phosphatidylinositol 3-phosphate (PtdIns3P) binding (Misra & Hurley, 1999). This finding was also supported by the Pfam database, which reported the presence of a putative FYVE domain in the EhVps27 amino acid sequence.
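A motif such as (R/K)(R/K)HHCR translates directly into a regular expression, which gives a quick way to locate it in any protein sequence. The sketch below is illustrative only: the function name `find_fyve_motif` is hypothetical and the example sequence is invented, not the EhVps27 sequence.

```python
import re

# (R/K)(R/K)HHCR written as a regular expression over one-letter amino acid codes
FYVE_MOTIF = re.compile(r"[RK][RK]HHCR")


def find_fyve_motif(sequence):
    """Return (1-based start position, matched residues) for each motif occurrence."""
    return [(m.start() + 1, m.group()) for m in FYVE_MOTIF.finditer(sequence.upper())]


# Invented toy sequence, used only to show the call
print(find_fyve_motif("MSTAWRKHHCRACGKV"))  # [(6, 'RKHHCR')]
```

A motif match alone does not establish a functional FYVE domain, which is why the text also relies on the Pfam domain prediction.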


Membrane phospholipids such as PtdIns3P have been previously implicated in the regulation of endocytosis and phagocytosis, and 12 FYVE domain-containing proteins have been identified in *E. histolytica* (Nakada-Tsukui et al., 2009).

The UEV domain present in yeast Vps23 and its human homologue Tsg101 is necessary to recognize ubiquitin signals in proteins to be sorted into MVB (Pornillos et al., 2002; Sundquist et al., 2004). Despite the lower similarity among the analyzed sequences, our bioinformatics approach suggested the presence of a putative UEV domain at the N-terminus of EhVps23, also supported by Pfam domain predictions.

According to our current investigation, EhVps36 lacks the yeast NZF and human GLUE domains previously reported in Vps36 homologues. These domains have been implicated in ubiquitin and PtdIns3P binding, respectively. Instead, EhVps36 conserves an N-terminal positively charged amino acid region. Similarly, the EhVps24 protein exhibits a positively charged amino acid tract that spans almost its full sequence. Since specific binding to phosphoinositides requires electrostatic interactions between negatively charged phosphates on lipids and positively charged amino acids in proteins, it is feasible that EhVps36 and EhVps24 associate with phosphoinositides present at endosomal membranes (Whitley et al., 2003).
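Positively charged tracts of this kind can be spotted by scanning a sequence with a sliding window and counting basic residues. The sketch below is an assumption-laden illustration, not the procedure used in this work: it counts only lysine (K) and arginine (R) as positive, and the function names, window size and threshold are arbitrary choices.

```python
def positive_fraction(segment, positive="KR"):
    """Fraction of residues in `segment` that are lysine or arginine (assumed set)."""
    segment = segment.upper()
    if not segment:
        return 0.0
    return sum(segment.count(aa) for aa in positive) / len(segment)


def charged_windows(sequence, window=20, threshold=0.3):
    """Return (1-based start, fraction) for every window at or above `threshold`."""
    hits = []
    for start in range(max(len(sequence) - window + 1, 0)):
        fraction = positive_fraction(sequence[start:start + window])
        if fraction >= threshold:
            hits.append((start + 1, round(fraction, 2)))
    return hits


# Invented toy sequence with a basic N-terminal stretch
print(charged_windows("MKRKKLRKAGSTDEELLAAGS", window=8, threshold=0.5))
```

Overlapping windows that pass the threshold can then be merged to delimit a candidate phosphoinositide-binding region.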

Secondary structure assignments for putative ESCRT proteins of *E. histolytica* were obtained with the Jpred program. In agreement with our previous findings, the EhVps27, EhVps36, EhVps24 and EhVps23 proteins showed arrangements similar to those of yeast Vps27, Vps36 and Vps24, and human Tsg101, respectively. Furthermore, according to both Phyre and SWISS-MODEL tertiary structure predictions, the three-dimensional structures of EhVps27 and EhVps36 matched the yeast Vps27 (PDB code: 1vfy) and Vps36 (PDB code: 1u5t) crystal structures, respectively. In addition, the Phyre software predicted conformational arrangements similar to human Tsg101 (PDB code: 1s1q) and CHMP3 (PDB code: 2gd5) for EhVps23 and EhVps24, respectively.

Altogether, our results indicate the presence of putative structural and conformational features for ubiquitin and lipid binding in representative proteins from the *E. histolytica* ESCRT-0, -I, -II and -III complexes.

To determine the identity of putative ESCRT-accessory proteins, we first focused on EhADH112, a protein widely studied by our group and involved in *E. histolytica* adherence to and phagocytosis of host cells (García-Rivera et al., 1999). *In silico* analysis of the primary sequence of EhADH112, together with Pfam protein domain predictions, revealed that EhADH112 is structurally related to yeast Bro1 and its human homologue Alix. EhADH112 has a conserved Bro1 domain at its N-terminus. In Bro1 and Alix proteins, the Bro1 domain constitutes the interacting site for Vps32 or CHMP4B, respectively, both components of the ESCRT-III complex. Experimental approaches demonstrated that *E. histolytica* parasites overexpressing only a part of the EhADH112 Bro1 domain dramatically reduced their ability to ingest cells, thus providing additional evidence for EhADH112 participation in phagocytosis (our unpublished results). Furthermore, immunolocalization of EhADH112 and truncated EhADH112 proteins in parasites, using both transmission electron and laser confocal microscopy, revealed that besides its detection at the plasma membrane and cytoplasmic vacuoles, EhADH112 is also found in MVB-like organelles, whereas the EhADH112 mutant version accumulates in cytoplasmic vesicles. These findings led us to assign a putative role for the EhADH112 Bro1 domain in recruiting proteins to the endosomal membranes forming MVB. Possibly, Vps proteins from the ESCRT-III complex or some other molecules could be involved in this event, thus affecting the *E. histolytica* phagocytosis process. In order to identify putative interacting partners for EhADH112, we used a computational survey for yeast Vps32 or human CHMP4B homologous sequences in the *E. histolytica* genome. We found a putative EhVps32 protein whose existence in *E. histolytica* was confirmed by further experimental data (Bañuelos et al., 2007). According to multiple sequence analysis and Pfam database predictions, EhVps32 contains a Snf7 domain, present in all members of the Snf7 family. Additionally, the EhVps32 secondary structure predicted with the Jpred program suggested that EhVps32 conserves the characteristic five α-helices present in Snf7 family proteins (Fig. 2A). Using the Phyre program, the tertiary structure of EhVps32 was modeled. Results showed that the predicted structure of EhVps32 is related to human CHMP3, a Snf7 family member (Fig. 2B). Since the crystal structure of CHMP4B has not yet been solved, the program uses by default the CHMP3 crystal structure as template due to the presence of the highly conserved Snf7 domain. Thus, tertiary structures for CHMP4B and Vps32 were also modeled using CHMP3 as template (Fig. 2B). Retrieved results showed that EhVps32 adopts a conformational structure and folding more similar to CHMP4B than to yeast Vps32, in agreement with the highest similarity reported by BLAST analysis for EhVps32, which was to the human CHMP4B sequence (Table 2). To confirm the predicted interaction between EhADH112 and EhVps32 proteins, pull-down experiments were performed. Assays demonstrated that EhADH112 binds through its N-terminus to a recombinant EhVps32 protein fused to GST (our unpublished data).

Fig. 2. Structural comparison of Vps32 homologues.

(A) At the top, a schematic representation for human CHMP4B, yeast Vps32 and *E. histolytica* EhVps32 proteins is shown. Numbers indicate amino acids for each protein. All proteins contain conserved Snf7 domains, present in the Snf7 family proteins. Vps32 orthologues belong to the ESCRT-III complex and have been described as the interacting partners of Bro1 domain-containing proteins. At the bottom, a multiple sequence alignment for Vps32 homologues is shown. Hs, *H. sapiens*; Sc, *S. cerevisiae*; and Eh, *E. histolytica*. Black boxes, identical amino acids; grey boxes, conserved substitutions; and open boxes, different residues. Numbers at left are relative to the position of the start codon in each protein. The Jpred secondary structure prediction program revealed that EhVps32 folds into five α-helices (green horizontal cylinders) as has been reported for Vps32 homologues. (B) Tertiary protein structure for *H. sapiens*, *S. cerevisiae* and *E. histolytica* Vps32 homologues. Modeling was done using the Phyre program with the crystal structure of human CHMP3 as template.

Since yeast Vps4 and its orthologues have been previously described as key molecules for ESCRT dissociation and recycling, López-Reyes et al. (2010) characterized the EhVps4 protein in more detail. Protein domain predictions, as well as tertiary structure modeling and phylogenetic trees obtained for EhVps4, suggest that it conserves a typical Vps4 architecture (Babst et al., 1998) and is more related to protozoan Vps4 homologues than to those of higher eukaryotes. Biochemical experiments using a recombinant EhVps4 protein and ATP as substrate evidenced the ATPase activity of EhVps4 *in vitro*. As expected, when using a mutant version of EhVps4, in which an E residue was substituted by a Q amino acid, the ATPase activity was reduced. Furthermore, *E. histolytica* parasites overexpressing the EhVps4 mutant protein displayed reduced virulence properties, suggesting a role for EhVps4 in parasite pathogenicity, probably related to its participation in the endocytic pathway.

#### **5. Challenges and perspectives**

The results we obtained with bioinformatics tools and biochemical experiments allow us to propose a model for the ESCRT machinery in *E. histolytica* (Fig. 3). Since we found the EhVps27 component of the ESCRT-0 complex, we suggest that it may initiate the MVB sorting process. EhVps27 has a FYVE domain that possibly mediates protein binding to the endosomal membrane. However, EhVps27 lacks the UIM domain, which is important for the initial selection of cargo that has been ubiquitinated, probably by EhRsp5. Perhaps EhVps23, through its UEV motif, or another unidentified protein recruits cargo proteins to endosomes. Furthermore, the EhVps23 UEV domain could associate with EhVps27 and with other components of the ESCRT-I complex, which includes the EhVps28 and EhVps37 proteins. Then, ESCRT-I binds to ESCRT-II (formed by the EhVps22, EhVps25 and EhVps36 proteins). Although EhVps36 does not exhibit a ubiquitin-interacting domain as its yeast homologues do, it contains a recognition region for phosphoinositides that presumably allows ESCRT-II attachment to the endosomal membrane. Next, ESCRT-II binds to the ESCRT-III complex, which contains all the components previously described for yeast. Interestingly, like yeast Vps20, EhVps20 has a myristoylation modification that facilitates ESCRT-III insertion into the endosomal membrane. Then, ESCRT-III interaction


Fig. 2. Structural comparison of Vps32 homologues.

(A) At the top, a schematic representation of the human CHMP4B, yeast Vps32 and *E. histolytica* EhVps32 proteins is shown. Numbers indicate amino acids for each protein. All proteins contain the conserved Snf7 domain present in Snf7 family proteins. Vps32 orthologues belong to the ESCRT-III complex and have been described as the interacting partners of Bro1 domain-containing proteins. At the bottom, a multiple sequence alignment of Vps32 homologues is shown. Hs, *H. sapiens*; Sc, *S. cerevisiae*; and Eh, *E. histolytica*. Black boxes, identical amino acids; grey boxes, conserved substitutions; and open boxes, different residues. Numbers at left are relative to the position of the start codon in each protein. The Jpred secondary structure prediction program revealed that EhVps32 folds into five α-helices (green horizontal cylinders), as has been reported for Vps32 homologues. (B) Tertiary protein structure of the *H. sapiens*, *S. cerevisiae* and *E. histolytica* Vps32 homologues. Modeling was done using the Phyre program with the crystal structure of human CHMP3 as template.


Fig. 3. Model for the role of the ESCRT machinery in *E. histolytica* within the endosomal MVB pathway.

In *E. histolytica,* the EhRsp5 protein could be responsible for cargo protein ubiquitination. Then, the EhVps27 protein could initiate the MVB process. Similar to yeast Vps27, EhVps27 has a FYVE domain that binds PtdIns3P, allowing endosomal membrane attachment. However, EhVps27 lacks the UIM domain, which is important for ubiquitin recognition in cargo proteins. Instead, EhVps23 could mediate this event through its UEV motif. Subsequently, EhVps27 binds to the ESCRT-I complex through EhVps23. Then, EhVps36 binds to PtdIns3P through its positively charged region, facilitating ESCRT-II attachment to endosomal membranes. *E. histolytica* contains all ESCRT-III components, which belong to the Snf7 family of proteins. In addition, it has several accessory proteins, including EhADH112 (a Bro1 domain-containing protein), EhDoa4 (a deubiquitinating enzyme that removes ubiquitin from cargo) and EhVps4 (an ATPase). Finally, as in yeast, EhVps4 may play a critical role in catalyzing the dissociation of ESCRT from the endosomal membrane in order to start new rounds of cargo protein sorting through MVB.

with accessory proteins could be mediated by EhVps32. In fact, EhVps32 could associate with EhADH112 through the putative N-terminal Bro1 domain of the latter (our unpublished data). In addition, EhADH112 could recruit another accessory molecule, the EhDoa4 ubiquitin hydrolase, which would remove ubiquitin from cargo prior to MVB internalization. Finally, the EhVps4 ATPase might catalyze the disassembly of the ESCRT complex from the endosomal membrane to initiate new rounds of cargo sorting and vesicle formation. Possibly, EhVta1 has a role in regulating EhVps4 function.

Of note, *E. histolytica* possesses a conserved ESCRT machinery. However, the proposed ESCRT functions and putative interactions along the MVB pathway still need to be corroborated experimentally.
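The assembly order proposed in this section can be summarized as a small data structure. The following Python sketch is only an illustration: the names `ESCRT_MODEL`, `complex_of` and `recruitment_steps` are hypothetical, and the member lists include only the proteins explicitly named in this section, not the full set of putative components.

```python
# Proposed order of complex recruitment in the E. histolytica model (Fig. 3).
# Only proteins named in this section are listed; the lists are not exhaustive.
ESCRT_MODEL = [
    ("ESCRT-0", ["EhVps27"]),
    ("ESCRT-I", ["EhVps23", "EhVps28", "EhVps37"]),
    ("ESCRT-II", ["EhVps22", "EhVps25", "EhVps36"]),
    ("ESCRT-III", ["EhVps20", "EhVps32"]),
    ("accessory", ["EhADH112", "EhDoa4", "EhVps4", "EhVta1"]),
]


def complex_of(protein):
    """Return the complex a protein is assigned to in the model, or None."""
    for name, members in ESCRT_MODEL:
        if protein in members:
            return name
    return None


def recruitment_steps():
    """Return consecutive (upstream, downstream) pairs of the proposed order."""
    names = [name for name, _ in ESCRT_MODEL]
    return list(zip(names, names[1:]))


for upstream, downstream in recruitment_steps():
    print(f"{upstream} -> {downstream}")
print(complex_of("EhVps36"))
```

Keeping the model as an ordered list of complexes makes it easy to revise when experimental data confirm or reject a proposed interaction, since each step is an explicit pair rather than prose.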

#### **6. Conclusions**


Bioinformatics, the application of statistics and computer science to molecular biology, entails the creation and advancement of databases, algorithms, computational and statistical techniques and theory to solve formal and practical problems arising from the management and analysis of biological data. In this chapter, we used bioinformatics to analyze the ESCRT protein machinery possibly participating in the endosomal pathways of parasitic protozoa, with particular attention to the *E. histolytica* case.

The ESCRT machinery comprises a set of protein complexes that regulate the recognition, sorting and trafficking of monoubiquitinated proteins into MVB compartments for lysosomal degradation. Previous work has shed light on the molecular details underlying the assembly and regulation of ESCRT in yeast and human. Here, we took advantage of the availability of eukaryotic pathogen genome databases and of bioinformatics tools to identify proteins representing putative ESCRT components in protozoan parasites of medical importance. We found representative proteins for ESCRT-0, -I, -II, -III and accessory proteins in almost all protozoa examined, with *E. histolytica* and *L. major* being the parasites in which ESCRT components were best represented. Despite these findings, several issues need to be addressed experimentally to determine in detail the structure and function of ESCRT proteins and their putative role during endocytosis in these parasites.

In *E. histolytica,* we found a highly conserved ESCRT machinery with 19 putative components representing all complexes. These findings have been experimentally confirmed by determining the expression of most ESCRT gene transcripts (López-Reyes et al., 2010). Furthermore, our current *in silico* results suggest that some *E. histolytica* ESCRT-0 to -III components contain putative FYVE or ubiquitin-binding domains, both of which are important for recruiting cargo molecules to endosomal membranes. In addition, our computational analysis, together with the previous functional characterization of putative *E. histolytica* ESCRT accessory proteins, strongly suggests the presence of a Bro1 domain-containing protein (EhADH112), its putative interacting partner, EhVps32, and an ATPase (EhVps4) that may be responsible for energy-dependent ESCRT disassembly. Of note, tertiary structure modeling of EhVps32 supported our experimental findings on EhADH112 binding to EhVps32, which illustrates the value of bioinformatical approaches. Therefore, our overall results provide significant evidence for a conserved role of the *E. histolytica* ESCRT machinery in the MVB endocytic pathway.

In summary, bioinformatics and experimental approaches can improve our understanding of the evolutionary implications of the MVB sorting pathway in *E. histolytica, L. major, T. cruzi, P. falciparum, T. gondii* and *G. lamblia* and can help elucidate its possible relationship to parasite pathogenicity and virulence.

Although some limitations exist due to the incompleteness of experimental data, we conclude that computational methods have reasonable predictive accuracy and provide an invaluable basis for further experimental validation.

#### **7. Acknowledgements**

The authors would like to thank Dra. Rossana Arroyo, Dr. Jaime Ortega and Dr. Michael Schnoor for providing their comments on the manuscript and Alfredo Padilla-Barberi for efforts in the artwork.

#### **8. References**

Allen, C.L., Liao, D., Chung, W.L. & Field, M.C. (2007). Dileucine signal-dependent and AP-1-independent targeting of a lysosomal glycoprotein in *Trypanosoma brucei*. *Molecular and Biochemical Parasitology*, Vol. 156, pp. 175–190, ISSN 0166-6851

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990). Basic local alignment search tool. *Journal of Molecular Biology*, Vol. 215, No. 3, pp. 403-410, ISSN 0022-2836

Babst, M., Katzmann, D.J., Estepa-Sabal, E.J., Meerloo, T. & Emr, S.D. (2002a). ESCRT-III: an endosome-associated heterooligomeric protein complex required for MVB sorting. *Developmental Cell*, Vol. 3, pp. 271–282, ISSN 1534-5807

Babst, M., Katzmann, D.J., Snyder, W.B., Wendland, B. & Emr, S.D. (2002b). Endosome-associated complex, ESCRT-II, recruits transport machinery for protein sorting at the multivesicular body. *Developmental Cell*, Vol. 3, pp. 283–289, ISSN 1534-5807

Babst, M., Wendland, B., Estepa, E.J. & Emr, S.D. (1998). The Vps4p AAA ATPase regulates membrane association of a Vps protein complex required for normal endosome function. *The EMBO Journal*, Vol. 17, pp. 2982–2993, ISSN 0261-4189

Bañuelos, C., García-Rivera, G., López-Reyes, I. & Orozco, E. (2005). Functional characterization of EhADH112: an *Entamoeba histolytica* Bro1 domain-containing protein. *Experimental Parasitology*, Vol. 110, No. 3, pp. 292-297, ISSN 0014-4894

Bañuelos, C., López-Reyes, I., García-Rivera, G., González-Robles, A. & Orozco, E. (2007). The presence of a Snf7-like protein strengthens a role for EhADH in the *Entamoeba histolytica* multivesicular bodies pathway. *Proceedings of the 5th European Congress on Tropical Medicine and International Health*, Boeree, M.J. (ed), Vol. 978, pp. 31–35, Amsterdam, The Netherlands

Besteiro, S., Williams, R.A., Morrison, L.S., Coombs, G.H. & Mottram, J.C. (2006). Endosome sorting and autophagy are essential for differentiation and virulence of *Leishmania major*. *The Journal of Biological Chemistry*, Vol. 281, No. 16, pp. 11384-11396, ISSN 0021-9258

Bonangelino, C.J., Chavez, E.M. & Bonifacino, J.S. (2002). Genomic screen for vacuolar protein sorting genes in *Saccharomyces cerevisiae*. *Molecular Biology of the Cell*, Vol. 13, pp. 2486–2501, ISSN 1059-1524

Bowers, K., Lottridge, J., Helliwell, S.B., Goldthwaite, L.M., Luzio, J.P. & Stevens, T.H. (2004). Protein–Protein Interactions of ESCRT Complexes in the Yeast *Saccharomyces cerevisiae*. *Traffic*, Vol. 5, pp. 194–210, ISSN 1398-9219

Chiang, Y.S., Gelfand, T.I., Kister, A.E. & Gelfand, I.M. (2007). New classification of supersecondary structures of sandwich-like proteins uncovers strict patterns of strand assemblage. *Proteins*, Vol. 68, No. 4, pp. 915–921, ISSN 0887-3585

Chung, W.L., Carrington, M. & Field, M.C. (2004). Cytoplasmic targeting signals in transmembrane invariant surface glycoproteins of trypanosomes. *The Journal of Biological Chemistry*, Vol. 279, pp. 54887–54895, ISSN 1067-8816

Claverie, J.M. & Notredame, C. (2006). *Bioinformatics for Dummies* (2nd ed). Wiley Publishing, Inc., ISBN 978-0-470-08985-9, Indianapolis, IN

Cole, C., Barber, J.D. & Barton, G.J. (2008). The Jpred 3 secondary structure prediction server. *Nucleic Acids Research*, Vol. 35, No. suppl. 2, pp. W197-W201, ISSN 0305-1048

Coppens, I. & Joiner, K.A. (2003). Host but not parasite cholesterol controls *Toxoplasma* entry by modulating organelle discharge. *Molecular Biology of the Cell*, Vol. 14, pp. 3804-3820, ISSN 1059-1524

Cuff, J.A. & Barton, G.J. (1999). Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. *Proteins*, Vol. 34, No. 4, pp. 508-519, ISSN 0887-3585

Curtiss, M., Jones, C. & Babst, M. (2007). Efficient cargo sorting by ESCRT-I and the subsequent release of ESCRT-I from multivesicular bodies requires the subunit Mvb12. *Molecular Biology of the Cell*, Vol. 18, No. 2, pp. 636-645, ISSN 1059-1524

Dacks, J.B. & Field, M.C. (2007). Evolution of the eukaryotic membrane-trafficking system: origin, tempo and mode. *Journal of Cell Science*, Vol. 120, pp. 2977–2985, ISSN 0021-9533

Dacks, J.B., Poon, P.P. & Field, M.C. (2008). Phylogeny of endocytic components yields insight into the process of non-endosymbiotic organelle evolution. *Proceedings of the National Academy of Sciences of the United States of America*, Vol. 105, pp. 588–593, ISSN 0027-8424

de Souza, W., Sant'Anna, C. & Cunha-e-Silva, N.L. (2009). *Progress in Histochemistry and Cytochemistry*, Vol. 44, No. 2, pp. 67-124, ISSN 0079-6336

Dimaano, C., Jones, C.B., Hanono, A., Curtiss, M. & Babst, M. (2008). Ist1 regulates Vps4 localization and assembly. *Molecular Biology of the Cell*, Vol. 19, No. 2, pp. 465-474, ISSN 1059-1524

Field, M.C., Gabernet-Castello, C. & Dacks, J.B. (2007). Reconstructing the evolution of the endocytic system: insights from genomics and molecular cell biology. *Advances in experimental medicine and biology*, Vol. 607, pp. 84–96, ISSN 0065-2598

Fujita, H., Yamanaka, M., Imamura, K., Tanaka, Y., Nara, A., Yoshimori, T., Yokota, S. & Himeno, M. (2003). A dominant negative form of the AAA ATPase SKD1/VPS4 impairs membrane trafficking out of endosomal/lysosomal compartments: class E vps phenotype in mammalian cells. *Journal of Cell Science*, Vol. 116, Pt 2, pp. 401-414, ISSN 0021-9533

García-Rivera, G., Rodríguez, M.A., Ocádiz, R., Martínez-López, M.C., Arroyo, R., González-Robles, A. & Orozco, E. (1999). *Entamoeba histolytica*: a novel cysteine protease and an adhesin form the 112 kDa surface protein. *Molecular Microbiology*, Vol. 33, No. 3, pp. 556-68, ISSN 0950-382X

Geoff, H. (1997). The Molecular Epidemiology of Parasites, In: *Principles of Medical Biology, Microbiology*, Edward Bittar (ed), pp. 597-614, JAI Press Inc., ISBN 1-55938-814-5, Greenwich, Conn

Ghedin, E., Debrabant, A., Engel, J.C. & Dwyer, D.M. (2001). Secretory and endocytic pathways converge in a dynamic endosomal system in a primitive protozoan. *Traffic*, Vol. 2, pp. 175–188, ISSN 1398-9219

Gietz, R.D., Triggs-Raine, B., Robbins, A., Graham, K. & Woods, R. (1997). Identification of proteins that interact with a protein of interest: Applications of the yeast two-hybrid system. *Molecular and Cellular Biochemistry*, Vol. 172, No. 1-2, pp. 67–79, ISSN 0300-8177

Henikoff, S. & Henikoff, J.G. (2000). Amino acid substitution matrices. *Advances in Protein Chemistry*, Vol. 54, pp. 73-97, ISSN 0065-3233

Hoppe, H.C., Ngo, H.M., Yang, M. & Joiner, K.A. (2000). Targeting to rhoptry organelles of *Toxoplasma gondii* involves evolutionarily conserved mechanisms. *Nature Cell Biology*, Vol. 2, pp. 449-456, ISSN 1465-7392

http://align.genome.jp/

http://bips.u-strasbg.fr/fr/Tutorials/Comparison/Blast/blastall.html

http://eupathdb.org/eupathdb/

http://expasy.org/tools/#proteome

http://jmol.sourceforge.net/download/

http://swissmodel.expasy.org/

http://www.ch.embnet.org/software/BOX\_form.html

http://www.pdb.org/pdb/home/home.do

http://www.sanger.ac.uk/resources/databases/pfam.html

http://www.uniprot.org/

Hurley, J.H. & Emr, S.D. (2006). The ESCRT Complexes: Structure and Mechanism of a Membrane-Trafficking Network. *Annual Review of Biophysics Biomolecular Structure*, Vol. 35, pp. 277–298, ISSN 1056-8700

Hurley, J.H. & Hanson, P.I. (2010). Membrane budding and scission by the ESCRT machinery: it's all in the neck. *Nature Reviews. Molecular Cell Biology*, Vol. 11, No. 8, pp. 556-566, ISSN 1471-0072

Kaltenbach, L.S., Romero, E., Becklin, R.R., Chettier, R., Bell, R., Phansalkar, A., Strand, A., Torcassi, C., Savage, J., Hurlburt, A., Cha, G.H., Ukani, L., Chepanoske, C.L., Zhen, Y., Sahasrabudhe, S., Olson, J., Kurschner, C., Ellerby, L.M., Peltier, J.M., Botas, J. & Hughes, R.E. (2007). Huntingtin interacting proteins are genetic modifiers of neurodegeneration. *PLoS Genetics*, Vol. 3, No. 5, pp. e82, ISSN 1553-7390

Kanazawa, C., Morita, E., Yamada, M., Ishii, N., Miura, S., Asao, H., Yoshimori, T. & Sugamura, K. (2003). Effects of deficiencies of STAMs and Hrs, mammalian class E Vps proteins, on receptor downregulation. *Biochemical Biophysical Research Communication*, Vol. 309, No. 4, pp. 848-856, ISSN 0006-291X

Katzmann, D.J., Babst, M. & Emr, S.D. (2001). Ubiquitin-dependent sorting into the multivesicular body pathway requires the function of a conserved endosomal protein sorting complex, ESCRT-I. *Cell*, Vol. 106, pp. 145–155, ISSN 0092-8674

Kelley, L.A. & Sternberg, M.J. (2009). Protein structure prediction on the Web: a case study using the Phyre server. *Nature protocols*, Vol. 4, No. 3, pp. 363-371, ISSN 1754-2189

Kim, J., Sitaraman, S., Hierro, A., Beach, B.M., Odorizzi, G. & Hurley, J.H. (2005). Structural basis for endosomal targeting by the Bro1 domain. *Developmental Cell*, Vol. 8, No. 6, pp. 937-947, ISSN 1534-5807

Kuchaiev, O. & Przulj, N. (2011). Integrative Network Alignment Reveals Large Regions of Global Network Similarity in Yeast and Human. *Bioinformatics*, Vol. Mar 16, [Epub ahead of print], ISSN 1367-4803

Lanfredi-Rangel, A., Attias, M., de Carvalho, T.M., Kattenbach, W.M. & de Souza, W. (1998). The peripheral vesicles of trophozoites of the primitive protozoan *Giardia lamblia* may correspond to early and late endosomes and to lysosomes. *Journal of Structural Biology*, Vol. 123, pp. 225–235, ISSN 1047-8477

Leung, K.F., Dacks, J.B. & Field, M.C. (2008). Evolution of the Multivesicular Body ESCRT Machinery; Retention Across the Eukaryotic Lineage. *Traffic*, Vol. 9, pp. 1698–1716, ISSN 1398-9219

Li, Y., Kane, T., Tipper, C., Spatrick, P. & Jenness, D.D. (1999). Yeast mutants affecting possible quality control of plasma membrane proteins. *Molecular Cell Biology*, Vol. 19, No. 5, pp. 3588-3599, ISSN 1471-0072

López-Reyes, I., García-Rivera, G., Bañuelos, C., Herranz, S., Vincent, O., López-Camarillo, C., Marchat, L.A. & Orozco, E. (2010). Detection of the endosomal sorting complex required for transport in *Entamoeba histolytica* and characterization of the EhVps4 protein. *Journal of Biomedicine & Biotechnology*, Vol. 2010, pp. 890674, ISSN 1110-7243

Misra, S. & Hurley, J.H. (1999). Crystal structure of a phosphatidylinositol 3-phosphate-specific membrane-targeting motif, the FYVE domain of Vps27p. *Cell*, Vol. 97, No. 5, pp. 657-666, ISSN 0092-8674

Nakada-Tsukui, K., Okada, H., Mitra, B.N. & Nozaki, T. (2009). Phosphatidylinositol-phosphates mediate cytoskeletal reorganization during phagocytosis via a unique modular protein consisting of RhoGEF/DH and FYVE domains in the parasitic protozoan *Entamoeba histolytica*. *Cellular Microbiology*, Vol. 11, No. 10, pp. 1471-1491, ISSN 1462-5814

Obita, T., Saksena, S., Ghazi-Tabatabai, S., Gill, D.J., Perisic, O., Emr, S.D. & Williams, R.L. (2007). Structural basis for selective recognition of ESCRT-III by the AAA ATPase Vps4. *Nature*, Vol. 449, pp. 735–739, ISSN 0028-0836

Odorizzi, G., Katzmann, D.J., Babst, M., Audhya, A. & Emr, S.D. (2003). Bro1 is an endosome-associated protein that functions in the MVB pathway in *Saccharomyces cerevisiae*. *Journal of Cell Science*, Vol. 116, pp. 1893–1903, ISSN 0021-9533

Offer, G., Hicks, M.R. & Woolfson, D.N. (2002). Generalized Crick equations for modeling noncanonical coiled coils. *Journal of Structural Biology*, Vol. 137, No. 1-2, pp. 41-53, ISSN 1047-8477

Pandeya, S.N. & Thakkar, D. (2005). Combinatorial chemistry: A novel method in drug discovery and its application. *Indian Journal of Chemistry*, Vol. 44B, pp. 335-348, ISSN 0019-5103

Peng, Z.Y. & Kim, P.S. (1994). A protein dissection study of a molten globule. *Biochemistry*, Vol. 33, No. 8, pp. 2136-2141, ISSN 0006-2960

Pornillos, O., Alam, S.L., Rich, R.L., Myszka, D.G., Davis, D.R. & Sundquist, W.I. (2002). Structure and functional interactions of the Tsg101 UEV domain. *The EMBO Journal*, Vol. 21, No. 10, pp. 2397-2406, ISSN 0261-4189

Shiflett, S.L., Ward, D.M., Huynh, D., Vaughn, M.B., Simmons, J.C. & Kaplan, J. (2004). Characterization of Vta1p, a class E Vps protein in *Saccharomyces cerevisiae*. *The Journal of Biological Chemistry*, Vol. 279, No. 12, pp. 10982-10990, ISSN 0021-9258

Sinha, A., Mandal, S., Banerjee, S., Ghosh, A., Ganguly, S., Sil, A.K. & Sarkar, S. (2010). Identification and Characterization of a FYVE Domain from the Early Diverging Eukaryote *Giardia lamblia*. *Current Microbiology*, Vol. Dec 17, [Epub ahead of print], ISSN 0343-8651

Söding, J. (2005). Protein homology detection by HMM-HMM comparison. *Bioinformatics*, Vol. 21, No. 7, pp. 951-60, ISSN 1367-4803


**15**

### **Structural Bioinformatics Analysis of Acid Alpha-Glucosidase Mutants with Pharmacological Chaperones**

Sheau Ling Ho *Chinese Culture University Taiwan* 

#### **1. Introduction**

Most lysosomal storage disorders (LSDs) are inherited and are caused by the deficiency of a single lysosomal hydrolase, which leads to the accumulation of the corresponding substrate. LSDs can also result from mutations in proteins involved in the intracellular trafficking of lysosomal enzymes (Carrell & Lomas, 1997; Kopito & Ron, 2000; Selkoe, 2003; Arakawa et al., 2006). Indeed, LSDs are considered a group of more than sixty diverse inherited disorders, each of which is due to a specific enzymatic defect (Hodges & Cheng, 2006; Raben et al., 2009). Pompe disease is one of these LSDs; it arises from point mutations (single amino acid substitutions of the wild type) in the gene that encodes acid α-glucosidase (GAA). The resulting total or partial deficiency of lysosomal acid α-glucosidase causes glycogen to accumulate in lysosomes (Alberts et al., 2002; Raben et al., 2002; Bernier et al., 2004; Kroos et al., 2008).

Recently, various small-molecule pharmacological chaperones have been found to increase the stability of such mutant proteins and to facilitate the efficient trafficking of lysosomal enzymes. This finding points the way to a new therapeutic approach to LSD treatment. In this study, we are concerned with revealing the mechanisms and accurate structures underlying the folding defects of the enzyme mutants involved, as well as the way in which these mutants interact with small-molecule pharmacological chaperones.

The pharmacological chaperone 1-deoxynojirimycin (DNJ) showed improvement in the treatment of Pompe disease. Yet experimental data have shown that only some GAA mutants respond well to this pharmacological chaperone (Hirschhorn & Reuser, 2001; Petsko & Ringe, 2004; Chaudhuri & Paul, 2006; Sugawara et al., 2009; Flanagan et al., 2009). To improve the stability of mutant enzymes, it is important to understand the molecular interaction between the enzyme and the chaperones. Since neighboring residues share physical characteristics, we undertook a detailed study of the surroundings of GAA variants in the structures (Zvelebil et al., 1987). Thus, we herein aim at discriminating between structural, as opposed to, GAA mutants, based on analysis of their local environments.

Despite the absence of crystallographic data of human acid alpha-glucosidase, we reviewed recently published papers to construct a structural model of human maltase-glucoamylase



Lysosomal storage disorders (LSDs) are mostly inherited and are typically caused by the deficiency of a single lysosomal hydrolase, which leads to accumulation of the corresponding substrate. LSDs can also result from mutations in proteins involved in the intracellular trafficking of lysosomal enzymes (Carrell & Lomas, 1997; Kopito & Ron, 2000; Selkoe, 2003; Arakawa et al., 2006). LSDs comprise a group of more than sixty diverse inherited disorders, each due to a specific enzymatic defect (Hodges & Cheng, 2006; Raben et al., 2009). Pompe disease is one of these LSDs, caused by point mutations (single amino acid substitutions relative to the wild type) in the gene that encodes acid α-glucosidase (GAA). The resulting total or partial deficiency of lysosomal acid α-glucosidase causes glycogen to accumulate in lysosomes (Alberts et al., 2002; Raben et al., 2002; Bernier et al., 2004; Kroos et al., 2008).

Recently, various small-molecule pharmacological chaperones have been found to increase the stability of such mutant proteins and to facilitate the efficient trafficking of lysosomal enzymes, pointing the way to a new therapeutic approach for LSDs. In this study, we aim to reveal the mechanism and the structures underlying the folding defects of the mutant enzymes, as well as the way in which these mutants interact with small-molecule pharmacological chaperones.

The pharmacological chaperone 1-deoxynojirimycin (DNJ) has shown benefit in the treatment of Pompe disease. Experimental data, however, show that only some GAA mutants respond well to this chaperone (Hirschhorn & Reuser, 2001; Petsko & Ringe, 2004; Chaudhuri & Paul, 2006; Sugawara et al., 2009; Flanagan et al., 2009). To improve the stability of mutant enzymes, it is important to understand the molecular interaction between the enzyme and the chaperone. Since neighboring residues share physical characteristics, we undertook a detailed study of the surroundings of the GAA variants in the structures (Zvelebil et al., 1987). We therefore aim to discriminate among GAA mutants on the basis of an analysis of their local structural environments.

Despite the absence of crystallographic data for human acid alpha-glucosidase, we drew on recently published work to construct a structural model of the enzyme by homology modeling, using the structure of human maltase-glucoamylase (MGAM; PDB ID: 2QLY) as a template. Note that there is approximately 44% amino acid sequence identity between GAA and the template. Based on the sequence alignment, GAA residues 84-952 were threaded onto the MGAM template. The active-site regions of GAA and MGAM overlaid well, and the key catalytic residues showed highly similar spatial alignment (D518/D616 in GAA and D445/D542 in the template, respectively).

This study comprised an active-site analysis, in which we used the proposed model to determine whether any conformational changes take place at the active site of GAA mutants, and molecular docking studies with DNJ, from which we present the geometry of the binding site in the GAA/DNJ and mutant GAA/DNJ complexes. Both were carried out by visual inspection of the atomic models, examining the interaction between the human GAA variants and the chaperone in terms of binding energy and the spatial orientation of the active site. Such structural studies should improve our understanding of enzyme stability, molecular recognition and binding, and should help to further elucidate the molecular basis of Pompe disease.

#### **2. Methodology**

#### **2.1 Structural modelling of the wild-type and mutant human acid α-glucosidase**

A structural model of wild-type human acid α-glucosidase was built by homology modeling using the molecular modeling software MIFit (a cross-platform interactive graphics application for molecular modelling) and the Molecular Operating Environment, MOE (Chemical Computing Group Inc.). The structure of human intestinal maltase-glucoamylase (PDB: 2QLY) was used as the template, and energy minimization was then carried out. The root-mean-square deviation (RMSD) was computed over all atoms of the protein backbone; the value was less than 0.6 Å, which indicates considerable structural similarity.
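For reference, the RMSD between two structures is conventionally defined over $N$ pairs of equivalent atoms after optimal superposition. The notation below (position vectors $\mathbf{r}_i$ for model and template atoms) is introduced here only for illustration; the chapter does not state which exact formula the modeling software applies.

$$
\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\lVert \mathbf{r}_i^{\mathrm{model}} - \mathbf{r}_i^{\mathrm{template}} \right\rVert^{2}}
$$

Under this definition, $N$ would be the number of backbone atoms compared, and a value below 0.6 Å means that equivalent backbone atoms lie, in the root-mean-square sense, within 0.6 Å of one another.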

More than one hundred different GAA mutations known to cause Pompe disease are predicted to produce full-length proteins carrying a single amino acid substitution. Based on the wild-type human acid α-glucosidase model, structural models of the mutants incorporating these substitutions were therefore constructed using MIFit, and each initial model was further refined by energy minimization. However, because of the low amino acid sequence identity between human acid α-glucosidase and the template, the investigations were restricted to a limited region of the enzyme.

#### **2.2 Molecular docking**

#### **2.2.1 Preparation of ligand**

The initial structure of the pharmacological chaperone 1-deoxynojirimycin (DNJ) (Figure 1) for docking was generated using ChemDraw Ultra Version 9.0 (CambridgeSoft Corp.). Geometry-optimized ligands were then prepared using MOE.

#### **2.2.2 Docking**

Based on the reported effect of DNJ on responsive GAA mutants, six GAA variants with severe effects (G377R, A445P, L552P, Y575S, E579K, and H612Q) and wild-type GAA were chosen as receptors for docking (Flanagan et al., 2009). Enzyme and ligand structures were imported into MOE 2010.10, where three-dimensional structures were generated using a coarse energy minimization protocol and the MMFF94x force field (Halgren, 1996, 1999).

Fig. 1. 3D and 2D visualization of DNJ.

#### **2.2.3 Structural analysis of GAA mutation**

To examine the effect of each substitution on atomic positions, each mutant model was superimposed on the wild-type structure on the basis of the Cα atoms by least-squares fitting (Matsuzawa et al., 2005; Saito et al., 2008). We assumed that the structure was influenced by an amino acid substitution when the position of an atom in a mutant differed from that in the wild-type structure; such substitutions were expected to affect neighboring residues and to alter the electrostatic surface of the enzyme locally.
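The superposition step described above can be sketched with the Kabsch algorithm, a standard least-squares fitting method. This is only an illustrative sketch, not the procedure implemented in MIFit or MOE: the function names, the use of NumPy, and the 0.5 Å displacement cutoff are assumptions introduced here, since the chapter does not state a threshold.

```python
import numpy as np

def kabsch(mobile, reference):
    """Rotation R and translation t minimising sum |R @ m_i + t - r_i|^2."""
    mobile = np.asarray(mobile, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mob_c, ref_c = mobile.mean(axis=0), reference.mean(axis=0)
    # 3x3 covariance matrix of the centred coordinate sets
    H = (mobile - mob_c).T @ (reference - ref_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation, not a reflection
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = ref_c - R @ mob_c
    return R, t

def displaced_atoms(mut_ca, wt_ca, mut_all, wt_all, cutoff=0.5):
    """Fit on C-alpha atoms, then compare all equivalent atoms.

    Returns the indices of atoms displaced by more than `cutoff` angstroms
    (a hypothetical threshold) and the RMSD over all compared atoms.
    """
    R, t = kabsch(mut_ca, wt_ca)
    fitted = np.asarray(mut_all, dtype=float) @ R.T + t
    dist = np.linalg.norm(fitted - np.asarray(wt_all, dtype=float), axis=1)
    rmsd = float(np.sqrt(np.mean(dist ** 2)))
    return np.flatnonzero(dist > cutoff), rmsd
```

Fitting on the Cα atoms only, and then applying the same transformation to every atom, keeps side-chain movements from biasing the superposition; the atoms flagged by `displaced_atoms` are the ones that would be treated as influenced by the substitution.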

#### **3. Results and discussion**

#### **3.1 Structure modelling of human GAA**

#### **3.1.1 Wild-type**


Our constructed wild-type model of GAA appears to be composed of five domains: a trefoil type-P domain (residues 89–135), an N-terminal β-sandwich domain (residues 136–346), a catalytic (β/α)8 barrel domain (residues 347–723) with two inserted loops, insert 1 (residues 444–491) and insert 2 (residues 522–567), protruding between β3 and α3 and between β4 and α4, respectively, a proximal C-terminal domain (residues 724–818) and a distal C-terminal domain (residues 819–952) (Figure 2). The key catalytic residues (D518 and D616) (Hermans et al., 1991; Sugawara et al., 2008; Sugawara et al., 2009) and the sequence motifs of family 31 glycosyl hydrolases were well conserved (Davies & Henrissat, 1995; Lovering et al., 2005).

The proposed active-site pocket was composed of residues W376, W402, D404, I441, D443, W481, W516, D518, M519, F525, R600, W613, D616, D645, F649, and H674 (see Figure 3). As in many other sugar-binding enzymes, numerous hydrophobic residues lined the active-site pocket, including W376, W402, I441, W481, W516, F525, W613, and F649.

#### **3.1.2 Wild-type vs. mutants**

The six severe, DNJ-responsive mutant forms of GAA were superposed on the wild-type model. After superposition, the RMSD between the wild type and each mutant was computed over the active-site pocket; in every case the value was less than 0.8 Å. These comparisons are illustrated in Figure 3.
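A pocket-restricted RMSD of this kind can be sketched as follows. The sketch assumes that the wild-type and mutant models have already been superposed and are available as PDB-format text, and it compares Cα atoms only; the chapter does not say which atoms of the pocket were included, so that choice, like the function names, is an assumption made for illustration. The residue numbers are those listed for the proposed active-site pocket.

```python
import math

# Residue numbers of the proposed active-site pocket, as listed in the text
POCKET = {376, 402, 404, 441, 443, 481, 516, 518,
          519, 525, 600, 613, 616, 645, 649, 674}

def ca_coords(pdb_text, residues=POCKET):
    """Map residue number -> (x, y, z) of its C-alpha atom, pocket residues only."""
    coords = {}
    for line in pdb_text.splitlines():
        # Fixed-column PDB ATOM record: atom name in columns 13-16,
        # residue number in 23-26, coordinates in 31-54
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            resseq = int(line[22:26])
            if resseq in residues:
                coords[resseq] = (float(line[30:38]),
                                  float(line[38:46]),
                                  float(line[46:54]))
    return coords

def pocket_rmsd(wt_coords, mut_coords):
    """RMSD over the pocket residues present in both (pre-superposed) models."""
    shared = sorted(set(wt_coords) & set(mut_coords))
    if not shared:
        raise ValueError("no pocket residues in common")
    sq = sum(math.dist(wt_coords[r], mut_coords[r]) ** 2 for r in shared)
    return math.sqrt(sq / len(shared))
```

Calling `pocket_rmsd(ca_coords(wt_text), ca_coords(mut_text))` for each variant would give one value per mutant to compare against the 0.8 Å figure reported above.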


Fig. 2. Ribbon diagram of the GAA structural model. The shaded orange circle marks the active-site pocket.

Fig. 3. A close-up view of the active-site pocket (W376, W402, D404, I441, D443, W481, W516, D518, M519, F525, R600, W613, D616, D645, F649, and H674).

Fig. 4. Superposition of the active-site pockets of wild-type GAA and the six mutant variants. The conserved catalytic residues D518/D616 are circled in red. In the G377R variant, Tyr turned forward in the active site.

The comparison showed no significant changes in the conformations of the amino acid residues that comprise the active site, and the key catalytic residues were conserved in the mutants. In the G377R mutant, however, Tyr veered forward in the active site. This might imply that new drugs can be designed, or existing drugs modified, on the basis of their interaction with this tyrosine residue (see Figure 4). This observation rules out a conformational difference between the mutant and the wild-type enzyme as the underlying cause of the reduction in catalytic activity.

#### **3.2 Docking**


Molecular docking predicts the possible geometries of protein-ligand complexes. To understand the interaction between the enzyme and the pharmacological chaperone DNJ, we examined the binding affinity of DNJ for the enzyme on the basis of the complex geometry and binding energy. In the complexes of DNJ with the enzyme models (wild type and mutants alike), the DNJ molecule fitted well into the active-site pocket.

In the wild type, residues D404, D518 and D616 were predicted to form hydrogen bonds with the hydroxyl groups and the nitrogen of DNJ inside the active-site pocket. Residues W376, I441, W481, W516, M519, W613, and F649 might be involved in hydrophobic interactions with DNJ. These residues are assumed to contribute to substrate-binding specificity. The active-site pocket accommodated DNJ well in terms of both space and binding. The same interaction was observed between DNJ and the active-site pocket residues of the wild type and the mutants: the nitrogen of DNJ interacted with D518 through hydrogen bonding (Figure 5 and Figure 6).

DNJ fitted well into the active-site pocket, and only a limited space was observed between the nitrogen atom of DNJ and the wall of the active-site pocket in both wild-type GAA and the mutant variants (Figure 7, Figure 8 and Figure 9).
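Since the comparison between complexes rests on binding energy as well as geometry, it can help to express each mutant's docking energy relative to the wild type. The short sketch below does this with the values listed in Table 1 of this chapter; the pairing of each value with a variant follows the column order of that table as we read it, and the function name is hypothetical.

```python
# Docking energies (kcal/mol) as listed in Table 1 of this chapter;
# the value-to-variant pairing assumes the table's column order.
ENERGIES = {
    "wild-type": -149.782,
    "G377R": -107.416,
    "A445P": -99.904,
    "L552P": -130.414,
    "Y575S": -102.599,
    "E579K": -109.852,
    "H612Q": -95.383,
}

def energy_shifts(energies, reference="wild-type"):
    """Energy of each variant minus the reference energy (kcal/mol)."""
    ref = energies[reference]
    return {name: round(value - ref, 3)
            for name, value in energies.items() if name != reference}

shifts = energy_shifts(ENERGIES)
# List variants from the smallest to the largest shift away from wild type
for name, delta in sorted(shifts.items(), key=lambda item: item[1]):
    print(f"{name}: {delta:+.3f} kcal/mol relative to wild-type")
```

A positive shift means a less negative, that is less favorable, docking energy than the wild-type complex. Such differences are docking scores rather than measured affinities, so they are best read as a ranking aid.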

Fig. 5. The interaction diagram between DNJ and the wild-type GAA inside the active-site pocket.

Fig. 6. Structure of wild-type GAA bound to DNJ.

Fig. 7. Surface representation of the active-site pocket of wild-type GAA with bound DNJ.

Fig. 8. Structure of GAA mutant variants bound to DNJ: (a) G377R, (b) A445P, (c) L552P, (d) Y575S, (e) E579K, (f) H612Q.

Fig. 9. Surface representation of the active-site pocket of GAA variants with bound DNJ: (a) G377R, (b) A445P, (c) L552P, (d) Y575S, (e) E579K, (f) H612Q. G377R variant shows a larger narrow funnel-shaped region of the active-site cavity.

We noticed that a narrow funnel-shaped region of the active-site cavity was smaller in wild-type GAA than in the mutant variants. In particular, the G377R variant showed a larger funnel-shaped region than wild-type GAA or the other mutant variants, and in this variant Tyr also turned forward in the active site. Thus, it should be possible to modify this molecule to develop a novel derivative suitable for Pompe disease.

Molecular docking is a vital tool in drug discovery and development. It is used to predict protein–ligand complexes and to estimate the binding affinity of the ligand for the protein from the geometry of the complex. The binding energies also reflect the binding affinity of a ligand. The docking results are listed in Table 1. The binding energies of the mutant complexes (GAA variants) were higher (less negative) than that of the wild-type complex. It is thus interesting to speculate that the increase in binding energy caused by mutation might decrease the binding affinity of GAA towards DNJ, with consequences for the stabilization of GAA and the modulation of its activity.

| Ligand (DNJ) | Wild-type | G377R | A445P | L552P | Y575S | E579K | H612Q |
|---|---|---|---|---|---|---|---|
| Binding energy (kcal/mol) | -149.782 | -107.416 | -99.904 | -130.414 | -102.599 | -109.852 | -95.383 |

Table 1. Energy values obtained in docking calculation. Columns G377R to H612Q are the mutants (GAA variants).

Still, these binding energies may not be sufficient on their own to determine the binding affinity of ligands or drug candidates; other physical effects, such as electrostatic, van der Waals, hydrogen-bonding and hydrophobic interactions, could affect the binding affinity and also need to be evaluated.

#### **4. Conclusions**

This work involved active-site analysis, molecular docking and binding energy studies. We revealed the mechanism underlying the folding behavior of the mutant enzymes and the way they interact with small-molecule pharmacological chaperones on the basis of spatial models, which provide a basis for experimental validation. The validity of this approach was supported by the identification of some known GAA mutants. The conformational changes detected in the distribution of residues and their constituents around the GAA mutations should therefore be useful in improving our understanding of enzyme stability, molecular recognition and binding. We have demonstrated the structural conformations associated with wild-type and mutant GAA in their three-dimensional environment. The differences in binding energy may arise from the mutations, which could affect the binding affinity of DNJ. The complex structures and energy results presented here may inform therapeutic approaches to these diseases as well as the design of novel inhibitors associated with sucrose degradation.

#### **5. Acknowledgments**

The author is grateful to Dr. Wei-Chieh Cheng for his expert advice and useful discussion, and to Dr. Cheng-Yuan Huang for exchanging information.

#### **6. References**

Alberts B., Bray D., Hopkin K., Johnson A., Lewis J., Raff M., Roberts K., & Walter P. (2002). *Essential Cell Biology*. Garland Science Textbooks, London.

Arakawa T., Ejima D., Kita Y., & Tsumoto K. (2006). Small molecule pharmacological chaperones: From thermodynamic stabilization to pharmaceutical drugs. *Biochim Biophys Acta.* Vol. 764:1677-1687.

Bernier V., Lagace M., Bichet D.G., & Bouvier M. (2004). Pharmacological chaperones: potential treatment for conformational diseases. *TRENDS in Endocrinology and Metabolism* Vol. 15(5): 222-228.

Carrell R.W. & Lomas D.A. (1997). Conformational disease. *The Lancet* Vol. 350: 134–138.

Chaudhuri T.K. & Paul S. (2006). Protein-misfolding diseases and chaperone-based therapeutic approaches. *FEBS J.* Vol. 273:1331-1349.

Davies G. & Henrissat B. (1995). Structures and mechanism of glycosyl hydrolases. *Structure.* Vol. 3:853–859.

Flanagan JJ., Rossi B., Tang K., Wu X., Mascioli K., Donaudy F., Tuzzi MR., Fontana F., Cubellis MV., Porto C., Benjamin E., Lockhart DJ., Valenzano KJ., Andria G., Parenti G., & Do HV. (2009). The pharmacological chaperone 1-deoxynojirimycin increases the activity and lysosomal trafficking of multiple mutant forms of acid alpha-glucosidase. *Hum Mutat.* Vol. 30(12):1683-92.

Halgren T.A. (1996). Merck molecular force field. 1. Basis, form, scope, parameterization, and performance of MMFF94. *J. Comp. Chem.* Vol. 17(5–6):490–519.

Halgren T.A. (1999a). MMFF VI. MMFF94s Option for Energy Minimization Studies. *J. Comp. Chem.* Vol. 20:720-729.

Halgren T.A. (1999b). MMFF VII. Characterization of MMFF94, MMFF94s, and Other Widely Available Force Fields for Conformational Energies and for Intermolecular-Interaction Energies and Geometries. *J. Comp. Chem.* Vol. 20:730-748.

Hermans Monique M.P., Kroos Marian A., Beeumen Jos van, Oostra Ben A., & Reuser Arnold J.J. (1991). Human Lysosomal α-Glucosidase: characterization of the catalytic site. *J. Biol. Chem.* Vol. 266(21): 13507-13512.

Hirschhorn R. & Reuser A.J.J. (2001). Glycogen Storage Disease type II; acid α-Glucosidase (Acid Maltase) deficiency. *The Metabolic and Molecular Bases of Inherited Disease*. D.V. M.D., Editor, Mc Graw-Hill: New York.

Hodges B.L. & Cheng S.H. (2006). Cell and gene-based therapies for the lysosomal storage diseases. *Curr Gene Ther.* Vol. 6:227–241.

Kopito R.R. & Ron D. (2000). Conformational disease. *Nat. Cell Biol.* Vol. 2: E207–E209.

Kroos M., Pomponio RJ., van Vliet L., Palmer RE., Phipps M., Van der Helm R., Halley D., & Reuser A. (2008). GAA Database Consortium. Update of the Pompe disease mutation database with 107 sequence variants and a format for severity rating. *Hum Mutat.* Vol. 6: E13-E26.

Lovering A.L., Lee S.S., Kim Y-W, Withers S.G., & Strynadka N.C.J. (2005). Mechanistic and Structural Analysis of a Family 31 Glycosidase and Its Glycosyl-enzyme Intermediate. *J. Biol. Chem.* Vol. 280(3):2105–2115.

Matsuzawa F., Aikawa S., Doi H., Okumiya T., & Sakuraba H. (2005). Fabry disease: correlation between structural changes in alpha-galactosidase, and clinical and biochemical phenotype. *Hum. Genet.* Vol. 117:317–328.



**16** 



### **Bioinformatics Domain Structure Prediction and Homology Modeling of Human Ryanodine Receptor 2**

V. Bauerová-Hlinková1 et al.\*

*1Institute of Molecular Biology, Slovak Academy of Sciences, Bratislava, Slovakia* 

#### **1. Introduction**

324 Bioinformatics – Trends and Methodologies

Molecular Operating Environment (MOE) 2010.10. (2010.) *Chemical Computing Group*, Inc;

Petsko GA. & Ringe D. (2004). From structure to function. *In Protein Structure and Function.*

Raben N., Shea L., & Hill V. (2009). Monitoring Autophagy in Lysosomal Storage Disorders.

Raben N., Plotz P., & Byrne BJ. (2002). Acid alpha-glucosidase deficiency (glucogenosis type

Saito S., Ohno K., Sugawara K., & Sakuraba H. (2008). Structural and clinical implications

Sugawara K., Ohno K., Saito S., & Sakuraba H. (2008). Structural characterization of mutant alpha-galactosidases causing Fabry disease. *J. Hum. Genet.* Vol. 53:812–824. Sugawara K., Saito S., Sekijima M., Ohno K., Tajima Y., & Kroos M.A. (2009). Structural

Zvelebil M.J., Barton G.J., Taylor W.R., & Sternberg M.J. (1987). Prediction of Protein

mucopolysaccharidosis type VI. *Mol. Genet. Metab.* Vol. 93:419–425.

of amino acid substitutions in N-acetylgalactosamine-4-sulfatase: insight into

modeling of mutant -glucosidases resulting in a processing/transport defect in

Secondary Structure and Active Sites Using the Alignment of Homologous

Montreal, Quebec.

London; New Science Press.

*Methods in Enzymology* Vol. 453:417-449.

II, Pompe disease). *Curr Mol Med.* Vol. 2:145-166.

Pompe disease. *J. Hum. Genet.* Vol. 54:324-330.

Sequences. *J. Mol Biol.* Vol. 195(4):957-961.

Selkoe D.J. (2003). Folding proteins in fatal ways. *Nature* Vol. 426:900–904.

Ryanodine receptors (RyRs) are homotetrameric intracellular calcium release channels in the membranes of the endoplasmic (ER) and sarcoplasmic reticulum (SR) (George et al. 2005, Meissner 2002, 2004). Each subunit consists of ~5000 amino acid residues (George et al. 2005). There are three isoforms of the ryanodine receptor: the RyR1 isoform is expressed predominantly in skeletal muscle, the RyR2 isoform predominates in cardiac muscle, and the RyR3 isoform is expressed in a variety of tissues (Sorrentino 1995). In the mammalian heart, the RyR2 isoform is a principal component of the excitation-contraction (E-C) coupling process. Action potential depolarization of the cardiac cell results in injection of calcium ions into the cell via calcium channels (dihydropyridine receptors, DHPRs). This small calcium influx then drives the release of calcium from intracellular calcium stores by triggering the opening of RyR2 channels (Fabiato 1985). The released calcium causes contraction by binding to troponin C (Ebashi and Ogawa 1988). Consequently, precise regulation of RyR activity during heartbeat is essential to proper cardiac function.

In several cardiac diseases, such as heart failure and the genetic diseases CPVT (catecholaminergic polymorphic ventricular tachycardia) and ARVD2 (arrhythmogenic right ventricular dysplasia), the function of RyR is compromised. In heart failure, the release of calcium in response to the action potential is decreased, while RyR remains more active during diastole (Durham et al. 2007, Yano et al. 2006). In CPVT and ARVD2, RyRs contain mutations that lead to altered RyR activity, which may result in premature calcium release in the absence of an action potential (Durham et al. 2007, Yano et al. 2006).

In this work we present a bioinformatics analysis of the whole of human RyR2 (hRyR2) in the context of the available functional information, in order to locate individual domains for further biochemical and structural studies. The reliability of the predictions in the N-terminal region (Bauerova-Hlinkova et al. 2010) was verified experimentally by expressing and characterizing the domains identified. We also describe the results of a CD-spectroscopy study we carried out in order to determine the domain organization of RyR2; this study also included an analysis of the secondary structure elements of the N-terminal part of RyR2. Finally, we present a homology model of the N-terminal part of RyR2 which is based on the recently determined X-ray structure of rabbit RyR1. The amino-acid sequence identity of these two proteins is more than 80%, which suggests that predictions made from this model will most likely be reliable. The homology model agrees with our bioinformatics analysis and also with the results of our CD-spectroscopy study. This model should help to locate and identify the mutations and the residues in their proximity that are responsible for the cardiac diseases CPVT1 and ARVD2.

<sup>\*</sup> J. Bauer1, E. Hostinová1, J. Gašperík1, K. Beck3, Ľ. Borko1, A. Faltínová2, A. Zahradníková2 and J. Ševčík1

*<sup>2</sup>Institute of Molecular Physiology and Genetics, Slovak Academy of Sciences, Bratislava, Slovakia; <sup>3</sup>School of Dentistry, Cardiff University, Heath Park, Cardiff, Wales, UK*

#### **2. Physiological function of RyR2**

Calcium release from intracellular stores is mediated by two types of calcium release channels — ryanodine receptors (RyRs) and inositol trisphosphate receptors (IP3Rs) (Berridge 1994). These channels are expressed in most tissues. RyRs play a primary role in skeletal and cardiac muscle cells, where they mediate muscle contraction. In these tissues IP3Rs play only a modulatory role. In contrast, smooth muscle, neurons and non-excitable tissues rely on IP3R to play the primary role in calcium release, while RyRs play a modulatory role (Berridge 1994). The study of RyRs is therefore mostly concerned with understanding their role in the activation of skeletal and cardiac muscle contraction.

#### **2.1 Ion permeation**

The physiological role of the RyR is to allow permeation of Ca2+ ions from the lumen of the SR into the cytosol. Although the calcium gradient between these two compartments is high (the diastolic free Ca2+ concentration is only ~100 nM in the cytosol but ~1–2 mM in the SR lumen (Shannon et al. 2003)), concentrations of other ions are much larger (~150 mM K+) or comparable (~1 mM Mg2+) on both sides of the SR membrane. During the systole, >70% of the Ca2+ ions are released from the SR in several milliseconds. Therefore the conductance, which determines the rate of transport, and the permeability of the channel for Ca2+ ions, which enables selective transport of Ca2+ in the presence of high concentrations of other ions, have to be sufficiently high.

The properties of RyRs have been investigated in planar bilayers by fusion of SR vesicles (Meissner and Henderson (1987), reviewed by Meissner (2002)) and by incorporation of purified ryanodine receptors (Lai et al. (1988), reviewed by Meissner (2002, 2004)) into bilayer membranes. RyRs are characterised by their high conductance and relatively low ion selectivity. Their monovalent cation conductance is very high (200–700 pS), highest for the RyR2 isoform and lowest for the RyR1 isoform, and they are half-saturated at ~10–50 mM concentrations for all monovalent cations. Despite their differences in conductance, their permeability to all monovalents is approximately equal. The channel is 6.5-fold more permeable for divalent than for monovalent ions, and their conductance (90– 200 pS) (Williams 1992) increases with the size of the divalent ion. Half-saturation is achieved at ~0.5 mM divalent ion concentration, i.e., at much lower concentrations than for monovalent ions. These properties of the RyR channel enable effective transport of Ca2+ ions. The large conductance ensures a sufficient transport rate, while the high calcium affinity ensures that under normal conditions, the rate of Ca2+ transport will be close to maximal. The unitary Ca2+ current under physiological conditions has been estimated to be 0.4–0.6 pA (Mejia-Alvarez et al. 1999, Kettlun et al. 2003). The permeation properties are conferred on the RyR by the amino acids forming the pore, which are close to the C-terminal end of the RyR (Du et al. 2001, Zhao et al. 1999), and where aa. GGIG were proposed to form the selectivity filter (Balshaw et al. 1999, Gao et al. 2000).
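To give a feel for what a unitary Ca2+ current of 0.4–0.6 pA means in terms of ion transport, the following minimal sketch converts a current into the number of divalent ions carried per millisecond. Only the current range comes from the text; the function name `ions_per_ms` is illustrative, and the calculation assumes a steady current carried entirely by Ca2+.

```python
# Minimal sketch: convert a unitary current (pA) into ions transported per millisecond.
# Assumption: the whole current is carried by Ca2+ (valence 2) and is constant in time.

ELEMENTARY_CHARGE_C = 1.602176634e-19  # coulombs per elementary charge (exact SI value)


def ions_per_ms(current_pa: float, valence: int = 2) -> float:
    """Return the number of ions carried per millisecond by a steady current in pA."""
    current_a = current_pa * 1e-12                  # pA -> A, i.e. coulombs per second
    charge_per_ion_c = valence * ELEMENTARY_CHARGE_C
    ions_per_second = current_a / charge_per_ion_c
    return ions_per_second * 1e-3                   # per second -> per millisecond


if __name__ == "__main__":
    for i_pa in (0.4, 0.5, 0.6):                    # range quoted in the text
        print(f"{i_pa:.1f} pA -> {ions_per_ms(i_pa):.0f} Ca2+ ions per ms")
```

Under these assumptions, a current of 0.5 pA corresponds to roughly 1,560 Ca2+ ions per millisecond through a single open channel.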

#### **2.2 Regulation by Ca2+**


Ca2+ ions are the most important regulator of RyR activity (Fabiato 1985). They act at several Ca2+ binding sites, leading to activation as well as inactivation of RyR. From the physiological point of view it is important to note that Mg2+ ions, present at millimolar concentrations in the cytosol and the SR lumen, are also capable of binding to all RyR Ca2+ binding sites. The existence of two activation sites and two inactivation sites has been proposed (Laver 2007, Laver 2009).

#### **2.2.1 Cytosolic activation**

Cytosolic Ca2+ is the physiological activator of the RyR2 and RyR3 isoforms and contributes to the activation of the RyR1 isoform. In the cardiac myocyte, diastolic Ca2+ is ~ 50–100 nM (Baartscheer et al. 1998, Kagaya et al. 1995). Ca2+ ions activate RyR channels in the concentration range relevant for excitation-contraction coupling (0.3–100 µM). The probability that RyR channels are open in the absence of other modulators increases with increasing calcium concentration with half-activation at ~1 µM (Chu et al. 1993, Coronado et al. 1994, Gyorke et al. 1994, Meissner 2004, Zahradnikova et al. 1999).
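The dependence of open probability on cytosolic Ca2+ described above is often summarised with a Hill-type curve. The sketch below is only an illustration under stated assumptions: the half-activation concentration of about 1 µM is taken from the text, whereas the Hill coefficient `hill_n` and the maximal open probability `po_max` are hypothetical placeholder values, not measurements reported in this chapter.

```python
# Minimal sketch: Hill-type steady-state activation of RyR by cytosolic Ca2+.
# k_half_um (~1 µM) follows the text; hill_n and po_max are illustrative assumptions.

def open_probability(ca_um: float,
                     k_half_um: float = 1.0,
                     hill_n: float = 2.0,
                     po_max: float = 1.0) -> float:
    """Return the open probability at a cytosolic Ca2+ concentration given in µM."""
    if ca_um <= 0:
        return 0.0
    x = (ca_um / k_half_um) ** hill_n
    return po_max * x / (1.0 + x)


if __name__ == "__main__":
    # Diastolic level, lower and upper end of the E-C coupling range, and half-activation.
    for ca in (0.1, 0.3, 1.0, 100.0):
        print(f"[Ca2+] = {ca:6.1f} µM -> Po = {open_probability(ca):.3f}")
```

With these placeholder parameters the curve is half-maximal at 1 µM by construction; a real analysis would fit `hill_n` and `po_max` to single-channel data.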

The time course of both RyR activation (Gyorke and Fill 1993, Schiefer et al. 1995, Zahradnikova and Zahradnik 1999, Zahradnikova et al. 1999, Zahradnikova et al. 2003) and deactivation (Schiefer et al. 1995, Velez et al. 1997) is very rapid; the activation rate is dependent on Ca2+ concentration (Schiefer et al. 1995, Zahradnikova et al. 1999) while the deactivation rate is not (Schiefer et al. 1995). The fast activation and deactivation kinetics should allow RyRs to respond to physiological calcium signals that last only a few milliseconds. The response of the ryanodine receptor to rapid and brief calcium elevations, mimicking physiological stimuli, has shown that RyR has several Ca2+ binding sites. In wildtype RyRs, binding of at least 4 Ca2+ ions precedes channel activation (Zahradnikova et al. 1999). RyR channels containing subunits mutated in the putative Ca2+ binding sites are less sensitive to activation by cytosolic Ca2+ (Li and Chen 2001). Analysis of the calcium dependence of RyR tetramers containing both wild-type and mutated monomers confirmed the presence of a single Ca2+ binding site on each of the monomers and revealed that activation by Ca2+ proceeds by allosteric interaction between Ca2+ binding and channel opening (Zahradnik et al. 2005). The cytosolic Ca2+ binding activation site is located in the C-terminal part of the channel (Chen et al. 1998, Li and Chen 2001). The C-terminal part of the RyR sequence (amino acids 3661–5037) is capable of forming an ion channel that can be activated by calcium (Bhat et al. 1997, Xu et al. 2000).

Due to competition between Ca2+ and Mg2+ ions, the apparent sensitivity of RyR2 channels *in situ* to activation by calcium is decreased about 10 times by Mg2+ binding to the activation site (Meissner 1994). The calcium dependence of *in situ* RyR activity enabled elucidation of the mechanism of the differences between the effect of Mg2+ and Ca2+ on RyR. While binding of Ca2+ to the activation site has a strong positively allosteric effect, the binding of Mg2+ has a weak negatively allosteric effect (Zahradnikova et al. 2010). At diastolic calcium concentrations, more than 75% of the cytosolic activation sites are occupied by Mg2+ (Zahradnikova et al. 2010). Since Mg2+ dissociation from this site is relatively slow (Zahradnikova et al. 2003, 2010), it limits the rate at which RyRs can respond to physiological Ca2+ elevations, which might play an important role in the physiological regulation of RyRs.

#### **2.2.6 Modulation by phosphorylation**

Ryanodine receptors have several conserved regions that are putative phosphorylation sites. Furthermore, kinases (PKA) and phosphatases (PP1 and PP2A) are directly attached to the channel, suggesting that regulation by phosphorylation may have physiological significance (Marx et al. 2001). The first phosphorylation site discovered on the RyR2 molecule (Witcher et al. 1991) was S2809 (S2843 in RyR1), which can be phosphorylated by the Ca2+/calmodulin-dependent protein kinase II (CaMKII) and to some extent also by PKA. Other kinases, such as PKG and PKC, affect RyRs as well (Takasago et al. 1991), e.g. by changing their ability to bind the alkaloid ryanodine. The mechanisms by which phosphorylation and dephosphorylation induce changes in RyR activity are not clearly understood, however. An increase in RyR activity due to an increase in calcium sensitivity (Marx and Marks 2002), as well as an increase in the rate of adaptation after a rapid calcium increase (Valdivia et al. 1995) have both been observed after phosphorylation.

#### **2.2.7 Interdomain interactions**

Most of the mutations that affect RyR2 function in CPVT and ARVD2 are located in either of four domains: the N-terminal domain, the central domain, the cytoplasmic I-domain, and the transmembrane domain. This clustering, as well as the similarity between the effects of mutations at different positions within these domains (hyperactivation of the channel and

#### **2.2.2 Luminal activation**

Calcium ions also affect RyR activity from the side of the SR lumen. At low cytosolic Ca2+ concentrations, when RyR activity is low, the presence of Ca2+ at the luminal side increases RyR activity by prolonging channel opening if calcium current flows from the lumen to the cytosol, i.e., by binding to the cytosolic calcium binding site (Laver 2007, Laver 2009, Xu and Meissner 1998). Luminal Ca2+ also affects RyR activity by binding to a luminal binding site which may be located either on the channel or on an associated protein (Gaburjakova and Gaburjakova 2006, Gyorke and Gyorke 1998, Gyorke et al. 2004, Qin et al. 2008). The action of luminal Ca2+ is complicated by the fact that the SR lumen contains a large amount of calsequestrin, a low-affinity Ca2+ buffer, which, in addition to its buffering effect, also interacts with RyR and modulates its activity (see below). In planar lipid bilayers, the effect of luminal Ca2+ on RyR activity can be explained by a combination of direct binding to the luminal activation site and of action at the cytosolic sites via "feed-through" of Ca2+ that passes through the channel pore to the cytosolic side of the channel (Laver 2007, 2009). Mg2+ inhibits the luminal effect of Ca2+ by competing with it at the luminal activation site (Laver and Honen 2008).

#### **2.2.3 Inactivation**

Their low sensitivity to Ca2+-induced inactivation distinguishes RyR2 and RyR3 from the skeletal isoform, which is half-inactivated by ~100 µM Ca2+ (Chu et al. 1993, Lamb 1993, Smith et al. 1988). The identity of the inactivating calcium binding site is unknown. Because of the differences between RyR1 on one hand and RyR2/RyR3 on the other, it is assumed that the differences in sensitivity to calcium-dependent inactivation may be partially due to the divergent region DR1, which differs between these RyR isoforms (Du et al. 2000). The inhibitory site has low specificity—the affinity of Mg2+ and Ca2+ to this binding site is similar (Gyorke and Gyorke 1998, Laver and Honen 2008, Laver et al. 1997, Xu et al. 1996).
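The consequence of a low-specificity site can be illustrated with a simple one-site competition model. This is a sketch under stated assumptions, not the authors' analysis: the function `site_occupancy` is hypothetical, both dissociation constants are set to the same illustrative value of 100 µM (borrowed from the half-inactivation concentration quoted above for the skeletal isoform), and the ion concentrations are the approximate diastolic Ca2+ and cytosolic Mg2+ levels mentioned earlier in the chapter.

```python
# Minimal sketch: two ions competing for one binding site at equilibrium.
# Assumption: identical dissociation constants for Ca2+ and Mg2+ ("low specificity");
# the 100 µM value is an illustrative placeholder, not a measured constant for this site.

def site_occupancy(ca_um: float, mg_um: float,
                   kd_ca_um: float, kd_mg_um: float) -> dict:
    """Return the fractions of sites occupied by Ca2+, by Mg2+, and left free."""
    ca_term = ca_um / kd_ca_um
    mg_term = mg_um / kd_mg_um
    denom = 1.0 + ca_term + mg_term
    return {"Ca": ca_term / denom, "Mg": mg_term / denom, "free": 1.0 / denom}


if __name__ == "__main__":
    occ = site_occupancy(ca_um=0.1, mg_um=1000.0, kd_ca_um=100.0, kd_mg_um=100.0)
    for species, fraction in occ.items():
        print(f"{species:>4}: {fraction:.4f}")
```

With equal affinities, occupancy is governed by concentration alone, so in this toy setting a site exposed to millimolar Mg2+ and sub-micromolar Ca2+ would be occupied almost exclusively by Mg2+.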

#### **2.2.4 Activation by ATP**

ATP increases the probability that the RyR channel will be open (EC50 = ~100 µM) without markedly affecting its calcium dependence (Xu et al. 1996). The activity of the RyR1 isoform is potentiated more strongly than that of RyR2 (Zimanyi and Pessah 1991). Although most ATP in the cell is present in the form of Mg·ATP, it seems that the activating species is free ATP2- (Copello et al. 2002). Other nucleosides are much less effective than ATP. ADP is a partial agonist with a lower affinity (EC50 = ~1 mM). Adenosine and adenine have a still lower effect. CTP, GTP, ITP and UTP do not activate the channel at all (Meissner 2002). The existence of the adenine ring is necessary for ATP binding, and the large effectiveness of channel activation by ATP appears to be caused by the presence of the negatively charged phosphate groups (Chan et al. 2000, 2003).

#### **2.2.5 Modulation by associated proteins**

Calmodulin (CaM) is a small Ca2+-binding protein that affects many enzymes, receptors and channels. CaM with four bound calcium ions and apo-CaM without bound Ca2+ ions have different conformations and therefore also different effects. CaM inhibits all RyR isoforms. Apo-CaM has a stimulatory effect on the activity of RyR1 and RyR3, but, depending on conditions, it either does not affect or inhibits the RyR2 isoform. The effects of CaM are mediated by its high-affinity binding to a binding site (amino acids 3583–3603 in RyR2) on each of the monomers, which is conserved in all RyR isoforms. The different CaM effects on the different isoforms are apparently due to differences in the isoforms in a region outside the CaM binding site (Meissner 2004). The locations of bound CaM and apo-CaM on the binding site are different (Meissner 2004), and a calcium-dependent physical relocation of CaM on the RyR molecule has also been observed by cryoelectron microscopy (cryo-EM) (Samso and Wagenknecht 2002). Modulation of RyR activity by calmodulin may therefore involve conformational changes in more distant parts of the RyR protein.

FKBP (FK506-binding protein) belongs to the immunophilins, cytosolic receptors for immunosuppressants such as rapamycin and FK506. Each RyR monomer contains a binding site for either FKBP12 (in RyR1) or FKBP12.6 (in RyR2). The interaction of FKBP with the channel stabilizes the protein complex and supports coordinated gating of all four subunits. It is thought that FKBP also plays a role in coupled gating of neighbouring RyR channels (Williams et al. 2001).

In addition to RyR channels, the SR membrane of terminal cisternae also contains the proteins triadin and junctin (Bers 2004) which associate with RyRs from the luminal side. In the lumen of the terminal cisterna there is also a large quantity of the protein calsequestrin (CSQ), a low-affinity calcium-binding protein that serves as a calcium buffer and also modulates RyR activity in a calcium-dependent manner (Bers 2004). CSQ most probably interacts with the RyR channel through interactions with triadin and junctin, so that for correct luminal regulation all three accessory proteins are necessary (Gyorke et al. 2004, Terentyev et al. 2008).

#### **2.2.6 Modulation by phosphorylation**

Ryanodine receptors have several conserved regions that are putative phosphorylation sites. Furthermore, kinases (PKA) and phosphatases (PP1 and PP2A) are directly attached to the channel, suggesting that regulation by phosphorylation may have physiological significance (Marx et al. 2001). The first phosphorylation site discovered on the RyR2 molecule (Witcher et al. 1991) was S2809 (S2843 in RyR1), which can be phosphorylated by the Ca2+/calmodulin-dependent protein kinase II (CaMKII) and to some extent also by PKA. Other kinases, such as PKG and PKC, affect RyRs as well (Takasago et al. 1991), e.g. by changing their ability to bind the alkaloid ryanodine. The mechanisms by which phosphorylation and dephosphorylation induce changes in RyR activity are not clearly understood, however. An increase in RyR activity due to an increase in calcium sensitivity (Marx and Marks 2002), as well as an increase in the rate of adaptation after a rapid calcium increase (Valdivia et al. 1995) have both been observed after phosphorylation.

#### **2.2.7 Interdomain interactions**

Most of the mutations that affect RyR2 function in CPVT and ARVD2 are located in one of four domains: the N-terminal domain, the central domain, the cytoplasmic I-domain, and the transmembrane domain. This clustering, as well as the similarity between the effects of mutations at different positions within these domains (hyperactivation of the channel and increased sensitivity to agonists), led to the postulation of the interdomain hypothesis (Yamamoto et al. 2000). It was hypothesized that there is an interaction between the N-terminal and the central domain that stabilizes the channel in the closed state. Mutations within these domains would then weaken this interdomain interaction (Ikemoto and Yamamoto 2000, Yamamoto et al. 2000), and these changes might play a key role in the regulatory mechanisms of the channel. The experimental strategy to test this hypothesis was based on the assumption that if an interaction between two regions functions as a regulatory mechanism, then a synthetic domain peptide with a sequence identical to that of one of the interacting regions should destabilize the interaction and disturb RyR regulation in a way similar to that observed with the mutations. Several domain peptides from the N-terminal and central regions of RyR1 or RyR2 (H163–S195, L590–C609 and L601–C620, L2442–P2477 and G2460–P2495, D2380–A2411) were indeed able to activate RyR and increase its sensitivity to agonists (El-Hayek et al. 1999, Faltinova et al. 2011, Tateishi et al. 2009, Yamamoto and Ikemoto 2002, Yang et al. 2006). The group of Yamamoto and Ikemoto further postulated, based on the results of George et al. (2004), that the interaction between the I-domain and another, undefined domain also plays a role in stabilizing the closed state of RyR2. A domain peptide from the I-domain (P4090–E4123) did indeed activate RyR2 (Tateishi et al. 2009). Furthermore, mutations in the N-terminal and central domains destabilized the interaction of RyR2 with its regulator calmodulin (Ono et al. 2010). Two RyR-stabilizing drugs, K201, which acts on the central domain, and dantrolene, which acts on the N-terminal domain, antagonized the effect of the central and N-terminal mutations, respectively, and restored calmodulin binding (Ono et al. 2010, Xu et al. 2010).

Transmission between the interdomain interaction, calmodulin binding, and channel opening is believed to be mediated by a calmodulin-like domain (residues 4064–4210 in RyR1; Xiong et al. 2006) and its interaction with the calmodulin binding domain of RyR (Ono et al. 2010). The effect of CPVT and ARVD2 mutations on the extent of interdomain interaction may be manifested as a change in the intrinsic opening tendency of the channel or in the allosteric effect between calcium binding and channel opening (Zahradnikova et al. 2010).

#### **2.3 Excitation-contraction coupling**

The RyRs are located at tubulo-reticular junctions, special calcium release sites, where they face the adjacent calcium channels (DHPRs) of the plasma membrane. The release sites in both skeletal and cardiac muscle contain clusters of RyRs and DHPRs, and their structures are quite similar. However, DHPRs and RyRs are positioned very precisely adjacent to one another in skeletal muscle but not in cardiac muscle (Protasi 2002). Therefore, in contrast to skeletal muscle, where RyR1 is activated by conformational changes in the DHPR protein (Rios and Brum 1987, Rios et al. 1993), in cardiac cells RyRs are activated by calcium ions that flow into the dyadic space during single-channel openings of DHPRs (Fabiato 1985).

Calcium sparks, calcium release events of individual calcium release units (Cheng et al. 1993), raise local cytosolic Ca2+ concentrations by ∼200 nM (Cannell et al. 1994, Cannell et al. 1995, Cheng et al. 1993, Lopez-Lopez et al. 1995). They can be discerned when release probability is low (Cheng et al. 1993). Although the rate of spontaneous sparks in a rat ventricular myocyte is low (about 100 s⁻¹) (Cheng et al. 1993), openings of L-type Ca2+ channels greatly increase the probability of spark occurrence during a voltage step depolarization (Cannell et al. 1995, Lopez-Lopez et al. 1995). Ca2+ sparks are stereotypical, i.e., their amplitudes and spatio-temporal properties appear to be independent of their trigger signal (Cannell et al. 1995, Lopez-Lopez et al. 1995). This indicates that the amplitude and time course of a Ca2+ spark are largely governed by the properties of the participating RyRs. Thus, Ca2+ sparks can be considered to be the elementary Ca2+ release events underlying E-C coupling, and gradation of calcium release in response to ICa can be explained by the summation of a variable number of activated Ca2+ sparks (Cheng et al. 1996).

Under normal conditions, Ca2+ sparks are unable to activate additional Ca2+ sparks in adjacent regions, although the local free Ca2+ concentration associated with a Ca2+ spark is much larger than the global increase in free Ca2+ concentration produced by Ca2+ current activation (Cannell et al. 1995; but see Parker et al. 1996). This observation can be explained by the fact that SR Ca2+ release channels are situated very close to the L-type Ca2+ channel (the "local control model"; Stern 1992), where they sense the >100-fold increase in local free Ca2+ concentration upon opening of a nearby L-type Ca2+ channel. The sensitivity of this local control is most clearly seen in the evidence that a single DHPR channel opening can elicit a Ca2+ spark (Cannell et al. 1995, Lopez-Lopez et al. 1995, Santana et al. 1996). A detailed analysis revealed that the probability of Ca2+ spark activation depends on the square of the single sarcolemmal Ca2+ channel current and on the square of the local free Ca2+ concentration (Santana et al. 1996) and that it is quite sensitive to the duration of the DHPR openings (Zahradnikova et al. 1999). Theoretical calculations (Cannell and Soeller 1997) as well as experimental data (Zahradnikova et al. 1999) suggest that RyRs can respond rapidly to calcium influx via DHPRs; the responsiveness of RyRs *in situ* depends on the geometrical arrangement of channels in the narrow dyadic space, in which high Ca2+ levels rapidly develop near the SR membrane. Such high Ca2+ levels lead to rapid binding of calcium to the calcium-sensitive sites of the RyR, thereby reducing the latency between DHPR opening and SR calcium release to ~1 ms, which corresponds to the experimentally observed DHPR mean open times. However, direct measurements of the latency between long (~20-ms) calcium channel openings and spark activation (Wang et al. 2001) indicated a much longer latency (~7 ms). Since the binding of Mg2+ (Zahradnikova et al. 2003, 2010) was not included in the model of Cannell and Soeller (1997), it may be inferred that only RyR channels that do not have Mg2+ bound are able to respond rapidly to DHPR openings (Zahradnikova et al. 2003, 2010).
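The quadratic dependence reported above can be illustrated with a minimal sketch. It assumes a pure power law with no saturation, which is a simplification made here for illustration and not a model taken from the cited studies; the function name and the `exponent` parameter are hypothetical.

```python
def relative_activation_probability(trigger_ratio: float, exponent: float = 2.0) -> float:
    """Fold change in spark activation probability for a given fold change
    in the trigger (single-channel Ca2+ current or local free Ca2+).

    Assumption: a pure power law, P proportional to trigger**exponent,
    with exponent = 2 as reported in the text. Saturation is ignored.
    """
    if trigger_ratio < 0:
        raise ValueError("trigger_ratio must be non-negative")
    return trigger_ratio ** exponent


if __name__ == "__main__":
    # Halving or doubling the trigger changes the probability 4-fold.
    for ratio in (0.5, 1.0, 2.0):
        fold = relative_activation_probability(ratio)
        print(f"trigger x{ratio:.1f} -> probability x{fold:.2f}")
```

Under this assumption, a modest reduction of the unitary Ca2+ current has a disproportionately large effect on the likelihood of triggering a spark.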

It is not clear how many RyR channels contribute to a Ca2+ spark. While there are 5-9 RyRs per DHPR in cardiac myocytes, individual Ca2+ sparks may reflect activation of up to 20 RyRs (Bridge et al. 1999, Lukyanenko et al. 2000). This means that the majority of RyRs are activated by neighbouring RyRs and not by adjacent DHPR channels. A quantitative model of spark activation based on the allosteric action of Ca2+ and Mg2+ on RyR2 opening (Zahradnikova et al. 2010) predicts the opening of 1–8 RyR2 channels per spark, in agreement with experimental findings (Wang et al. 2001).

#### **3. Structure of ryanodine receptors**

#### **3.1 Overall structural features determined by cryoelectron microscopy (cryo-EM)**

RyR channels are the largest ion channels known so far, which makes structural studies of them very challenging. So far, the structure of a whole RyR has been determined only by cryo-EM. Most of these studies, including sub-nanometer EM (Samso et al. 2009, Serysheva et al. 2008), have focused on the skeletal RyR1 isoform, but some studies have been performed on RyR2 (Liu et al. 2002, Sharma et al. 1998) and RyR3 (Liu et al. 2001, Sharma et al. 2000) as well. In agreement with the high sequence homology of RyRs, which reaches ~65%, EM structures of all three isoforms are very similar (Wagenknecht and Samso 2002). The RyR2 (Sharma et al. 1998) and RyR3 (Sharma et al. 2000) isoforms differ slightly from the RyR1 isoform in several structural domains (called divergent regions) that map to segments of reduced homology between the RyR isoforms (Zhang et al. 2003), though the overall structure of the RyR isoforms is very well conserved (Wagenknecht and Samso 2002). The complete channel is a tetramer of four monomers with a fourfold axis of symmetry. The divergent regions have been suggested to play a role in calcium-dependent inactivation (DR1) (Du and Maclennan 1999), in signal transmission between DHPR and RyR1 (DR2) (Perez et al. 2003), and in conformational changes of the RyR and RyR-RyR interactions (DR3) (Zhang et al. 2003).

Each subunit of the RyR homotetramer consists of two main parts: a cytoplasmic part and a transmembrane part. The cytoplasmic part of the whole receptor, also called the "foot", is very large (280 × 280 × 120 Å) and interacts with many modulators which affect channel gating (Yano et al. 2006). It is composed of several structural segments: the clamp and the handle at the perimeter, the central rim surrounding the putative pore, and the column connecting the cytoplasmic and transmembrane parts (Fig. 1). These segments have been further subdivided into 15 subdomains (Lanner et al. 2010). The clamps are located at the corners of the cytoplasmic part and are likely to participate in intermolecular interactions with neighboring RyR molecules and other RyR modulators. They undergo large changes during the opening and closing of the channel (Orlova et al. 1996, Samso et al. 2009, Serysheva et al. 1999). Like the cytoplasmic domain, the transmembrane part undergoes large structural changes during the opening and closing of the RyR2 channel (Orlova et al. 1996, Samso et al. 2009, Serysheva et al. 1999).

#### **3.2 Bioinformatics domain prediction**

To find the putative individual structural entities of hRyR2, we analyzed the whole hRyR2 amino acid sequence using the PFAM domain database (Finn et al. 2008). Fourteen probable domains were found in the hRyR2 monomer. Eight of them are located in the N-terminal region (residues 1–~1561) and were identified as Ins145\_P3\_rec, MIR, RIH, SPRY and two RyR domains. The central region (residues 1562–3000) contains the RIH domain and two RyR domains. The C-terminal part (residues 3001–4995) contains an RIH-associated domain, an RR\_TM4-6 domain and an Ion\_Trans domain (Fig. 2). In Fig. 2, the beginning and end of each domain are numbered according to the PFAM search results, and mutations of specific residues involved in ARVD2 and CPVT1 are shown (Yano et al. 2006; www.fsm.it/cardmoc). LIZ1-3 (amino acid residues 554–585, 1603–1631, and 3003–3039) represent leucine-isoleucine zipper areas. The proposed binding partners, SPI, PR130, mKAP, PP1, PP2A, PKA, D, K201, FKBP, and CaM represent spinophilin, protein 130, muscle specific kinase anchoring protein, protein phosphatases 1 and 2A, protein kinase A, dantrolene, 1,4-benzothiazepine derivative K201 (JTV519), FK-binding protein, and calmodulin, respectively. Binding partners for hRyR2 and their positions were adapted from Wang et al. (2011), Yamamoto et al. (2008), and Yano et al. (2006).
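As an illustration of how the region boundaries quoted above might be used, the following sketch looks up the region and leucine-isoleucine zipper that contain a given hRyR2 residue. It is not the PFAM search itself; the data structures and the function names `find_interval` and `annotate` are hypothetical, and only the residue ranges come from the text (the N-terminal boundary is approximate there).

```python
from typing import List, Optional, Tuple

Interval = Tuple[str, int, int]  # (name, first residue, last residue), inclusive

# Region boundaries as quoted in the text; the end of the
# N-terminal region (~1561) is given there only approximately.
REGIONS: List[Interval] = [
    ("N-terminal", 1, 1561),
    ("central", 1562, 3000),
    ("C-terminal", 3001, 4995),
]

# Leucine-isoleucine zipper segments LIZ1-3 as quoted in the text.
LIZ_MOTIFS: List[Interval] = [
    ("LIZ1", 554, 585),
    ("LIZ2", 1603, 1631),
    ("LIZ3", 3003, 3039),
]


def find_interval(residue: int, intervals: List[Interval]) -> Optional[str]:
    """Return the name of the first interval containing the residue, if any."""
    for name, start, end in intervals:
        if start <= residue <= end:
            return name
    return None


def annotate(residue: int) -> dict:
    """Report which region and which LIZ motif (if any) contain a residue."""
    return {
        "residue": residue,
        "region": find_interval(residue, REGIONS),
        "liz": find_interval(residue, LIZ_MOTIFS),
    }


if __name__ == "__main__":
    for position in (560, 1610, 2000, 3010):
        print(annotate(position))
```

The same lookup could be extended with the PFAM domain boundaries from Fig. 2 once they are available as residue ranges.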


Fig. 1. A cryo-EM density map of RyR1 (accession no. 1274, http://www.ebi.ac.uk/emdbsrv/atlas/1274\_summary.html) in the closed state. A. Cytoplasmic view. B. Side view. The cytoplasmic (clamp, handle, central rim and column) and transmembrane regions (TM) are indicated.

Fig. 2. Analysis of the human cardiac ryanodine receptor using the PFAM database. The position and type of mutations resulting in CPVT or ARVD2, as reported at www.fsm.it/cardmoc/, are indicated above the sequence regions.

#### **3.4.2 The MIR domain**

MIR domains are found in a number of proteins (George et al. 2007, Hamada et al. 1996, Strahl-Bolsinger and Scheinost 1999), including IP3Rs (Bosanac et al. 2005, Bosanac et al. 2002) and RyRs (Amador et al. 2009, Lobo and Van Petegem 2009, Ponting 2000, Tung et al. 2010). They usually consist of several ~50-residue MIR motifs with a β-trefoil fold (Murzin et al. 1992), and form β-barrel structures with hairpin triplets and internal pseudo-threefold symmetry (Bosanac et al. 2005). In RyR1, the MIR domain is equivalent to the second N-terminal domain, domain B (Tung et al. 2010). In PMT1 mannosyltransferases, MIR motifs are located in the luminal loops of the enzyme and are essential for transferase activity (Strahl-Bolsinger and Scheinost 1999). In IP3R, the first two of the β-trefoil motifs were found to belong to the suppressor region (Ins145\_P3\_rec; Bosanac et al. 2002; see Section 3.4.1), while the latter two (parts of the MIR domain) belong to the ligand binding region (Bosanac et al. 2005). Similar β-trefoil motifs were predicted to be present in the N-terminal region of the RyR1 isoform (Bosanac et al. 2002), and were later found in its crystal structure (Amador et al. 2009), although the sequence similarity between IP3R and RyR is relatively low. PFAM predicted this domain to lie between residues 226–406 of RyR2, and in the RyR1 structure it corresponds to the equivalent RyR2 amino acids 228–411. This region contains a large number of CPVT/ARVD2 mutations (Fig. 2) (http://www.fsm.it/cardmoc/), which belong to the first mutation cluster CPVT-I (George et al. 2007). Docking of the ligand-binding domain of IP3R (PDB ID 1N4K, Serysheva et al. 2005; PDB IDs 1XZZ and 1N4K, Serysheva et al. 2008) into the cryo-EM structure of RyR1 predicts that this domain will lie in the clamp region, whereas a similar docking using the X-ray structure of the N-terminal sequence of RyR1 (Tung et al. 2010) indicates that the MIR domain should lie on the central rim.

#### **3.4.3 The RIH domains**

Two RIH domains were found in both RyRs and IP3Rs. The X-ray structure of a major part of the first RIH domain of both IP3R (Bosanac et al. 2002) and RyR1 (Tung et al. 2010) has been determined. In the case of RyR1, this corresponds to domain C and the structure extends out to residue 532, which corresponds to residue 543 in hRyR2. Structurally, RIH is composed of α-helices. In IP3R, this domain forms the binding site for inositol 1,4,5-

#### **3.3 X-ray analysis**

To date, X-ray structures have been determined for several N-terminal fragments of rabbit RyR1 (residues ~12–210 (Amador et al. 2009, Lobo and Van Petegem 2009); residues 12–532 (Tung et al. 2010)); and murine RyR2 (residues 12–217, wt and mutants A77V and V186M) (Lobo and Van Petegem 2009). The overall structures of all isoforms, including those containing mutations, are very similar (superposition results in r.m.s.d. of 0.69 Å for 150 C<sup>α</sup> atoms), Fig. 3, which is not surprising due to their close sequence homology and physiological function. The longest fragment, residues 12–532, is composed of three structural domains, which have been designated as A (1–205), B (206–394) and C (395–532) (Tung et al. 2010). Domains B and C are homologous respectively with the β-trefoil and αhelical domains of the IP3R binding core. Domain A is homologous with the IP3 binding suppressor domain of IP3R (Yuchi and Van Petegem 2011). The central motif of domains A and B is a β-trefoil core consisting of 12 β-strands which are held together by hydrophobic interactions. In the A domain, a 10-residue α-helix is inserted between strands β4 and β5 (Lobo and Van Petegem 2009). Domain C consists of five α-helices. Most of the secondary structure elements are connected by flexible loops, which were proposed to be located at the interfaces with other RyR domains or at the interfaces with proteins interacting with RyR (Tung et al. 2010, Yano et al. 2006). The X-ray crystal structures allowed the precise mapping of several mutations which are associated with CPVT and ARVD2 (Lobo and Van Petegem 2009), as well as the homologous mutations in RyR1 which are responsible for malignant hyperthermia (MH) and central core disease (CCD) (Tung et al. 2010).

Fig. 3. Comparison of X-ray structures of the N-terminal domains of RyR1 (PDB ID 2XOA green, 3HSM pink, 3ILA magenta) and RyR2 (PDB ID 3IM5 blue, 3IM6 yellow, 3IM7 purple). The superposition was performed using the program Multiprot (Shatsky et al. 2004) over 150 C<sup>α</sup> atoms, with an r.m.s.d. of 0.69 Å. Arrows indicate flexible loops.
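The r.m.s.d. quoted for the superposition is the root-mean-square distance between paired C<sup>α</sup> atoms. The following minimal sketch shows only that final calculation; it assumes the two coordinate sets have already been superposed and paired (the step performed by the superposition program), and the coordinates are invented for illustration rather than taken from any PDB entry.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired lists of (x, y, z) points.

    Assumes the structures are already superposed and that atom i of
    coords_a corresponds to atom i of coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Invented toy coordinates (in Å): three paired atoms, each displaced by 1 Å.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
mobile = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]
print(round(rmsd(reference, mobile), 2))
```

A value well below 1 Å over 150 atoms, as reported here, means that the paired backbone positions nearly coincide.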

#### **3.4 Relationships between domain structure, 3D structure and RyR function**

#### **3.4.1 The Ins145\_P3\_rec domain**

The Ins145\_P3\_rec domain is found in RyRs and IP3Rs (Ponting 2000). In IP3R, it participates in forming the ligand binding suppressor region (Bosanac et al. 2005). In the structure of RyR1, it is equivalent to the first N-terminal domain, domain A (Tung et al. 2010). The PFAM prediction assigned this domain to residues 9–223, while in the RyR1 structure it corresponds to the equivalent RyR2 amino acids 12–228. This domain contains two closely spaced clusters of mutations associated with CPVT and ARVD in the regions of residues 62–81 and 164–189 (http://www.fsm.it/cardmoc), which belong to the first mutation cluster CPVT-I (George et al. 2007). The domain consists of 12 β-strands arranged into a β-trefoil motif (see below). Docking into the cryo-EM density map of RyR1 places this domain in the clamp when the ligand-binding suppressor domain of IP3R is used (PDB ID 1XZZ; Serysheva et al. 2008), and on the central rim of the ion channel when the X-ray structure of the N-terminal portion of RyR1 is used (Tung et al. 2010).
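Statements such as "corresponds to the equivalent RyR2 amino acids" rest on a pairwise alignment of the RyR1 and RyR2 sequences, in which insertions and deletions shift the residue numbering. A minimal sketch of how such a correspondence can be read off an alignment is given below; the function name and the two short aligned strings are hypothetical and are not the real RyR sequences.

```python
def map_residue_numbers(template_aln, target_aln, template_start=1, target_start=1):
    """Map template residue numbers to target residue numbers.

    Both arguments are rows of the same pairwise alignment ('-' marks a gap).
    Only columns in which both sequences have a residue are mapped.
    """
    if len(template_aln) != len(target_aln):
        raise ValueError("aligned sequences must have the same length")
    mapping = {}
    template_number = template_start - 1
    target_number = target_start - 1
    for template_char, target_char in zip(template_aln, target_aln):
        if template_char != "-":
            template_number += 1
        if target_char != "-":
            target_number += 1
        if template_char != "-" and target_char != "-":
            mapping[template_number] = target_number
    return mapping


# Invented toy alignment: the target has a two-residue insertion and a
# one-residue deletion relative to the template.
template_row = "MGDG--EDEVQF"
target_row = "MADGGEGEDE-F"
numbering = map_residue_numbers(template_row, target_row)
print(numbering)
```

In this toy case template residue 5 maps to target residue 7 because of the insertion, which is the same kind of offset that makes a given RyR1 residue number differ from its hRyR2 equivalent.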

#### **3.4.2 The MIR domain**

MIR domains are found in a number of proteins (George et al. 2007, Hamada et al. 1996, Strahl-Bolsinger and Scheinost 1999), including IP3Rs (Bosanac et al. 2005, Bosanac et al. 2002) and RyRs (Amador et al. 2009, Lobo and Van Petegem 2009, Ponting 2000, Tung et al. 2010). They usually consist of several ~50-residue MIR motifs with a β-trefoil fold (Murzin et al. 1992), and form β-barrel structures with hairpin triplets and internal pseudo-threefold symmetry (Bosanac et al. 2005). In RyR1, the MIR domain is equivalent to the second N-terminal domain, domain B (Tung et al. 2010). In PMT1 mannosyltransferases, MIR motifs are located in the luminal loops of the enzyme and are essential for transferase activity (Strahl-Bolsinger and Scheinost 1999). In IP3R, the first two of the β-trefoil motifs were found to belong to the suppressor region (Ins145\_P3\_rec; Bosanac et al. 2002, see above), while the latter two (parts of the MIR domain) belong to the ligand binding region (Bosanac et al. 2005). Similar β-trefoil motifs were predicted to be present in the N-terminal region of the RyR1 isoform (Bosanac et al. 2002), and were later found in its crystal structure (Amador et al. 2009), although the sequence similarity between IP3R and RyR is relatively low. PFAM predicted this domain to lie between residues 226–406 of RyR2, and in the RyR1 structure it corresponds to the equivalent RyR2 amino acids 228–411. This region contains a large number of CPVT/ARVD2 mutations (Fig. 2) (http://www.fsm.it/cardmoc/), which belong to the first mutation cluster CPVT-I (George et al. 2007). Docking of the ligand-binding domain of IP3R (PDB ID 1N4K, (Serysheva et al. 2005); PDB ID 1XZZ and PDB ID 1N4K, (Serysheva et al. 2008)) into the cryo-EM structure of RyR1 predicts that this domain lies in the clamp region, whereas a similar docking using the X-ray structure of the N-terminal sequence of RyR1 (Tung et al. 2010) indicates that the MIR domain should lie on the central rim.

#### **3.4.3 The RIH domains**

Two RIH domains were found in both RyRs and IP3Rs. The X-ray structure of a major part of the first RIH domain has been determined for both IP3R (Bosanac et al. 2002) and RyR1 (Tung et al. 2010). In RyR1, this corresponds to domain C, and the structure extends to residue 532, which corresponds to residue 543 in hRyR2. Structurally, RIH is composed of α-helices. In IP3R, this domain forms the binding site for inositol 1,4,5-trisphosphate (Bosanac et al. 2002). By superposing the ligand suppressor domain (PDB ID 1XZZ) and the ligand binding core (PDB ID 1N4K) on the N-terminal part of RyR1 (PDB ID 2XOA), it was proposed that all three domains of IP3R, i.e. Ins145\_P3\_rec, MIR and RIH, interact with one another, as predicted by Chan et al. (2007), and are arranged similarly to those in the N-terminal part of RyR1 (Yuchi and Van Petegem 2011). The PFAM prediction placed this domain between residues 451–655, while in the RyR1 structure its beginning corresponds to the equivalent RyR2 amino acid 410. In RyR2, the RIH domain was reported to contain a leucine–isoleucine zipper between amino acid residues 554 and 585 that mediates binding of the phosphatase PP1 via the spinophilin targeting protein (Marx et al. 2001). This domain was also proposed to contain the binding site for dantrolene (residue 626 in hRyR2), and it contains several of the CPVT/ARVD2 mutations (http://www.fsm.it/cardmoc/), which belong to the first mutation cluster CPVT-I (George et al. 2007). It was located in the clamp region by cryo-EM (Wang et al. 2011). However, docking of the X-ray structure of the N-terminal sequence of RyR1 into the cryo-EM structure of RyR1 predicts that this domain lies in the central rim (Tung et al. 2010).

PFAM predicted that the second RIH domain lies between residues 2121–2332. This domain and its C-terminally adjacent region contain the central cluster of CPVT/ARVD mutations (CPVT-II, (George et al. 2007), http://www.fsm.it/cardmoc/) and are flanked by putative binding sites for the protein FKBP12.6. Cryo-EM places this region in the clamp (Liu et al. 2005, Wang et al. 2011).

#### **3.4.4 The SPRY domains**

The SPRY domain (**sp**lA kinase and the **ry**anodine receptors) (Ponting et al. 1997) structurally consists of several antiparallel β-strands connected by flexible loops. The precise function of the SPRY domain (and the related B30.2 domain) is unknown; however, it is believed to act as a protein–protein interaction module capable of binding multiple targets by recognizing the conformation of a partner protein rather than a consensus sequence motif (Woo et al. 2006, Yao et al. 2006). The B30.2/SPRY domain has been identified in numerous and diverse proteins across bacterial and eukaryotic species (e.g. pyrin/marenostrin and other butyrophilin-like homologues, ryanodine receptors and midin1), including over 150 proteins in humans (Rhodes et al. 2005), suggesting that the specific function of the B30.2/SPRY domain within a given protein may rely heavily on the other domains in its neighbourhood (Kleiber and Singh 2009). PFAM predicted that RyR2 contains three SPRY domains, corresponding to residues 670–808, 1098–1221, and 1433–1561. The first two domains contain three CPVT/ARVD mutations, R739H, T1107M and A1136V (Medeiros-Domingo et al. 2009; Fig. 2), which lie outside of the four mutation clusters.

#### **3.4.5 The RyR domains**

Four copies of the RyR domain are present in the ryanodine receptor, of which two belong to the N-terminal and two to the central regions. The function of this domain is unknown (Ponting 2000). In the second RyR domain, two isolated CPVT mutations, which lie outside of the four mutation clusters, have been found to date: R1013Q and R1051P, Fig. 2 (Marjamaa et al. 2009, Medeiros-Domingo et al. 2009).

#### **3.4.6 The RIH\_associated and RR\_TM4-6 domains**

According to PFAM, the RIH\_associated domain should lie between residues 3465–3960. This domain contains the calmodulin binding site, which was localized to the column region of RyR according to cryo-EM (Samso and Wagenknecht 2002). The adjacent RR\_TM4-6 domain was predicted to lie between residues 4333–4599 by PFAM. It contains the divergent region DR1 (Liu et al. 2002), the putative calcium sensor that is responsible for the physiological activation of RyR2 (Chen et al. 1998, Li and Chen 2001), and also the calmodulin-like domain (Xiong et al. 2006). The end of the RIH\_associated domain together with the RR\_TM4-6 domain (aa. 3722–4610) were identified as a separate functional domain (called the I-domain) (George et al. 2004), which contains a third cluster of CPVT/ARVD mutations (CPVT-III, (George et al. 2007), http://www.fsm.it/cardmoc/); cryo-EM locates this domain in the column of RyR1 (Wang et al. 2011).

#### **3.4.7 The Ion\_Trans domain**

The Ion\_Trans (ion transport) domain covers the transmembrane region of the ryanodine receptor. This domain is found in most voltage-dependent ion channels as well as in RyRs and IP3Rs. The domain usually has six transmembrane helices, the final two of which flank a loop that determines ion selectivity (Unnerstale et al. 2009). The tetrameric ion channels (potassium channels, IP3Rs and RyRs) contain one Ion\_Trans domain per monomer, while the sodium and calcium channels contain four Ion\_Trans domain repeats. This domain is located in the transmembrane region, embedded in the membrane of the SR, Fig. 1A,B, and contains the ion conducting pore. It has been proposed that the pore includes the GVRAGGGIGD amino acid sequence (Du et al. 2001, Zhao et al. 1999), where the amino acids GGIG were proposed to form a selectivity filter (Balshaw et al. 1999, Gao et al. 2000). The fourth cluster of CPVT/ARVD2 mutations (CPVT-IV, (George et al. 2007), http://www.fsm.it/cardmoc/) occurs in this domain and in the flanking regions on both its sides.

#### **3.4.8 Regions without a known domain structure**

Two of the three regions of isoform sequence diversity (DR2 and DR3; Perez et al. 2003, Zhang et al. 2003) are located outside of the PFAM-predicted domains. DR2 is located between the second and the third SPRY domains (residues 1353–1397; Liu et al. 2004), while DR3 (residues 1852–1890) is located between SPRY3 and RIH2. Both DR2 and DR3 are found in the clamp region of RyR1 (Wang et al. 2011). DR3 contains two isolated CPVT/ARVD2 mutations, G1885E and G1886S (Milting et al. 2006), which lie outside of the four mutation clusters.

The second and third leucine–isoleucine zippers (aa. 1604–1644 and 3004–3041), which were found to bind PP2A and PKA with the help of the adapter proteins PR130 and mAKAP, respectively (Marx et al. 2001), are also located outside the PFAM-predicted domains. The location of these motifs in the 3D structure of RyR2 is unknown.
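The residue ranges quoted in sections 3.4.1–3.4.8 can be collected into a simple lookup that reports which predicted domain, if any, contains a given mutation. The sketch below is illustrative only: the data structures and function names are hypothetical, and the boundaries of the four RyR domains and of the Ion\_Trans domain are not quoted in this section and are therefore omitted, so a position reported as unassigned here is not necessarily outside every PFAM domain.

```python
import re

# PFAM-predicted hRyR2 domain boundaries as quoted in the text.
# RyR and Ion_Trans domains are omitted because no ranges are given.
PFAM_DOMAINS = [
    ("Ins145_P3_rec", 9, 223),
    ("MIR", 226, 406),
    ("RIH (first)", 451, 655),
    ("SPRY (first)", 670, 808),
    ("SPRY (second)", 1098, 1221),
    ("SPRY (third)", 1433, 1561),
    ("RIH (second)", 2121, 2332),
    ("RIH_associated", 3465, 3960),
    ("RR_TM4-6", 4333, 4599),
]

# Regions described in the text as lying outside the PFAM-predicted domains.
OTHER_REGIONS = [
    ("DR2", 1353, 1397),
    ("leucine-isoleucine zipper (second)", 1604, 1644),
    ("DR3", 1852, 1890),
    ("leucine-isoleucine zipper (third)", 3004, 3041),
]


def residue_number(mutation):
    """Extract the residue number from a substitution such as 'R739H'."""
    match = re.fullmatch(r"[A-Z](\d+)[A-Z]", mutation)
    if match is None:
        raise ValueError(f"not a single-residue substitution: {mutation!r}")
    return int(match.group(1))


def regions_containing(position, regions):
    """Names of all regions whose inclusive range contains the position."""
    return [name for name, start, end in regions if start <= position <= end]


for mutation in ("R739H", "T1107M", "A1136V", "G1885E", "G1886S"):
    position = residue_number(mutation)
    print(
        mutation,
        regions_containing(position, PFAM_DOMAINS),
        regions_containing(position, OTHER_REGIONS),
    )
```

Run on the mutations named in the text, the lookup assigns R739H to the first SPRY domain, T1107M and A1136V to the second, and G1885E and G1886S to DR3 with no listed PFAM domain.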

#### **4. Cloning, expression and characterization of predicted N-terminal hRyR2 domains**

In our previous work we concentrated on the production and characterization of recombinant N-terminal domains of hRyR2 (residues 1–759) in *Escherichia coli* expression systems (Bauerova-Hlinkova et al. 2010; unpublished results). Based on the bioinformatics analysis described above, we assumed that the predicted domains would form individual entities and might behave as stable proteins. We designed several constructs covering the predicted first three N-terminal domains (Ins145\_P3\_rec, MIR, RIH) with various starting and terminating residues (Table 1), taking into consideration the predicted secondary structure elements and the known structure of the related IP3R domains. All fragments were designed so as not to disrupt the predicted secondary structure elements (Fig. 6). In this study we obtained three authentic recombinant hRyR2 fragments with good expression yields and solubility, 1–606 (comprising the first three putative N-terminal domains), 391–606 and 409–606 (both comprising the core of the predicted RIH domain), as well as several hRyR2 fragments expressed with a fusion partner, either thioredoxin or Nus A protein (Table 1).


| hRyR2 fragment | Calculated Mw × 10<sup>3</sup> | Protein expression |
|---|---|---|
| 1–247.His6 | 27.9 | ++ |
| 1–606.His6 | 68.6 | ++ |
| 391–606.His6 | 25.2 | ++ |
| 409–606.His6 | 23.2 | +++ |
| Nus.1–606 | 128.6 | ++ |
| Nus.230–606 | 104.0 | ++ |
| Trx.384–606.His6 | 43.3 | +++ |
| Trx.391–606.His6 | 42.5 | +++ |
| Trx.409–606.His6 | 40.5 | +++ |

Table 1. Designed fragments of the N-terminal part of hRyR2 with good expression and solubility. The fragment 1–247.His6 contains the Ins145\_P3\_rec domain. The longest N-terminal hRyR2 fragment, 1–606.His6, comprises all three putative N-terminal domains. The hRyR2 fragments 384–606, 391–606 and 409–606 comprise the core of the RIH domain. To improve solubility, some fragments were expressed with fusion partners – Nus A protein and thioredoxin. Quantification of expression: ++ 1–5 mg/g expressed cells, +++ more than 5 mg/g expressed cells. The amount of protein was determined after IMAC purification (Bauerova-Hlinkova et al. 2010).

The folding and thermal stability of these expressed fragments were assessed by circular dichroism (CD) spectroscopy (Fig. 4) and a thermofluor shift assay (Fig. 5). The secondary structure content of the fragments was derived from the CD spectra using the CDsstr algorithm (Johnson 1999) with an extensive set of reference databases (Whitmore and Wallace 2004). For the longest fragment, 1–606, the α-helix and β-strand contents were *ca.* 23 and 29%, respectively (Fig. 4A; Bauerova-Hlinkova et al. 2010). For the C-terminal domain covering residues 409 to 606 (Fig. 4B), a higher degree of α-helicity (*ca.* 50%) was found. This value is lower than expected from the model for this region (62%, see section 4.1), which indicates that the expected N-terminal helix might be only partially folded. Such disorder within the terminal regions is frequently observed in NMR and X-ray derived structures of protein fragments. The spectra of the thioredoxin fusion protein fragments 384–606 (Fig. 4C) and 409–606 (Fig. 4D) indicated α-helix and β-strand contents of *ca.* 40 and 10%, respectively.

With temperature increasing up to 35°C, only small changes of the CD signal could be observed for fragment 1–606 (Fig. 4A), but further heating resulted in irreversible precipitation under our experimental conditions (Bauerova-Hlinkova et al. 2010). A similar behaviour was observed for the fragment 409–606, which was stable up to 37°C (Fig. 4B) and started to precipitate at T > 42°C (data not shown). When the fragment 409–606 was fused with thioredoxin A, its CD signal indicated a gradual loss of α-helicity (*ca.* 25% at 95°C) that was fully recovered upon cooling (data not shown).

Fig. 4. Far-UV CD spectra of recombinant hRyR2 fragments 1–606 (A) and 409–606 (B), and of the fusion proteins with thioredoxin A at the N-terminus, Trx-384–606 (C) and Trx-409–606 (D). Spectra were recorded in a 0.02-cm cell at 4°C (solid line) and 37°C (dashed line; 35°C for hRyR2 1–606). Samples were dialyzed into 100 mM NaF, 20 mM Tris SO4 pH 7.5 or 8.0, including either 0.1% Tween-20 (A–C) or sulfobetaine SB14 (D). Deconvolution of the 4°C spectra using the CDsstr algorithm as implemented in Dichroweb with various reference databases ((Whitmore and Wallace 2004); http://dichroweb.cryst.bbk.ac.uk/) gives the amounts of α-helix and β-strand shown at the top right of each panel.

The results of the thermofluor shift assay, performed with the longest hRyR2 fragment, 1–606, were in good agreement with the temperature dependence of the circular dichroism spectra. The thermal stability of the fragment was tested in a wide range of buffers (Tris, HEPES, MES, citrate, Na/K phosphate, Bicine, Tricine; pH range 5.0–9.0), Fig. 5, and in the presence or absence of 1 and 5 mM Ca<sup>2+</sup>, Mg<sup>2+</sup>, ATP and 5–20% glycerol. The fragment is most stable at neutral to slightly basic pH (7.0–8.0), with a Tm of ~45°C (Fig. 5B, C). The presence of Ca<sup>2+</sup> or ATP did not change the Tm significantly, suggesting that the fragment does not contain binding sites for these ligands.

Fig. 5. The first derivative of the thermal denaturation curve of the recombinant 1–606 hRyR2 fragment obtained by thermofluor shift assay, measured in 200 mM Na-citrate, pH 5.5 (A), and 200 mM Na-phosphate, pH 7.5 (B). C. Tm of the recombinant hRyR2 fragment 1–606 obtained by the thermofluor shift assay in different buffers and pH (1 – K-phosphate 5.0; 2 – Na-phosphate 5.5; 3 – Na-citrate 5.5; 4 – MES 5.8; 5 – K-phosphate 6.0; 6 – MES 6.2; 7 – Na-phosphate 6.5; 8 – Na-cacodylate 6.5; 9 – MES 6.5; 10 – K-phosphate 7.0; 11 – HEPES 7.0; 12 – Na-acetate 7.3; 13 – Na-phosphate 7.5; 14 – Tris 7.5; 15 – imidazole 8.0; 16 – HEPES 8.0; 17 – Tris 8.0; 18 – Bicine 8.0; 19 – Tris 8.5; 20 – Bicine 9.0). The conditions under which RyR2 1–606 was the most stable are in red. For each measurement 8 μg of protein were used.

Fig. 6. Sequence alignment of hRyR2 and rabbit RyR1 (PDB ID 2XOA), which was used as the template for homology modeling of the hRyR2 tertiary structure. The alignment was performed using the alignment.align2d command of Modeller 9v8 (Sali and Blundell 1993). The secondary structure elements of hRyR2 (α-helices and β-strands are shown as red bars and blue arrows, respectively) were predicted by Jpred (Cole et al. 2008). The numbering of the predicted secondary structure elements of hRyR2 corresponds to that of the elements found in the RyR1 template structure (α-helices and β-strands are shown as pink bars and light blue arrows, respectively). The alignment covers residues 1–543 of the human RyR2 protein. Residues in grey are missing in the RyR1 structure and correspond to flexible loops. Identical residues (~64%) in both sequences are labelled by asterisks.
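In a thermofluor shift assay the melting temperature is commonly read from the maximum of the first derivative of the fluorescence curve, which is the quantity plotted in Fig. 5A and B. The sketch below illustrates that reading with a central-difference derivative; the function name is hypothetical and the melting curve is a synthetic two-state sigmoid with its midpoint set to 45°C for illustration, not measured data.

```python
import math


def melting_temperature(temperatures, fluorescence):
    """Temperature at which the first derivative dF/dT is largest.

    Uses central differences at interior points, so the first and last
    measurements are never returned. Assumes temperatures are increasing.
    """
    if len(temperatures) != len(fluorescence) or len(temperatures) < 3:
        raise ValueError("need at least three paired measurements")
    best_slope = None
    best_temperature = None
    for i in range(1, len(temperatures) - 1):
        slope = (fluorescence[i + 1] - fluorescence[i - 1]) / (
            temperatures[i + 1] - temperatures[i - 1]
        )
        if best_slope is None or slope > best_slope:
            best_slope = slope
            best_temperature = temperatures[i]
    return best_temperature


# Synthetic curve (assumption): 20-80 °C in 0.5 °C steps, midpoint at 45 °C.
temps = [20.0 + 0.5 * i for i in range(121)]
signal = [1.0 / (1.0 + math.exp(-(t - 45.0) / 2.0)) for t in temps]
tm = melting_temperature(temps, signal)
print(tm)
```

Real curves are noisier than this, so the derivative is usually smoothed before the maximum is taken; that step is left out here for brevity.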

#### **4.1 Model structure of the N-terminal part of hRyR2**

The amino-acid sequence of the recently determined structure of the N-terminal region of rabbit RyR1 (PDB ID 2XOA (Tung et al. 2010)) is 63% identical and 77% similar to the corresponding sequence of hRyR2 (similarity is here defined as having a Gonnet PAM250 matrix score > 0.5 as determined by ClustalX 2.1). The 2XOA structure therefore represents an excellent template for constructing a homology model of the N-terminal region of hRyR2. The structure covers residues 12–543 of the hRyR2 sequence. The homology model was constructed using Modeller 9v8 (Sali and Blundell 1993). The hRyR2 sequence was first aligned with the template structure using the alignment.align2d command of Modeller and the alignment was then manually edited to improve it (Fig. 6). This alignment was used to build a homology model using the automodel class. Residues 90–107 of the human sequence, which are disordered in the template structure, were constrained to form an α-helix in accordance with the secondary structure predictions, see below. The structure was thoroughly refined using automodel.library\_schedule = autosched.slow and automodel.max\_var\_iterations = 300 for the initial optimization step, which was then followed by molecular dynamics refinement using automodel.md\_level = refine.slow. The whole process was repeated twice. The refinement target function (objective function) minimized the geometry restraints and CHARMM energy terms enforcing proper stereochemistry (described in Sali and Blundell 1993).
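The percentage identity used to judge the suitability of a template is a simple count over the columns of the pairwise alignment. The sketch below shows one common convention, counting only columns in which both sequences have a residue; this choice of denominator is an assumption, other conventions give slightly different values, and the two aligned strings are invented rather than taken from RyR1 or hRyR2.

```python
def percent_identity(row_a, row_b):
    """Percentage of identical residues over columns without gaps.

    Both arguments are rows of the same pairwise alignment ('-' marks a gap).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for residue_a, residue_b in zip(row_a, row_b):
        if residue_a == "-" or residue_b == "-":
            continue  # skip columns where either sequence has a gap
        aligned += 1
        if residue_a == residue_b:
            identical += 1
    if aligned == 0:
        raise ValueError("no aligned residue pairs")
    return 100.0 * identical / aligned


# Invented toy alignment: nine gap-free columns, eight of them identical.
template_row = "MGDGEDEV-QF"
target_row = "MADG-DEVAQF"
print(round(percent_identity(template_row, target_row), 1))
```

The similarity figure quoted in the text additionally requires a substitution matrix, which is not reproduced here.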



Fig. 6. The sequence alignment of hRyR2 with rabbit RyR1 (PDB ID 2XOA), whose structure was used as the template for homology modeling of the hRyR2 tertiary structure. The alignment was performed using the alignment.align2d command of Modeller 9v8 (Sali and Blundell 1993). The secondary structure elements of hRyR2 (α-helices and β-strands are shown as red bars and blue arrows, respectively) were predicted by Jpred (Cole et al. 2008). The numbering of the predicted secondary structure elements of hRyR2 follows that of the elements found in the RyR1 template structure (α-helices and β-strands are shown as pink bars and light blue arrows, respectively). The alignment covers residues 1–543 of the human RyR2 protein. Residues in grey are missing in the RyR1 structure and correspond to flexible loops. Residues identical in both sequences (~64%) are labelled by asterisks.


Twenty trial structures were generated and the best one was chosen by first discarding all those with unreasonable geometry in the loop regions, and then selecting the one with the lowest objective function score. The loops in this structure were then refined using the functions of the Modeller loopmodel class (described in Fiser et al. 2000). This was done in two stages. First, the loop containing helix α1A (see below) was refined to give five different positions. Second, for each of these positions a further five structures were generated with different positions for all of the loops. The final structure was selected based on the lowest objective function. It was then inserted into the RyR1 crystal packing arrangement to check for possible clashes with neighbouring molecules, which might indicate unlikely loop conformations; none were found. The final model is shown in Fig. 7A,B.
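The model-building and selection steps can be sketched in Python, the scripting language of Modeller. This is a minimal sketch under stated assumptions, not the authors' script: the alignment file name and the alignment codes are hypothetical, `repeat_optimization = 2` is one reading of "the whole process was repeated twice", and the α-helix restraint on residues 90–107 and the loopmodel refinement are omitted. The `rejected` argument is a hypothetical stand-in for the manual inspection of loop geometry.

```python
def build_models(alnfile="hryr2-2xoa.ali", template="2xoa",
                 target="hryr2", n_models=20):
    """Build trial models with the settings quoted in the text.

    Requires a licensed MODELLER installation; the file name and the
    alignment codes are hypothetical placeholders.
    """
    # Imported lazily so that pick_best() below works without MODELLER.
    from modeller import environ
    from modeller.automodel import automodel, autosched, refine

    env = environ()
    env.io.atom_files_directory = ["."]  # assumption: template PDB is here
    a = automodel(env, alnfile=alnfile, knowns=template, sequence=target)
    a.starting_model = 1
    a.ending_model = n_models             # twenty trial structures
    a.library_schedule = autosched.slow   # thorough initial optimization
    a.max_var_iterations = 300
    a.md_level = refine.slow              # molecular dynamics refinement
    a.repeat_optimization = 2             # assumption: "repeated twice"
    a.make()
    return a.outputs                      # one dict per model


def pick_best(outputs, rejected=()):
    """Return the surviving model with the lowest objective function.

    outputs  : list of dicts with 'name', 'failure' and 'molpdf' keys,
               as produced by automodel.
    rejected : names of models discarded for unreasonable loop geometry
               (hypothetical result of a manual inspection).
    """
    candidates = [m for m in outputs
                  if m.get("failure") is None and m["name"] not in rejected]
    if not candidates:
        raise ValueError("no acceptable model left")
    return min(candidates, key=lambda m: m["molpdf"])


# Usage with made-up scores (no MODELLER run needed for this part).
mock_outputs = [
    {"name": "m1.pdb", "failure": None, "molpdf": 3200.5},
    {"name": "m2.pdb", "failure": None, "molpdf": 2950.1},
    {"name": "m3.pdb", "failure": "optimizer error"},
    {"name": "m4.pdb", "failure": None, "molpdf": 3010.0},
]
best = pick_best(mock_outputs, rejected={"m2.pdb"})
```

Keeping the selection step separate from the build step makes the two-stage choice described above (discard, then rank by objective function) explicit and easy to check on its own.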

The hRyR2 model contains three domains, aa. 12–219 (Ins145\_P3\_rec), 228–408 (MIR) and 411–543 (RIH; residue numbers refer to hRyR2), analogous to those of the template structure (Tung et al. 2010). The RyR1 template structure and the hRyR2 homology model are in agreement with the PFAM predictions shown in Fig. 2; however, the beginning of the third domain, RIH, is shifted by about 40 residues towards the N-terminus.
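As a small illustration of how these boundaries can be used, the following Python sketch maps an hRyR2 residue number to the domain of the model that contains it. The boundaries are the ones given above; the function and variable names are illustrative only.

```python
# Domain boundaries of the hRyR2 model (hRyR2 residue numbering, inclusive).
DOMAINS = [
    ("Ins145_P3_rec", 12, 219),
    ("MIR", 228, 408),
    ("RIH", 411, 543),
]


def domain_of(residue):
    """Return the name of the domain containing the residue, or None
    if it lies in a linker or outside the modelled region."""
    for name, start, end in DOMAINS:
        if start <= residue <= end:
            return name
    return None


# Positions of the disease-associated substitutions discussed in this section.
mutation_domains = {m: domain_of(pos)
                    for m, pos in {"A77V": 77, "V186M": 186, "P466A": 466}.items()}
```

With these boundaries, A77V and V186M fall in Ins145\_P3\_rec and P466A in RIH, although all three sites lie close together in the three-dimensional model.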

With a few exceptions, the hRyR2 homology model (Fig. 7) confirmed the secondary structure predictions. Nearly all of the β-strands (22 out of 27) were found in the predicted positions in the template structure, although with a minor shift of one to three residues (Fig. 6).

Fig. 7. (A) hRyR2 homology model constructed by Modeller 9v8 (Sali and Blundell 1993) using the alignment with rabbit RyR1, shown in Fig. 6. Loops which were disordered in the template structure and the modelled α1A-helix are in orange. Residues which cause physiological dysfunction of hRyR2 when mutated are indicated. (B) The RyR2 homology model coloured according to the domains predicted using the PFAM database. The colour scheme is the same as in Fig. 2; Ins145\_P3\_rec is green, MIR is red, and RIH is yellow. The grey α-helix, α2 (aa. 409–436), according to the PFAM results, precedes the RIH domain, but is not part of the MIR domain.

The beginning of the first β-strand (β1A; aa. 11–16) was missing in the template structure so that its presence could not be verified. The short β-strands β5A, β9A, and β13A were not present in the RyR1 structure. β23 was present in the template and the modelled structure but was not predicted by JPred. Three long α-helices (α2, α3, and α6) were correctly predicted by JPred with small variations at the beginnings or the ends in the model. In the region of the predicted helix α4, two shorter α-helices were modelled which are separated from each other only by a two-residue turn, as found in the template structure. Neither the shortest α-helix (aa. 325–328) nor four out of the five 3₁₀ helices were predicted.

The first predicted α-helix (α1) deserves additional consideration. In comparison to RyR1, the RyR2 sequence contains a 12-residue insertion (Fig. 6), suggesting that the structure of this region differs between the two proteins. In hRyR2, helix α1 was predicted to be 33 residues long (residues 75–107). However, the structure of the Ins145\_P3\_rec domain, which includes helix α1, was determined independently three times in RyR2 and contains only the first ten residues of this helix, followed by a gap (Lobo and Van Petegem 2009). This indicates that the predicted helix α1 cannot span the whole range. Instead, this region most likely forms a helix-turn-helix motif (aa. 75–110), as found in the structure of the IP3R ligand binding suppressor domain (Bosanac et al. 2005). The motif has to be rather flexible to explain both its absence in the previously solved RyR structures and the conformational flexibility predicted by Bosanac et al. (2005). Inclusion of an 18-residue helix α1A, separated from the helix α1 observed in the template structure by a six-residue linker, resulted in five different conformations, which shows that the modelled motif has sufficient mobility to explain its absence in the solved structures. Most likely, this motif will be stabilized in the whole RyR2 protein by interaction with a binding partner (another RyR2 domain or an interacting ligand).

Between the helix α1A and the long disordered loop containing residues 464–477, there is a very large cavity, which suggests the existence of a large binding pocket. The helix α1A and the loop 464–477, which sit atop this cavity, both contain several aromatic residues (W98, F100, H469, H472). The presence of these aromatic residues on the surface, three of which are exposed to the cavity (F100, H469, H472), may indicate protein-protein binding events with a putative ligand. The P466A mutation, located at the beginning of the loop, is known to substantially disrupt the proper physiological function of RyR2, causing syncope (Tester et al. 2005). Proline residues have lower conformational flexibility than other residues, so the most likely reason for the importance of P466 is the need to decrease the flexibility of the loop containing residues 464–477. In addition to P466A, two other mutations in this area are also known to cause RyR2 dysfunction: A77V, causing CPVT and ARVD2 (d'Amati et al. 2005), and V186M, causing CPVT (Tester et al. 2006). All three mutations in this area induce only small changes in the surface shape of the protein. Taken together, this implies that this part of the structure might be involved in the binding of RyR2 to other interacting proteins or to its own domains. This is consistent with the results seen in RyR1, where the docking of the first three N-terminal domains into the full-length cryo-EM density map revealed that the loop involving P455 (P466 in hRyR2) as well as the beginning of helix α4 belong to interface 6 (Tung et al. 2010).

The reliability of the hRyR2 homology model is further supported by the CD-spectroscopy of the hRyR2 fragment 1–606 (Fig. 4A), which revealed 23% α-helices and 29% β-strands in the fragment. This is in good agreement with the hRyR2 model structure, in which the content of α-helices and β-strands was 23% and 24%, respectively.
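A comparison of this kind can be sketched in Python by counting helix and strand residues in a per-residue secondary structure string. This is only an illustration, assuming a DSSP-like one-letter code in which `H` marks α-helix and `E` marks β-strand; the program actually used to assign secondary structure to the model is not stated in the text, and the example string is made up.

```python
def secondary_structure_content(ss, helix_codes="H", strand_codes="E"):
    """Return (percent helix, percent strand) for a per-residue
    secondary structure string (assumed DSSP-like one-letter codes)."""
    n = len(ss)
    if n == 0:
        raise ValueError("empty secondary structure string")
    helix = sum(ss.count(code) for code in helix_codes)
    strand = sum(ss.count(code) for code in strand_codes)
    return 100.0 * helix / n, 100.0 * strand / n


def content_difference(model, experiment):
    """Absolute difference, in percentage points, between two
    (helix, strand) pairs."""
    return tuple(abs(m - e) for m, e in zip(model, experiment))


# Made-up 20-residue assignment: 6 helix and 5 strand residues.
toy_content = secondary_structure_content("CCHHHHHHCCEEEEECCCCC")

# Values quoted in the text: model (23%, 24%) versus CD (23%, 29%).
delta = content_difference((23.0, 24.0), (23.0, 29.0))
```

Note that the model covers residues 12–543 whereas the CD measurement was made on fragment 1–606, so the two percentages refer to slightly different residue ranges.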

#### **5. Conclusion**

In this work, bioinformatics analysis of the whole human RyR2 is presented. The analysis shows that the protein is composed of 14 domains. We were concerned particularly with the first three N-terminal domains of the protein (Ins145\_P3\_rec, MIR, RIH). Verification that the domains identified can behave as separate, independent protein units was provided by their successful expression in *E. coli* and subsequent characterization. CD-spectroscopy was used to determine the domain organization and to identify the secondary structure elements of the N-terminal part of hRyR2. The amino acid sequence identity of hRyR2 with that of rabbit RyR1, the X-ray structure of which is known, is higher than 60%, which allowed the construction of a reliable homology model of the N-terminal part of hRyR2. Its reliability was further strengthened by its conformity with the bioinformatics analysis and the CD-spectroscopy study. This model should allow a clearer insight to be gained into the possible influence of mutations on the cardiac diseases CPVT1 and ARVD2.

#### **6. Acknowledgment**

This work was supported by VEGA grants 2/0131/10 and 2/0190/10 and by APVV-0628-10 and APVV-0721-10. VB would like to thank Dr. Elena Hlinková for encouragement and support during the writing of the chapter.

#### **7. References**

Amador FJ, Liu S, Ishiyama N, Plevin MJ, Wilson A, MacLennan DH, Ikura M. 2009. Crystal structure of type I ryanodine receptor amino-terminal beta-trefoil domain reveals a disease-associated mutation "hot spot" loop. Proc Natl Acad Sci U S A 106: 11040-11044.

Baartscheer A, Schumacher CA, Fiolet JW. 1998. Cytoplasmic sodium, calcium and free energy change of the Na+/Ca2+-exchanger in rat ventricular myocytes. J Mol Cell Cardiol 30: 2437-2447.

Balshaw D, Gao L, Meissner G. 1999. Luminal loop of the ryanodine receptor: a pore-forming segment? Proc Natl Acad Sci U S A 96: 3345-3347.

Bauerova-Hlinkova V, Hostinova E, Gasperik J, Beck K, Borko L, Lai FA, Zahradnikova A, Sevcik J. 2010. Bioinformatic mapping and production of recombinant N-terminal domains of human cardiac ryanodine receptor 2. Protein Expr Purif 71: 33-41.

Berridge MJ. 1994. The biology and medicine of calcium signalling. Mol Cell Endocrinol 98: 119-124.

Bers DM. 2004. Macromolecular complexes regulating cardiac ryanodine receptor function. J Mol Cell Cardiol 37: 417-429.

Bhat MB, Zhao J, Takeshima H, Ma J. 1997. Functional calcium release channel formed by the carboxyl-terminal portion of ryanodine receptor. Biophys J 73: 1329-1336.

Bosanac I, Alattia JR, Mal TK, Chan J, Talarico S, Tong FK, Tong KI, Yoshikawa F, Furuichi T, Iwai M, Michikawa T, Mikoshiba K, Ikura M. 2002. Structure of the inositol 1,4,5-trisphosphate receptor binding core in complex with its ligand. Nature 420: 696-700.

Bosanac I, Yamazaki H, Matsu-Ura T, Michikawa T, Mikoshiba K, Ikura M. 2005. Crystal structure of the ligand binding suppressor domain of type 1 inositol 1,4,5-trisphosphate receptor. Mol Cell 17: 193-203.

Bridge JH, Ershler PR, Cannell MB. 1999. Properties of Ca2+ sparks evoked by action potentials in mouse ventricular myocytes. J Physiol (Lond) 518: 469-478.

Cannell MB, Soeller C. 1997. Numerical analysis of ryanodine receptor activation by L-type channel activity in the cardiac muscle diad. Biophys J 73: 112-122.

Cannell MB, Cheng H, Lederer WJ. 1994. Spatial non-uniformities in [Ca2+]i during excitation-contraction coupling in cardiac myocytes. Biophys J 67: 1942-1956.

Cannell MB, Cheng H, Lederer WJ. 1995. The control of calcium release in heart muscle. Science 268: 1045-1049.

Chan J, Whitten AE, Jeffries CM, Bosanac I, Mal TK, Ito J, Porumb H, Michikawa T, Mikoshiba K, Trewhella J, Ikura M. 2007. Ligand-induced conformational changes via flexible linkers in the amino-terminal region of the inositol 1,4,5-trisphosphate receptor. J Mol Biol 373: 1269-1280.

Chan WM, Welch W, Sitsapesan R. 2000. Structural factors that determine the ability of adenosine and related compounds to activate the cardiac ryanodine receptor. Br J Pharmacol 130: 1618-1626.

Chan WM, Welch W, Sitsapesan R. 2003. Structural characteristics that govern binding to, and modulation through, the cardiac ryanodine receptor nucleotide binding site. Mol Pharmacol 63: 174-182.

Chen SR, Ebisawa K, Li X, Zhang L. 1998. Molecular identification of the ryanodine receptor Ca2+ sensor. J Biol Chem 273: 14675-14678.

Cheng H, Lederer WJ, Cannell MB. 1993. Calcium sparks: elementary events underlying excitation-contraction coupling in heart muscle. Science 262: 740-744.

Cheng H, Lederer MR, Lederer WJ, Cannell MB. 1996. Calcium sparks and [Ca2+]i waves in cardiac myocytes. Am J Physiol 270: C148-C159.

Chu A, Fill M, Stefani E, Entman ML. 1993. Cytoplasmic Ca2+ does not inhibit the cardiac muscle sarcoplasmic reticulum ryanodine receptor Ca2+ channel, although Ca(2+)-induced Ca2+ inactivation of Ca2+ release is observed in native vesicles. J Membr Biol 135: 49-59.

Cole C, Barber JD, Barton GJ. 2008. The Jpred 3 secondary structure prediction server. Nucleic Acids Res 36: W197-201.

Copello JA, Barg S, Sonnleitner A, Porta M, Diaz-Sylvester P, Fill M, Schindler H, Fleischer S. 2002. Differential activation by Ca2+, ATP and caffeine of cardiac and skeletal muscle ryanodine receptors after block by Mg2+. J Membr Biol 187: 51-64.

Coronado R, Morrissette J, Sukhareva M, Vaughan DM. 1994. Structure and function of ryanodine receptors. Am J Physiol 266: C1485-1504.

d'Amati G, Bagattin A, Bauce B, Rampazzo A, Autore C, Basso C, King K, Romeo MD, Gallo P, Thiene G, Danieli GA, Nava A. 2005. Juvenile sudden death in a family with polymorphic ventricular arrhythmias caused by a novel RyR2 gene mutation: evidence of specific morphological substrates. Hum Pathol 36: 761-767.

Du GG, Maclennan DH. 1999. Ca(2+) inactivation sites are located in the COOH-terminal quarter of recombinant rabbit skeletal muscle Ca(2+) release channels (ryanodine receptors). J Biol Chem 274: 26120-26126.

Du GG, Khanna VK, MacLennan DH. 2000. Mutation of divergent region 1 alters caffeine and Ca(2+) sensitivity of the skeletal muscle Ca(2+) release channel (ryanodine receptor). J Biol Chem 275: 11778-11783.

Du GG, Guo X, Khanna VK, MacLennan DH. 2001. Functional characterization of mutants in the predicted pore region of the rabbit cardiac muscle Ca(2+) release channel (ryanodine receptor isoform 2). J Biol Chem 276: 31760-31771.


Durham WJ, Wehrens XH, Sood S, Hamilton SL. 2007. Diseases associated with altered ryanodine receptor activity. Subcell Biochem 45: 273-321.

Ebashi S, Ogawa Y. 1988. Ca2+ in contractile processes. Biophys Chem 29: 137-143.

El-Hayek R, Saiki Y, Yamamoto T, Ikemoto N. 1999. A postulated role of the near amino-terminal domain of the ryanodine receptor in the regulation of the sarcoplasmic reticulum Ca(2+) channel. J Biol Chem 274: 33341-33347.

Fabiato A. 1985. Time and calcium dependence of activation and inactivation of calcium-induced release of calcium from the sarcoplasmic reticulum of a skinned canine cardiac Purkinje cell. J Gen Physiol 85: 247-289.

Faltinova A, Gaburjakova J, Zahradnikova A. 2011. Activation of the rat cardiac ryanodine receptor by its domain peptide DPcpvt-C. Physiol Res 60: *in press*.

Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A. 2008. The Pfam protein families database. Nucleic Acids Res 36: D281-288.

Fiser A, Do RK, Sali A. 2000. Modeling of loops in protein structures. Protein Sci 9: 1753-1773.

Gaburjakova J, Gaburjakova M. 2006. Comparison of the effects exerted by luminal Ca2+ on the sensitivity of the cardiac ryanodine receptor to caffeine and cytosolic Ca2+. J Membr Biol 212: 17-28.

Gao L, Balshaw D, Xu L, Tripathy A, Xin C, Meissner G. 2000. Evidence for a role of the lumenal M3-M4 loop in skeletal muscle Ca(2+) release channel (ryanodine receptor) activity and conductance. Biophys J 79: 828-840.

George CH, Yin CC, Lai FA. 2005. Toward a molecular understanding of the structure-function of ryanodine receptor Ca2+ release channels: perspectives from recombinant expression systems. Cell Biochem Biophys 42: 197-222.

George CH, Jundi H, Thomas NL, Fry DL, Lai FA. 2007. Ryanodine receptors and ventricular arrhythmias: emerging trends in mutations, mechanisms and therapies. J Mol Cell Cardiol 42: 34-50.

George CH, Jundi H, Thomas NL, Scoote M, Walters N, Williams AJ, Lai FA. 2004. Ryanodine receptor regulation by intramolecular interaction between cytoplasmic and transmembrane domains. Mol Biol Cell 15: 2627-2638.

Gyorke I, Gyorke S. 1998. Regulation of the cardiac ryanodine receptor channel by luminal Ca2+ involves luminal Ca2+ sensing sites. Biophys J 75: 2801-2810.

Gyorke I, Hester N, Jones LR, Gyorke S. 2004. The role of calsequestrin, triadin, and junctin in conferring cardiac ryanodine receptor responsiveness to luminal calcium. Biophys J 86: 2121-2128.

Gyorke S, Fill M. 1993. Ryanodine receptor adaptation: Control mechanism of Ca2+-induced Ca2+ release in heart. Science 260: 807-809.

Gyorke S, Velez P, Suarez-Isla B, Fill M. 1994. Activation of single cardiac and skeletal ryanodine receptor channels by flash photolysis of caged Ca2+. Biophys J 66: 1879-1886.

Hamada T, Tashiro K, Tada H, Inazawa J, Shirozu M, Shibahara K, Nakamura T, Martina N, Nakano T, Honjo T. 1996. Isolation and characterization of a novel secretory protein, stromal cell-derived factor-2 (SDF-2) using the signal sequence trap method. Gene 176: 211-214.

Ikemoto N, Yamamoto T. 2000. Postulated role of inter-domain interaction within the ryanodine receptor in Ca(2+) channel regulation. Trends Cardiovasc Med 10: 310-316.

Johnson WC. 1999. Analyzing protein circular dichroism spectra for accurate secondary structures. Proteins 35: 307-312.

Kagaya Y, Weinberg EO, Ito N, Mochizuki T, Barry WH, Lorell BH. 1995. Glycolytic inhibition: effects on diastolic relaxation and intracellular calcium handling in hypertrophied rat ventricular myocytes. J Clin Invest 95: 2766-2776.

Kettlun C, Gonzalez A, Rios E, Fill M. 2003. Unitary Ca2+ current through mammalian cardiac and amphibian skeletal muscle ryanodine receptor channels under near-physiological ionic conditions. J Gen Physiol 122: 407-417.

Kleiber ML, Singh SM. 2009. Divergence of the vertebrate sp1A/ryanodine receptor domain and SOCS box-containing (Spsb) gene family and its expression and regulation within the mouse brain. Genomics 93: 358-366.

Lai FA, Erickson HP, Rousseau E, Liu QY, Meissner G. 1988. Purification and reconstitution of the calcium release channel from skeletal muscle. Nature 331: 315-319.

Lamb GD. 1993. Ca2+ inactivation, Mg2+ inhibition and malignant hyperthermia. J Muscle Res Cell Motil 14: 554-556.

Lanner JT, Georgiou DK, Joshi AD, Hamilton SL. 2010. Ryanodine receptors: structure, expression, molecular details, and function in calcium release. Cold Spring Harb Perspect Biol 2: a003996.

Laver DR. 2007. Ca2+ stores regulate ryanodine receptor Ca2+ release channels via luminal and cytosolic Ca2+ sites. Biophys J 92: 3541-3555.

Laver DR. 2009. Luminal Ca(2+) activation of cardiac ryanodine receptors by luminal and cytoplasmic domains. Eur Biophys J 39: 19-26.

Laver DR, Honen BN. 2008. Luminal Mg2+, a key factor controlling RYR2-mediated Ca2+ release: cytoplasmic and luminal regulation modeled in a tetrameric channel. J Gen Physiol 132: 429-446.

Laver DR, Baynes TM, Dulhunty AF. 1997. Magnesium inhibition of ryanodine-receptor calcium channels: Evidence for two independent mechanisms. J Membr Biol 156: 213-229.

Li P, Chen SR. 2001. Molecular basis of Ca(2+) activation of the mouse cardiac Ca(2+) release channel (ryanodine receptor). J Gen Physiol 118: 33-44.

Liu Z, Zhang J, Li P, Chen SR, Wagenknecht T. 2002. Three-dimensional reconstruction of the recombinant type 2 ryanodine receptor and localization of its divergent region 1. J Biol Chem 277: 46712-46719.

Liu Z, Zhang J, Wang R, Wayne Chen SR, Wagenknecht T. 2004. Location of divergent region 2 on the three-dimensional structure of cardiac muscle ryanodine receptor/calcium release channel. J Mol Biol 338: 533-545.

Liu Z, Wang R, Zhang J, Chen SR, Wagenknecht T. 2005. Localization of a disease-associated mutation site in the three-dimensional structure of the cardiac muscle ryanodine receptor. J Biol Chem 280: 37941-37947.

Liu Z, Zhang J, Sharma MR, Li P, Chen SR, Wagenknecht T. 2001. Three-dimensional reconstruction of the recombinant type 3 ryanodine receptor and localization of its amino terminus. Proc Natl Acad Sci U S A 98: 6104-6109.


Orlova EV, Serysheva II, van Heel M, Hamilton SL, Chiu W. 1996. Two structural configurations of the skeletal muscle calcium release channel. Nat Struct Biol 3: 547-552.

Parker I, Zang WJ, Wier WG. 1996. Ca2+ sparks involving multiple Ca2+ release sites along Z-lines in rat heart cells. J Physiol (Lond) 497: 31-38.

Perez CF, Mukherjee S, Allen PD. 2003. Amino acids 1-1,680 of ryanodine receptor type 1 hold critical determinants of skeletal type for excitation-contraction coupling. Role of divergence domain D2. J Biol Chem 278: 39644-39652.

Ponting C, Schultz J, Bork P. 1997. SPRY domains in ryanodine receptors (Ca(2+)-release channels). Trends Biochem Sci 22: 193-194.

Ponting CP. 2000. Novel repeats in ryanodine and IP3 receptors and protein O-mannosyltransferases. Trends Biochem Sci 25: 48-50.

Protasi F. 2002. Structural interaction between RYRs and DHPRs in calcium release units of cardiac and skeletal muscle cells. Front Biosci 7: d650-658.

Qin J, Valle G, Nani A, Nori A, Rizzi N, Priori SG, Volpe P, Fill M. 2008. Luminal Ca2+ regulation of single cardiac ryanodine receptors: insights provided by calsequestrin and its mutants. J Gen Physiol 131: 325-334.

Rhodes DA, de Bono B, Trowsdale J. 2005. Relationship between SPRY and B30.2 protein domains. Evolution of a component of immune defence? Immunology 116: 411-417.

Rios E, Brum G. 1987. Involvement of dihydropyridine receptors in excitation-contraction coupling in skeletal muscle. Nature 325: 717-720.

Rios E, Karhanek M, Ma J, Gonzalez A. 1993. An allosteric model of the molecular interactions of excitation-contraction coupling in skeletal muscle. J Gen Physiol 102: 449-481.

Sali A, Blundell TL. 1993. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234: 779-815.

Samso M, Wagenknecht T. 2002. Apocalmodulin and Ca2+-calmodulin bind to neighboring locations on the ryanodine receptor. J Biol Chem 277: 1349-1353.

Samso M, Feng W, Pessah IN, Allen PD. 2009. Coordinated movement of cytoplasmic and transmembrane domains of RyR1 upon gating. PLoS Biol 7: e85.

Santana LF, Cheng H, Gomez AM, Cannell MB, Lederer WJ. 1996. Relation between the sarcolemmal Ca2+ current and Ca2+ sparks and local control theories for cardiac excitation-contraction coupling. Circ Res 78: 166-171.

Schiefer A, Meissner G, Isenberg G. 1995. Ca2+ activation and Ca2+ inactivation of canine reconstituted cardiac sarcoplasmic reticulum Ca(2+)-release channels. J Physiol (Lond) 489: 337-348.

Serysheva II, Hamilton SL, Chiu W, Ludtke SJ. 2005. Structure of Ca2+ release channel at 14 A resolution. J Mol Biol 345: 427-431.

Serysheva II, Schatz M, van Heel M, Chiu W, Hamilton SL. 1999. Structure of the skeletal muscle calcium release channel activated with Ca2+ and AMP-PCP. Biophys J 77: 1936-1944.

Serysheva II, Ludtke SJ, Baker ML, Cong Y, Topf M, Eramian D, Sali A, Hamilton SL, Chiu W. 2008. Subnanometer-resolution electron cryomicroscopy-based domain models for the cytoplasmic region of skeletal muscle RyR channel. Proc Natl Acad Sci U S A 105: 9610-9615.


Lobo PA, Van Petegem F. 2009. Crystal structures of the N-terminal domains of cardiac and

Lopez-Lopez JR, Shacklock PS, Balke CW, Wier WG. 1995. Local calcium transients

Lukyanenko V, Gyorke I, Subramanian S, Smirnov A, Wiesner TF, Gyorke S. 2000. Inhibition

Marjamaa A, Laitinen-Forsblom P, Lahtinen AM, Viitasalo M, Toivonen L, Kontula K, Swan

Marx SO, Marks AR. 2002. Regulation of the ryanodine receptor in heart failure. Basic Res

Marx SO, Reiken S, Hisamatsu Y, Gaburjakova M, Gaburjakova J, Yang YM, Rosemblit N,

Medeiros-Domingo A, Bhuiyan ZA, Tester DJ, Hofman N, Bikker H, van Tintelen JP,

Meissner G. 1994. Ryanodine receptor/Ca2+ release channels and their regulation by

Meissner G. 2002. Regulation of mammalian ryanodine receptors. Front Biosci 7: d2072-2080. Meissner G. 2004. Molecular regulation of cardiac ryanodine receptor ion channel. Cell

Meissner G, Henderson JS. 1987. Rapid calcium release from cardiac sarcoplasmic reticulum

Mejia-Alvarez R, Kettlun C, Rios E, Stern M, Fill M. 1999. Unitary Ca2+ current through

Milting H, Lukas N, Klauke B, Korfer R, Perrot A, Osterziel KJ, Vogt J, Peters S, Thieleczek

Murzin AG, Lesk AM, Chothia C. 1992. beta-Trefoil fold. Patterns of structure and sequence

Ono M, et al. 2010. Dissociation of calmodulin from cardiac ryanodine receptor causes

aberrant Ca(2+) release in heart failure. Cardiovasc Res 87: 609-617.

vesicles is dependent on Ca2+ and is modulated by Mg2+, adenine nucleotide, and

cardiac ryanodine receptor channels under quasi-physiological ionic conditions. J

R, Varsanyi M. 2006. Composite polymorphisms in the ryanodine receptor 2 gene associated with arrhythmogenic right ventricular cardiomyopathy. Cardiovasc Res

in the Kunitz inhibitors interleukins-1 beta and 1 alpha and fibroblast growth

novel role for leucine/isoleucine zippers. J Cell Biol 153: 699-708.

mutational analysis. J Am Coll Cardiol 54: 2065-2074.

endogenous effectors. Annu Rev Physiol 56: 485-508.

calmodulin. J Biol Chem 262: 3065-3073.

1505-1514.

1042-1045.

Biophys J 79: 1273-1284.

BMC Med Genet 10: 12.

Calcium 35: 621-628.

Gen Physiol 113: 177-186.

factors. J Mol Biol 223: 531-543.

71: 496-505.

Cardiol 97 Suppl 1: I49-51.

skeletal muscle ryanodine receptors: insights into disease mutations. Structure 17:

triggered by single L-type calcium channel currents in cardiac cells. Science 268:

of Ca(2+) sparks by ruthenium red in permeabilized rat ventricular myocytes.

H. 2009. Search for cardiac calcium cycling gene mutations in familial ventricular arrhythmias resembling catecholaminergic polymorphic ventricular tachycardia.

Marks AR. 2001. Phosphorylation-dependent regulation of ryanodine receptors: a

Mannens MM, Wilde AA, Ackerman MJ. 2009. The RYR2-encoded ryanodine receptor/calcium release channel in patients diagnosed previously with either catecholaminergic polymorphic ventricular tachycardia or genotype negative, exercise-induced long QT syndrome: a comprehensive open reading frame


Bioinformatics Domain Structure Prediction

Front Biosci 7: d1464-1474.

101: 3979-3984.

32: W668-673.

182.

660-666.

11625.

Chem 266: 11144-11152.

receptor. J Biol Chem 286: 12202-12212.

receptor channels. Q Rev Biophys 34: 61-104.

into the B30.2/SPRY domain. EMBO J 25: 1353-1363.

reticulum lumenal Ca2+. Biophys J 75: 2302-2312.

calcium release channel. Biophys J 78: 1270-1281.

simulated ischemic conditions. Circ Res 79: 1100-1109.

and Homology Modeling of Human Ryanodine Receptor 2 351

Velez P, Gyorke S, Escobar AL, Vergara J, Fill M. 1997. Adaptation of single cardiac

Wagenknecht T, Samso M. 2002. Three-dimensional reconstruction of ryanodine receptors.

Wang R, Zhong X, Meng X, Koop A, Tian X, Jones PP, Fruen BR, Wagenknecht T, Liu Z,

Wang SQ, Song LS, Lakatta EG, Cheng H. 2001. Ca2+ signalling between single L-type Ca2+ channels and ryanodine receptors in heart cells. Nature 410: 592-596. Wang SQ, Stern MD, Rios E, Cheng H. 2004. The quantal nature of Ca2+ sparks and in situ

Whitmore L, Wallace BA. 2004. DICHROWEB, an online server for protein secondary

Williams AJ. 1992. Ion conduction and discrimination in the sarcoplasmic reticulum ryanodine receptor/calcium-release channel. J Muscle Res Cell Motil 13: 7-26. Williams AJ, West DJ, Sitsapesan R. 2001. Light at the end of the Ca(2+)-release channel

Witcher DR, Kovacs RJ, Schulman H, Cefali DC, Jones LR. 1991. Unique phosphorylation

Woo JS, Imm JH, Min CK, Kim KJ, Cha SS, Oh BH. 2006. Structural and functional insights

Xiong L, Zhang JZ, He R, Hamilton SL. 2006. A Ca2+-binding domain in RyR1 that interacts

Xu L, Meissner G. 1998. Regulation of cardiac muscle Ca2+ release channel by sarcoplasmic

Xu L, Mann G, Meissner G. 1996. Regulation of cardiac Ca2+ release channel (ryanodine

Xu X, Bhat MB, Nishi M, Takeshima H, Ma J. 2000. Molecular cloning of cDNA encoding a

Xu X, et al. 2010. Defective calmodulin binding to the cardiac ryanodine receptor plays a key

Yamamoto T, Ikemoto N. 2002. Peptide probe study of the critical regulatory domain of the cardiac ryanodine receptor. Biochem Biophys Res Commun 291: 1102-1108. Yamamoto T, El-Hayek R, Ikemoto N. 2000. Postulated role of interdomain interaction

Chen SR. 2011. Localization of the dantrolene-binding sequence near the FK506 binding protein-binding site in the three-dimensional structure of the ryanodine

operation of the ryanodine receptor array in cardiac cells. Proc Natl Acad Sci U S A

structure analyses from circular dichroism spectroscopic data. Nucleic Acids Res

tunnel: structures and mechanisms involved in ion translocation in ryanodine

site on the cardiac ryanodine receptor regulates calcium channel activity. J Biol

with the calmodulin binding site and modulates channel activity. Biophys J 90: 173-

receptor) by Ca2+, H+, Mg2+, and adenine nucleotides under normal and

drosophila ryanodine receptor and functional studies of the carboxyl-terminal

role in CPVT-associated channel dysfunction. Biochem Biophys Res Commun 394:

within the ryanodine receptor in Ca(2+) channel regulation. J Biol Chem 275: 11618-

ryanodine receptor channels. Biophys J 72: 691-697.


Shannon TR, Guo T, Bers DM. 2003. Ca2+ scraps: local depletions of free [Ca2+] in cardiac

Sharma MR, Jeyakumar LH, Fleischer S, Wagenknecht T. 2000. Three-dimensional structure

Sharma MR, Penczek P, Grassucci R, Xin HB, Fleischer S, Wagenknecht T. 1998.

Shatsky M, Nussinov R, Wolfson HJ. 2004. A method for simultaneous alignment of

Smith JS, Imagawa T, Ma J, Fill M, Campbell KP, Coronado R. 1988. Purified ryanodine

Sorrentino V. 1995. The ryanodine receptor family of intracellular calcium release channels.

Stern MD. 1992. Theory of excitation - contraction coupling in cardiac muscle. Biophys J 63:

Strahl-Bolsinger S, Scheinost A. 1999. Transmembrane topology of pmt1p, a member of an

Takasago T, Imagawa T, Furukawa K, Ogurusu T, Shigekawa M. 1991. Regulation of the

Tateishi H, Yano M, Mochizuki M, Suetomi T, Ono M, Xu X, Uchinoumi H, Okuda S, Oda T,

cause of diastolic Ca2+ leak in failing hearts. Cardiovasc Res 81: 536-545. Terentyev D, Kubalova Z, Valle G, Nori A, Vedamoorthyrao S, Terentyeva R, Viatchenko-

Tester DJ, Kopplin LJ, Will ML, Ackerman MJ. 2005. Spectrum and prevalence of cardiac

explicitly for long QT syndrome genetic testing. Heart Rhythm 2: 1099-1105. Tester DJ, Arya P, Will M, Haglund CM, Farley AL, Makielski JC, Ackerman MJ. 2006.

Tung CC, Lobo PA, Kimlicka L, Van Petegem F. 2010. The amino-terminal disease hotspot of ryanodine receptors forms a cytoplasmic vestibule. Nature 468: 585-588. Unnerstale S, Lind J, Papadopoulos E, Maler L. 2009. Solution structure of the HsapBK K+ channel voltage-sensor paddle sequence. Biochemistry 48: 5813-5821. Valdivia HH, Kaplan JH, Ellis-Davies GC, Lederer WJ. 1995. Rapid adaptation of cardiac

mutations linked to sudden cardiac death. Biophys J 95: 2037-2048.

cryo-electron microscopy. J Biol Chem 275: 9485-9491.

multiple protein structures. Proteins 56: 143-156.

Res 93: 40-45.

Biol Chem 273: 18429-18434.

reticulum. J Gen Physiol 92: 1-26.

Adv Pharmacol 33: 67-90.

497-517.

274: 9068-9075.

Biochem 109: 163-170.

Heart Rhythm 3: 800-805.

2000.

sarcoplasmic reticulum during contractions leave substantial Ca2+ reserve. Circ

of ryanodine receptor isoform three in two conformational states as visualized by

Cryoelectron microscopy and image analysis of the cardiac ryanodine receptor. J

receptor from rabbit skeletal muscle is the calcium-release channel of sarcoplasmic

evolutionarily conserved family of protein O-mannosyltransferases. J Biol Chem

cardiac ryanodine receptor by protein kinase-dependent phosphorylation. J

Kobayashi S, Yamamoto T, Ikeda Y, Ohkusa T, Ikemoto N, Matsuzaki M. 2009. Defective domain-domain interactions within the ryanodine receptor as a critical

Karpinski S, Bers DM, Williams SC, Volpe P, Gyorke S. 2008. Modulation of SR Ca release by luminal Ca and calsequestrin in cardiac myocytes: effects of CASQ2

ryanodine receptor (RyR2) mutations in a cohort of unrelated patients referred

Genotypic heterogeneity and phenotypic mimicry among unrelated patients referred for catecholaminergic polymorphic ventricular tachycardia genetic testing.

ryanodine receptors: modulation by Mg2+ and phosphorylation. Science 267: 1997-


**17**

### **Identifying Enzyme Knockout Strategies on Multiple Enzyme Associations**

Bin Song<sup>1</sup>, I. Esra Büyüktahtakın<sup>2</sup>, Nirmalya Bandyopadhyay<sup>1</sup>, Sanjay Ranka<sup>1</sup> and Tamer Kahveci<sup>1</sup>

<sup>1</sup>*CISE Department, University of Florida, Gainesville*
<sup>2</sup>*Systems and Industrial Engineering, University of Arizona, Tucson*
*USA*

#### **1. Introduction**

Many biochemical engineering applications in drug discovery, food generation and cosmetic production aim to modify the metabolism of a given organism to increase or decrease the production of a specific compound or a set of compounds. For example:

1. Fatty acid biosynthesis pathway converts fatty acids that are used in the cosmetic industry in creams and lotions.

2. Butanoate metabolism produces poly-*β*-hydroxybutyrate, which is essential for producing plastics.

3. Mevalonic acid pathway and MEP/DOXP pathway produce carotenoids, which are often used as anti-oxidants in the food industry.

The metabolisms of many organisms, such as bacteria, algae and plants, naturally produce these compounds. A common practice is to extract them from these organisms.

Enzymes play a significant role in metabolism. They catalyze the chemical reactions that transform a set of substrates (i.e., input compounds) into products (i.e., output compounds). Metabolic engineering techniques often aim to manipulate a small set of genes to alter the speed of the targeted enzymatic reactions. Their eventual goal is to reach a desired level of compound concentrations produced or consumed by these reactions. One way to alter the speed of the reactions dramatically is to knock out a set of enzymes. When an enzyme is knocked out, it can no longer catalyze its reactions, which changes the production of compounds.

When detailed *in silico* models are available, computational methods can be successfully used to determine the enzyme set to knock out. These methods, when applicable, have much lower time and cost requirements than *in vitro* or *in vivo* experiments conducted in wet labs. Wet-lab experiments often require substantial effort and time from domain experts, and the overall time requirements may range from hours to several days. Moreover, the cost of wet-lab experimentation increases significantly when more than one enzyme needs to be knocked out. Manipulations that involve four to six enzymes are not uncommon. As a result, biologists often employ computational methods as a preprocessing step to filter out less promising compounds.

A number of heuristic *in silico* solutions exist to find a promising set of enzymes. However, finding the set of enzymes whose knockout achieves the optimal compound production rate is a computationally difficult problem. The number of possible subsets of enzymes that can be considered for manipulation grows exponentially with the number of enzymes in the pathway. Even if the size of each potential subset is limited to at most four, the number of possible subsets for a pathway consisting of 500 enzymes is more than 2.5 billion. Therefore, efficient methods that avoid inspecting the entire search space are necessary.

In order to find a promising set of enzymes to knock out, we first need a computational method to evaluate the metabolic system after some enzymes are knocked out. There are several models to simulate the steady state of a metabolic network. We categorize these methods into three groups, namely Boolean models, linear models and non-linear models. Boolean models can oversimplify the metabolic network, especially as the number of reactions and their connectivity increase. Non-linear models require additional information about the network, which may not be available. *Flux Balance Analysis (FBA)* is a popular linear model that is widely used to compute the flux distribution at the steady state of metabolic networks (Bonarius et al., 1997; Forster et al., 2003; Kauffman et al., 2003). Segre et al. presented a quadratic programming method named minimization of metabolic adjustment (MOMA) (Segre et al., 2002). Shlomi et al. described a mixed integer programming (MIP) method called regulatory on/off minimization (ROOM) for predicting the metabolic steady states after gene or enzyme knockouts (Shlomi et al., 2005).

It is easy to use these models to determine the impact on the metabolism when a given set of genes is knocked out. However, as discussed earlier, we are interested in finding the subset of enzymes that leads to a desired impact. Optknock (Burgard et al., 2003), OptReg (Pharkya & Maranas, 2006) and OptStrain (Pharkya et al., 2004) are three MIP-based methods for identifying the enzymes to be knocked out under the FBA model. All these methods make the simplifying assumption that each reaction is catalyzed by only one enzyme. This simplification allows a quick conversion of the underlying variables using linear constraints, so that mixed integer linear programming (MILP) or quadratic programming can be used to solve the problem. However, in real metabolic networks, more than one enzyme can be involved in catalyzing a reaction. In particular, two or more enzymes can substitute for each other or work collaboratively to catalyze a reaction. Figure 1 illustrates this on a real example adapted from Reed et al. (Reed et al., 2003). Here we briefly describe these two kinds of enzyme associations.

• **Collaborative enzymes:** Some reactions require the presence of two or more proteins or enzymes simultaneously. We call such enzymes *collaborative enzymes*. In this case, the absence of even one of these enzymes is sufficient to slow down or stop the reaction. Logically, there is a Boolean *AND* relation among these enzymes. In Figure 1 (top portion), D-Xylose ABC Transporter is responsible for exporting/importing a variety of molecules to and from bacteria. To carry out this function, the genes XylF, XylG and XylH jointly work to catalyze the reaction XYLabc.

• **Substitute enzymes:** Two or more enzymes can substitute for each other in catalyzing a reaction. We call such enzymes *substitute enzymes*. In this case, the presence of one of the substitute enzymes suffices to carry out that reaction. Logically, there is a Boolean *OR* relation among these enzymes. In Figure 1 (bottom portion), Glyceraldehyde 3-Phosphate Dehydrogenase works in a number of metabolic pathways such as the Glycolysis / Gluconeogenesis pathway or biosynthesis of phenylpropanoids. In a number of organisms such as *Arabidopsis thaliana* (*A. thaliana*), this can be done by GapA or GapC with *OR* association.

One can easily generalize the notion of collaborative and substitute enzymes. Thus, a complex topology consisting of multiple enzymes connected by a combination of *OR* and *AND* may catalyze a reaction.

Fig. 1. The figure depicts two examples of reactions catalyzed by multiple enzymes. In the top portion, D-Xylose ABC Transporter is responsible for exporting/importing a variety of molecules to and from bacteria. To carry out this function the genes XylF, XylG and XylH jointly work to catalyze the reaction XYLabc with *AND* association. In the other portion, Glyceraldehyde 3-Phosphate Dehydrogenase works in a number of metabolic pathways such as the Glycolysis / Gluconeogenesis pathway or biosynthesis of phenylpropanoids. Here the proteins GapA and GapC catalyze the reaction GAPD with *OR* association.

Our goal is to find the optimal set of enzymes to knock out, in the presence of multiple enzymes jointly catalyzing the same reaction, so that the production of the system is optimal. In summary, the main contributions of this chapter are as follows:

• We prove that the problem of finding the optimal enzyme set to knock out using MILP-based approaches is NP-hard even when only one enzyme catalyzes each reaction. This result is consistent with the observation that the execution time of the Optknock framework increases exponentially as the network size increases.

• We develop two solutions to deal with multiple enzyme associations along with linear constraints. Our solutions eliminate the limitation that each reaction is catalyzed by a single enzyme. In our model, we allow multiple substitute and collaborative enzymes. Our first solution uses a small number of binary variables in the underlying MILP formulation. The second method increases the number of binary variables but requires a smaller number of constraints. Inclusion of multiple enzymes significantly extends the applicability of our methods, as in real networks, multiple enzymes can catalyze a reaction.
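The two association types and the size of the search space quoted in the introduction can be made concrete with a small sketch. The encoding below is an assumption made purely for illustration and is not the chapter's formulation: a rule is either an enzyme name or a tuple whose first element is `"AND"` or `"OR"`, and the names `RULES`, `is_active`, `enzymes_in`, `minimal_disabling_sets` and `candidate_sets` are hypothetical. The reaction and enzyme names come from Figure 1.

```python
from itertools import combinations
from math import comb

# Hypothetical encoding (not from the chapter): a rule is an enzyme name
# or a tuple ("AND" | "OR", sub_rule, sub_rule, ...).
RULES = {
    "XYLabc": ("AND", "XylF", "XylG", "XylH"),  # collaborative enzymes
    "GAPD": ("OR", "GapA", "GapC"),             # substitute enzymes
}


def is_active(rule, knocked_out):
    """Return True if the rule is still satisfied after the knockouts."""
    if isinstance(rule, str):
        return rule not in knocked_out
    op, *operands = rule
    results = [is_active(r, knocked_out) for r in operands]
    return all(results) if op == "AND" else any(results)


def enzymes_in(rule):
    """Collect every enzyme name that appears in a (nested) rule."""
    if isinstance(rule, str):
        return {rule}
    return set().union(*(enzymes_in(r) for r in rule[1:]))


def minimal_disabling_sets(rule):
    """Smallest knockout sets that switch the reaction off (brute force)."""
    enzymes = sorted(enzymes_in(rule))
    for size in range(1, len(enzymes) + 1):
        hits = [set(c) for c in combinations(enzymes, size)
                if not is_active(rule, set(c))]
        if hits:
            return hits
    return []


def candidate_sets(n_enzymes, max_size):
    """Number of knockout sets containing 1..max_size enzymes."""
    return sum(comb(n_enzymes, k) for k in range(1, max_size + 1))


if __name__ == "__main__":
    print(minimal_disabling_sets(RULES["XYLabc"]))  # one enzyme suffices
    print(minimal_disabling_sets(RULES["GAPD"]))    # both must be removed
    print(candidate_sets(500, 4))                   # exceeds 2.5 billion
```

Under this encoding, a single knockout disables an *AND* reaction, whereas an *OR* reaction survives until every substitute enzyme is removed. The brute-force search is only practical for toy rules; the count returned by `candidate_sets(500, 4)` shows why exhaustive enumeration does not scale to whole pathways.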


Our experiments using the synthetic and real datasets demonstrate that allowing multiple enzymes to catalyze a reaction increases the computational cost of the solution as compared

Shlomi et al. describes a mixed integer programming method, named regulatory on/off minimization (ROOM), for predicting the metabolic steady states after gene or enzyme knockouts (Shlomi et al., 2005). ROOM finds the flux distribution which minimizes the number of significant flux changes from the wild-type flux distribution. Experiments demonstrate that MOMA and ROOM are superior to FBA in their ability to predict the resultant states after gene or enzyme knockouts. Optknock is an enzyme knockout strategy based on the FBA model (Burgard et al., 2003). It uses a bi-level programming framework for identifying the enzymes to be knocked out. In the inner level, the optimization finds the flux distribution for a given cellular objective such as maximization of biomass yield or minimization of metabolic adjustment (MOMA) (Alper et al., 2005; Segre et al., 2002) etc). In the outer level, the optimization finds the enzymes to be knocked out to optimize a biological objective (e.g., chemical production). OptReg is another bilevel programming method for the enzyme knockout strategy (Pharkya & Maranas, 2006). The difference between Optknock and OptReg is that Optknock framework considers only two states (knockout vs non-knockout) for each reaction which are controlled by enzymes. However, OptReg considers three sets of binary variables for each reaction. These correspond to knockout or non-knockout and down regulation or up regulation. Thus, OptReg provides more candidate manipulation solutions for enzymes. OptStrain, a MILP based method, identifies desired phenotypes by adding or deleting genes or enzymes (Pharkya et al., 2004). All of the above methods use a MILP or a quadratic programming method. Although the objective function of these methods may not be linear, the constraints are

Identifying Enzyme Knockout Strategies on Multiple Enzyme Associations 357

**- Non-linear models:** Although less prevalent, these methods are also used to describe metabolic networks. These methods incorporate further details about the network and thus can simulate the cell system better than the linear model. S-systems (Savageau & Voit, 1987; Voit, 2000) and GMA model (Peschel & Mende, 1986; Voit, 2000) are two examples of non-linear models for metabolic networks. Song et al. proposes methods for these non-linear models (Song et al., 2011). Patil et al. presents an evolutionary programming method which can be applied to non-linear models (Patil et al., 2005). These heuristic solutions use non-linear models with non-linear constraints. They are not guaranteed to produce optimal solutions. Also, the non-linear constraint models require additional information about the network, which may not be available. Therefore, in this chapter we

The methods described in this chapter use a linear model. Our major contribution, as discussed already, is to allow multiple enzymes to catalyze a reaction. This significantly extends the usability of such methods, as in real networks more than one enzymes can catalyze

Given a metabolic network and an objective function, one standard way to find the optimal set of enzyme knockouts is to solve the problem as an MILP which is modeled using FBA. In this section, we focus on the MILP formulation of the enzyme knockout problem and prove

that the problem is NP-hard even when a single enzyme catalyzes each reaction.

linear.

a reaction.

**3. Problem formulation**

still focus on the linear constraint model.

to the cases when all reactions are catalyzed by a single enzyme. In our experiments, we observe that our second method that introduces extra binary variables is significantly superior to our first method in terms of execution time. These results also demonstrate that the enzyme topology can have a substantial influence on the performance of the MILP solution.

The rest of the chapter is organized as follows. Section 2 discusses the related work for this chapter. Section 3 proves that finding the optimal set of enzymes to knock out using MILP is NP-hard even when we allow only one enzyme to catalyze each reaction. Section 4 describes the proposed methods when a reaction is catalyzed by multiple enzymes. Section 5 discusses experimental results. We conclude our discussion in Section 6.

#### **2. Related work**

In order to identify a promising set of enzymes to knock out, we first require a computational method to evaluate the state of the metabolic system after multiple knockouts. There are several models to simulate the steady state of a metabolic network. These methods can be classified into three categories named Boolean models, linear models and non-linear models.



**- Boolean models:** Boolean models treat each enzyme as a Boolean variable that takes the value true or false, representing whether the corresponding enzyme is active or inactive. Each reaction is a Boolean predicate that depends on these variables, and a reaction takes place only if its predicate evaluates to true. Sridhar et al. and Song et al. propose Boolean models of the enzyme knockout strategy (Song et al., 2007; Sridhar et al., 2007; 2008). These methods require the user to supply a list of *targeted compounds* along with a metabolic network. The goal is to identify the set of enzymes whose deletion stops the production of all the targeted compounds while causing minimum *damage*. Here, damage is the number of non-targeted compounds that are eliminated because of the knockouts, and minimum damage is the smallest such number over all possible ways of eliminating the targeted compounds. Sridhar et al. propose an optimal algorithm for this model (Sridhar et al., 2008). Song et al. discuss a heuristic algorithm for finding the enzyme knockout strategy (Song et al., 2007). Klamt and Gilles find the enzymes to knock out by identifying a minimal set of reactions whose deletion leads to an infeasible balanced flux distribution, using a minimum cut approach (Klamt & Gilles, 2004).

**- Linear models:** Boolean models can oversimplify the metabolic network, especially as the number of reactions and their connectivity increase. *Flux Balance Analysis (FBA)* is a popular technique for analyzing the steady state of metabolic networks (Bonarius et al., 1997; Forster et al., 2003; Kauffman et al., 2003). FBA describes a metabolic network as a set of linear equations and finds an optimal steady-state flux distribution that maximizes growth rate under constraints such as mass balance and capacity. FBA successfully describes the metabolic state of the system by predicting growth rate and by-products of the metabolism (Edwards & Palsson, 2000a;b; Kauffman et al., 2003). However, FBA may not accurately predict the metabolic state after gene or enzyme knockouts. Segre et al. present a quadratic programming method named minimization of metabolic adjustment (MOMA) for simulating the resultant state after knockouts (Segre et al., 2002). MOMA attempts to minimize the change between the wild-type flux distribution and the flux distribution after a knockout. MOMA uses linear constraints such as mass balance, capacity, and knockout constraints, which are the same set of constraints used by FBA.


Shlomi et al. describe a mixed integer programming method, named regulatory on/off minimization (ROOM), for predicting the metabolic steady states after gene or enzyme knockouts (Shlomi et al., 2005). ROOM finds the flux distribution that minimizes the number of significant flux changes from the wild-type flux distribution. Experiments demonstrate that MOMA and ROOM are superior to FBA in their ability to predict the resultant states after gene or enzyme knockouts. Optknock is an enzyme knockout strategy based on the FBA model (Burgard et al., 2003). It uses a bi-level programming framework for identifying the enzymes to be knocked out. In the inner level, the optimization finds the flux distribution for a given cellular objective, such as maximization of biomass yield or minimization of metabolic adjustment (MOMA) (Alper et al., 2005; Segre et al., 2002). In the outer level, the optimization finds the enzymes to be knocked out to optimize a biological objective (e.g., chemical production). OptReg is another bi-level programming method for the enzyme knockout strategy (Pharkya & Maranas, 2006). The difference between Optknock and OptReg is that the Optknock framework considers only two states (knockout vs. non-knockout) for each reaction, which are controlled by enzymes, whereas OptReg considers three sets of binary variables for each reaction. These correspond to knockout or non-knockout and down regulation or up regulation. Thus, OptReg provides more candidate manipulation solutions for enzymes. OptStrain, a MILP-based method, identifies desired phenotypes by adding or deleting genes or enzymes (Pharkya et al., 2004). All of the above methods use a MILP or a quadratic programming method. Although the objective function of these methods may not be linear, the constraints are linear.

**- Non-linear models:** Although less prevalent, non-linear models are also used to describe metabolic networks. They incorporate further details about the network and thus can simulate the cell system better than the linear model. S-systems (Savageau & Voit, 1987; Voit, 2000) and the GMA model (Peschel & Mende, 1986; Voit, 2000) are two examples of non-linear models for metabolic networks. Song et al. propose methods for these non-linear models (Song et al., 2011). Patil et al. present an evolutionary programming method that can be applied to non-linear models (Patil et al., 2005). These heuristic solutions use non-linear models with non-linear constraints and are not guaranteed to produce optimal solutions. Also, the non-linear constraint models require additional information about the network, which may not be available. Therefore, in this chapter we still focus on the linear constraint model.
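The difference between the MOMA and ROOM objectives mentioned under the linear models can be illustrated with a small sketch. The flux vectors and the fixed tolerance `tol` below are hypothetical, and the change count is a simplification: the published ROOM method defines its own significance thresholds, which are not reproduced here.

```python
def moma_distance(v_wild, v_knock):
    """Squared Euclidean distance between two flux vectors (MOMA-style objective)."""
    return sum((a - b) ** 2 for a, b in zip(v_wild, v_knock))


def room_changes(v_wild, v_knock, tol=1e-6):
    """Number of reactions whose flux differs by more than tol (ROOM-style objective).

    tol is an illustrative fixed threshold, not the threshold rule of ROOM itself.
    """
    return sum(1 for a, b in zip(v_wild, v_knock) if abs(a - b) > tol)


v_wild = [10.0, 6.0, 4.0, 0.0]   # hypothetical wild-type fluxes
cand_a = [9.0, 5.0, 3.0, 1.0]    # every flux shifts slightly
cand_b = [10.0, 0.0, 4.0, 6.0]   # two fluxes shift strongly (a reroute)

for name, cand in (("a", cand_a), ("b", cand_b)):
    print(name, moma_distance(v_wild, cand), room_changes(v_wild, cand))
```

In this toy case the MOMA-style distance favors candidate `a`, while the ROOM-style count favors candidate `b`, which is why the two criteria can predict different post-knockout states.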

The methods described in this chapter use a linear model. Our major contribution, as discussed already, is to allow multiple enzymes to catalyze a reaction. This significantly extends the usability of such methods, as in real networks more than one enzyme can catalyze a reaction.
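To make the notion of damage in the Boolean models above concrete, the following sketch computes it for a toy network. It is an illustration under stated assumptions, not the algorithm of the cited works: the network, the function names and the rule that a reaction fires once all of its substrates are available are all hypothetical.

```python
def producible(reactions, seeds, active):
    """Fixed-point set of compounds producible from seeds using the active enzymes."""
    have = set(seeds)
    changed = True
    while changed:
        changed = False
        for enzyme, substrates, products in reactions:
            fires = enzyme in active and set(substrates) <= have
            if fires and not set(products) <= have:
                have |= set(products)
                changed = True
    return have


def damage(reactions, seeds, enzymes, knockouts, targets):
    """Damage of a knockout set, or None if some targeted compound is still produced."""
    before = producible(reactions, seeds, set(enzymes))
    after = producible(reactions, seeds, set(enzymes) - set(knockouts))
    if set(targets) & after:
        return None
    # Non-targeted compounds that were producible before but not after.
    return len((before - after) - set(targets))


# Hypothetical network: (enzyme, substrates, products)
reactions = [
    ("E1", ["A"], ["B"]),
    ("E2", ["B"], ["C"]),
    ("E3", ["B"], ["T"]),
    ("E4", ["A"], ["D"]),
]
enzymes = ["E1", "E2", "E3", "E4"]

print(damage(reactions, ["A"], enzymes, ["E3"], ["T"]))
print(damage(reactions, ["A"], enzymes, ["E1"], ["T"]))
print(damage(reactions, ["A"], enzymes, ["E2"], ["T"]))
```

Here knocking out `E3` removes only the targeted compound `T`, knocking out `E1` also removes `B` and `C`, and knocking out `E2` leaves `T` in production.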

#### **3. Problem formulation**

Given a metabolic network and an objective function, one standard way to find the optimal set of enzyme knockouts is to solve the problem as an MILP which is modeled using FBA. In this section, we focus on the MILP formulation of the enzyme knockout problem and prove that the problem is NP-hard even when a single enzyme catalyzes each reaction.


#### **3.1 Formulation**

Given a set *N* = {1, ..., *N̄*} of *N̄* metabolites and a set *M* = {1, ..., *M̄*} of *M̄* metabolic reactions, our goal is to determine the maximum yield of the desired products in a metabolic network while minimizing the enzyme knockout costs. We summarize the decision variables as follows:

*vj* : the flux of reaction *j*;

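*yj* : binary variable that equals 0 if an enzyme in reaction *j* is knocked out, and 1 otherwise.

Other relevant parameters used in this problem are:

*lj* : minimum possible flow corresponding to flux *j*;

*uj* : maximum possible flow corresponding to flux *j*;

*hj* : cost of blocking the enzyme corresponding to reaction *j*;

*wj* : weight corresponding to the value of flux *j*;

*Sij* : stoichiometric matrix coefficient of metabolite *i* in reaction *j*.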


Here, *lj* and *uj* are estimated by minimizing and maximizing every reaction flux subject to the constraints from the *enzyme knockout flux balance model (EKFB)* framework given below.

Let *I* be a set of external metabolites that are imposed on the pathway, and *J* be the set of metabolites that will not be used within the pathway once they are produced. We denote the flux of the source metabolites in the metabolic pathway by *bi* and the flux of the sink metabolites by *ci*.

Given these variables and parameters, we represent the integer programming formulation for EKFB as follows:

$$\max \sum\_{j \in M} w\_j v\_j - \sum\_{j \in M} h\_j (1 - y\_j) \tag{1}$$

$$\text{s.t.} \qquad \sum\_{j \in M} S\_{ij} v\_j = \begin{cases} -b\_i & \text{if } i \in I; \\ c\_i & \text{if } i \in J; \\ 0 & \text{if } i \in N \backslash \{J \cup I\}. \end{cases} \tag{2}$$

$$l\_j y\_j \le v\_j \le u\_j y\_j \qquad \forall j \in M \tag{3}$$

$$\sum\_{j \in M} (1 - y\_j) \le K \tag{4}$$

$$y\_j \in \{0, 1\} \qquad \forall j \in M. \tag{5}$$

The objective function (1) maximizes the weighted flux less the fixed charge corresponding to the enzyme knockouts. Constraint (2) provides the flux balance equations defined by the stoichiometric matrix. Constraint (3) includes the fixed charge variable *yj*. If the enzyme corresponding to reaction *j* is knocked out, the value of the flux is set to zero and a fixed charge *hj* for knocking out the enzyme is imposed. If the fixed charge variable *yj* takes value 1, then the lowest flux value is *lj* while the highest possible flux value is *uj*. Constraint (4) imposes the condition that the maximum number of knockouts is *K*. Constraints (5) enforce integrality on the fixed charge variables. Similar formulations are provided in Burgard et al. (Burgard et al., 2003), Covert et al. (Covert et al., 2001) and Palsson (Palsson, 2000).
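As a reading aid for the formulation (1)-(5), the sketch below checks a candidate pair (*v*, *y*) against each constraint and returns the objective value. It is a checker, not a solver. The three-reaction network, the numerical values and the convention that `rhs` holds the right-hand side of constraint (2) (that is, −*bi*, *ci* or 0 per metabolite) are assumptions made for illustration.

```python
def ekfb_objective(S, rhs, l, u, w, h, K, v, y, tol=1e-9):
    """Return objective (1) for candidate (v, y), or None if a constraint fails."""
    n_rxn = len(v)
    # Constraint (2): flux balance, one row of S per metabolite.
    for row, target in zip(S, rhs):
        if abs(sum(row[j] * v[j] for j in range(n_rxn)) - target) > tol:
            return None
    # Constraint (3): l_j * y_j <= v_j <= u_j * y_j.
    for j in range(n_rxn):
        if not (l[j] * y[j] - tol <= v[j] <= u[j] * y[j] + tol):
            return None
    # Constraints (5) and (4): binary y and at most K knockouts.
    if any(yj not in (0, 1) for yj in y) or sum(1 - yj for yj in y) > K:
        return None
    gain = sum(w[j] * v[j] for j in range(n_rxn))
    charge = sum(h[j] * (1 - y[j]) for j in range(n_rxn))
    return gain - charge


# Hypothetical network: r0: A -> B, r1: B -> P, r2: B -> Q
S = [
    [-1, 0, 0],   # A (external source, in I)
    [1, -1, -1],  # B (internal)
    [0, 1, 0],    # P (sink, in J)
    [0, 0, 1],    # Q (sink, in J)
]
rhs = [-10.0, 0.0, 10.0, 0.0]  # -b_A, 0, c_P, c_Q
l, u = [0.0, 0.0, 0.0], [10.0, 10.0, 10.0]
w, h, K = [0.0, 1.0, 0.0], [1.0, 1.0, 1.0], 1

print(ekfb_objective(S, rhs, l, u, w, h, K, [10.0, 10.0, 0.0], [1, 1, 0]))
print(ekfb_objective(S, rhs, l, u, w, h, K, [10.0, 10.0, 0.0], [1, 0, 0]))
```

The first candidate knocks out reaction `r2` and is feasible; the second sets *y* = 0 for a reaction that still carries flux, so constraint (3) rejects it.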

#### **3.2 NP completeness**



To prove that finding the enzymes to knock out by EKFB is NP-hard, we show that the uncapacitated fixed charge network flow problem, which is NP-hard, is a special case of the EKFB (Ng & Rardin, 1996). Let *G* = (*V*, *A*) be a directed graph, where *V* is the set of nodes, *A* is the set of arcs, *s* ∈ *V* is the single source node, *T* ⊆ *V* is a collection of sink vertices and *dt* > 0 is the demand for node *t*. Let *xij* denote the flow on arc (*i*, *j*) with a cost *cij*. Let the variable *zij* be equal to 1 if arc (*i*, *j*) is selected with a fixed cost *fij* and 0 otherwise. We then define the *uncapacitated fixed charge network flow problem (UFNF)* as the problem of finding a set of arcs that allow a supply node to send resources to a set of demand nodes, such that the sum of fixed and variable costs is minimized. UFNF can be formulated using the following mixed-integer program:

$$\min \sum\_{(i,j)\in A} f\_{ij} z\_{ij} + \sum\_{(i,j)\in A} c\_{ij} x\_{ij} \tag{6}$$

$$\text{s.t.} \qquad \sum\_{(i,k)\in A} x\_{ik} - \sum\_{(k,j)\in A} x\_{kj} = \begin{cases} -\sum\_{t\in T} d\_t & \text{if } k=s;\\ d\_k & \text{if } k\in T;\\ 0 & \text{if } k\in V\backslash\{T\cup s\}. \end{cases} \tag{7}$$

$$x\_{ij} \le \lambda z\_{ij} \qquad \forall \ (i, j) \in A \tag{8}$$

$$x\_{ij} \ge 0 \qquad \forall \ (i, j) \in A \tag{9}$$

$$z\_{ij} \in \{0, 1\} \qquad \forall \ (i, j) \in A. \tag{10}$$

The objective function (6) minimizes the sum of the fixed costs associated with selecting arc (*i*, *j*) and the variable costs for sending flow through (*i*, *j*). Constraints (7) are classical flow conservation constraints. Constraints (8) ensure that there cannot be any flow if *zij* is 0, and that the flow is at most *λ* if *zij* is 1. Constraints (9) and (10) ensure that *xij* is nonnegative and *zij* is binary, respectively.

**Theorem 1.** *Finding the enzyme knockout strategy by EKFB is NP-Hard.*

**Proof:** Let *N′* be the set of metabolites and *M′* be the set of reactions in a special-case metabolic pathway in EKFB. We model it as a network graph *G′* = (*N′*, *M′*), where each node represents a metabolite *i* ∈ *N′* and each arc represents a reaction *k* ∈ *M′* using metabolite *i* to produce metabolite *j*.

For *i* ∈ *N′* and *k* ∈ *M′*, we redefine the stoichiometric matrix *Sik* as *S′ij* (*i*, *j* ∈ *N′*) such that (*i*, *j*) represents reaction *k* as follows:

$$S\_{ij}^{'} = \begin{cases} 1 & \text{if } (i, j) \in M^{'}; \\ -1 & \text{if } (j, i) \in M^{'}; \\ 0 & \text{otherwise}. \end{cases} \tag{11}$$

Note that *S′ij* has entries 1, −1, and 0, and thus is a special case of *Sik*. We also define a new variable *v̄ij* as the flux corresponding to reaction *k* ∈ *M′*. Let *I′* ⊆ *I* be a set of external metabolites that are imposed on the pathway, and *J′* ⊆ *J* be the set of metabolites that will not be used within the pathway after they are produced. Let us define a parameter *b̄i* such that *bi* = *b̄i* and *ci* = *b̄i* for each *i* ∈ *N′*. By using the stoichiometric matrix *S′ij* and the new variables *v̄ij*, the constraint (2) can be written as,


$$\sum\_{(i,l)\in M'} \bar{v}\_{il} - \sum\_{(l,j)\in M'} \bar{v}\_{lj} = \begin{cases} -\bar{b}\_l & \text{if } l \in I'; \\ \bar{b}\_l & \text{if } l \in J'; \\ 0 & \text{if } l \in N' \backslash \{I' \cup J'\}. \end{cases} \tag{12}$$

We now define a binary variable *z̄ij* for each variable *yk*, which assumes value 1 if the arc (*i*, *j*) is selected and 0 otherwise. We define costs *c̄ij* and *f̄ij* such that *c̄ij* = −*wk* and *f̄ij* = *hk*. Finally, we define a constant *λ̄* as *λ̄* = *uk*, and set *lk* = 0 for each reaction *k* ∈ *M′*, which is defined by the arc (*i*, *j*). Then, the constraint (3) can be written as,

$$0 \le \bar{v}\_{ij} \le \bar{\lambda} \bar{z}\_{ij} \qquad \forall \ (i, j) \in M' \tag{13}$$

with an objective function,

$$\min \sum\_{(i,j)\in M'} \overline{c}\_{ij}\overline{v}\_{ij} + \sum\_{(i,j)\in M'} \overline{f}\_{ij}\overline{z}\_{ij} \tag{14}$$

Thus, a special case of EKFB with an objective function (14) and constraints (11), (12), (13) and *z*¯*ij* ∈ {0, 1} is a UFNF and hence EKFB is NP-Hard.

#### **4. Methods for multiple enzymes**

In this section, we develop a more general version of EKFB where we allow multiple enzymes to catalyze a reaction. This extension improves the applicability of our methods, as in real networks more than one enzyme can catalyze a reaction. In particular, we focus on the constraints (3) and model the possible interactions between enzymes regarding the reactions they catalyze.

Let *Ei* be a Boolean variable that denotes whether the *i*th enzyme is active (i.e., *Ei* = *true*) or inhibited (i.e., *Ei* = *false*). As discussed earlier, in EKFB, we assume that a reaction can be catalyzed only by a single enzyme. We use the Boolean variable *yi* which is equal to 1 if an enzyme is active, and 0 otherwise.

Let us denote the set of variables for the enzymes that are involved in catalyzing the *i*th reaction by ℰ*i* ⊆ {*E*1, *E*2, ··· , *EM*}. For simplicity, we will use the notation ℰ*i* = {*Eij* | *Eij* ∈ {*E*1, *E*2, ··· , *EM*}} to denote this set. Let *Fi* be a function on {0, 1}<sup>|ℰ*i*|</sup> representing the relationship between the enzymes for the *i*th reaction. This function takes ℰ*i* as input and produces an integer: it evaluates to 1 if the *i*th reaction takes place according to the values of the variables in ℰ*i*, and to 0 otherwise. Also, let the constants *li* and *ui* represent the minimum and the maximum flux values. We write the second set of constraints as:

$$l\_i F\_i \le v\_i \le u\_i F\_i. \tag{15}$$

Depending on the association between the enzymes that catalyze a reaction, we formulate *Fi* for three different scenarios.

• A topology consisting only of substitute enzymes that catalyze any reaction. Each reaction may be catalyzed by a single enzyme or a set of enzymes based on the *OR* association, i.e., only one of the enzymes needs to be present to catalyze the reaction (Section 4.1).

• A topology consisting only of collaborative enzymes that catalyze any reaction. Each reaction may be catalyzed by a single enzyme or a set of enzymes based on the *AND* association, i.e., all of the enzymes need to be present to catalyze the reaction (Section 4.2).

• A complex topology in which multiple enzymes related by a combination of *OR* and *AND* may catalyze a reaction (Section 4.3).


Shlomi et al. present a way of replacing Boolean expressions that contain two Boolean variables with linear inequalities (Shlomi et al., 2007). However, as the number of Boolean variables grows, the number of additional variables required by this method grows rapidly, making the problem nontrivial. In the following sections we discuss two alternative strategies to deal with each of these three scenarios. We name these strategies the *Binary Method* and the *Continuous Method*. The former introduces additional Boolean variables. The latter avoids the addition of Boolean variables, but comes at the expense of additional constraints. We discuss these in detail in the following sections.

#### **4.1 MILP solution in the presence of substitute enzymes**

In this section, we consider the case when all the enzymes that catalyze the same reaction can substitute each other. In this case, the presence of at least one of the substitute enzymes is sufficient to carry out the corresponding reaction. Let ℰ*i* = {*Eij* | *Eij* ∈ {*E*1, *E*2, ··· , *EM*}} denote a set of variables representing the substitute enzymes for reaction *i* (i.e., flux *vi*). Then we write the function *Fi* that governs the relationship between the variables in ℰ*i* as:

$$F\_i = \max\_{E\_{ij} \in \mathcal{E}\_i} \{ E\_{ij} \}$$

Thus the constraint (15) becomes:


$$l\_i \max\_{E\_{ij} \in \mathcal{E}\_i} \{ E\_{ij} \} \le v\_i \le u\_i \max\_{E\_{ij} \in \mathcal{E}\_i} \{ E\_{ij} \}. \tag{16}$$

We address the problem of nonlinearity in constraint (16) by performing a variable transformation, which leads to a set of linear constraints. We solve them using traditional MILP solution techniques such as the simplex method.

Our linearization technique considers lower and upper bounds separately. We linearize the lower bounding constraints given by the inequality *li* max<sub>*Eij*∈ℰ*i*</sub> {*Eij*} ≤ *vi* as follows,

$$l\_i E\_{ij} \le v\_i \qquad \forall \ E\_{ij} \in \mathcal{E}\_i. \tag{17}$$

Linearization of the upper bounding constraints given by the inequality *vi* ≤ *ui* max<sub>*Eij*∈ℰ*i*</sub> {*Eij*} is more complex than that of the lower bound. For the linearization, we consider two approaches, namely the binary and continuous methods.

**Binary method:** In this method, we propose the following linear constraints in order to enforce binary restrictions on *Fi* (i.e., *Fi* ∈ {0, 1}):

$$F\_i \ge \frac{\sum\_j E\_{ij}}{n} \qquad \forall i \tag{18a}$$

$$F\_i \le \sum\_j E\_{ij} \qquad \forall i \tag{18b}$$

$$F\_i \in \{0, 1\} \qquad \forall i \tag{18c}$$
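That constraints (18a)-(18c) pin *Fi* to the maximum of the binary enzyme variables can be checked by exhaustive enumeration for small sets. The sketch below assumes that *n* in (18a) is the number of enzymes in ℰ*i*; the function name is hypothetical.

```python
from itertools import product


def feasible_F(E):
    """Binary values of F that satisfy (18a)-(18c) for a 0/1 tuple E."""
    n = len(E)          # assumption: n is the number of substitute enzymes
    total = sum(E)
    # (18a) F >= total / n is tested as F * n >= total to stay in integers.
    return [F for F in (0, 1) if F * n >= total and F <= total]


# For every assignment of up to four enzymes, the only feasible F is max(E).
for n in range(1, 5):
    for E in product((0, 1), repeat=n):
        assert feasible_F(E) == [max(E)]

print(feasible_F((0, 0, 0)), feasible_F((0, 1, 0)))
```

Constraint (18a) forces *Fi* = 1 as soon as one enzyme is active, and (18b) forces *Fi* = 0 when none is, so no other binary value remains feasible.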


**Continuous method:** In this method, we define *Fi* using a continuous variable that takes value in the real domain. We replace the upper bound constraint *vi* ≤ *ui* max*Eij*∈E*<sup>i</sup>* {*Eij*} with the following linear constraints:

$$F\_i \le \sum\_j E\_{ij} \tag{19a}$$

$$F\_i \le 1 \tag{19b}$$

$$F\_i \ge E\_{ij} \tag{19c}$$

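Constraints (19a)-(19c) force *Fi* to assume a binary value even though we do not directly impose binary restrictions on it: if every *Eij* is 0, then (19a) and (19c) give *Fi* = 0, and if some *Eij* is 1, then (19b) and (19c) give *Fi* = 1.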
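As a sanity check on the substitute-enzyme case, the following sketch enumerates every 0/1 assignment of the enzyme variables and confirms that constraints (18a)-(18c) and (19a)-(19c) each leave exactly one feasible value, *Fi* = max{*Eij*}. It is a minimal Python illustration, not the chapter's C++/CPLEX implementation. The function names are hypothetical, and *n* is assumed to be the number of enzymes associated with the reaction.

```python
from itertools import product


def binary_max_values(E):
    """Values of F in {0, 1} allowed by constraints (18a)-(18c)."""
    n = len(E)  # assumption: n is the number of enzymes of the reaction
    return [F for F in (0, 1) if F >= sum(E) / n and F <= sum(E)]


def continuous_max_interval(E):
    """Interval [lo, hi] allowed for a real-valued F by (19a)-(19c)."""
    lo = max(E)          # (19c): F >= E_ij for every j
    hi = min(sum(E), 1)  # (19a): F <= sum_j E_ij, (19b): F <= 1
    return lo, hi


def check_max_linearization(n):
    """Enumerate all 0/1 enzyme states and compare against max(E)."""
    for E in product((0, 1), repeat=n):
        assert binary_max_values(E) == [max(E)]
        assert continuous_max_interval(E) == (max(E), max(E))
    return True
```

Calling `check_max_linearization(n)` for small `n` passes for every enzyme state, which illustrates why the continuous variable ends up binary without an explicit integrality restriction.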

#### **4.2 MILP solution in the presence of collaborative enzymes**

In this section, we consider the case where multiple enzymes collaborate with each other to catalyze the same reaction. In this case, all the enzymes are necessary for the reaction to initiate. Let E*<sup>i</sup>* = {*Eij*| *Eij* ∈ {*E*1, *E*2, ··· , *EM*}} denote a set of variables representing the collaborative enzymes for reaction *i* (i.e., flux *vi*). We write the function *Fi* that governs the relationship between the variables in E*<sup>i</sup>* as:

$$F\_i = \min\_{E\_{ij} \in \mathcal{E}\_i} \{ E\_{ij} \}.$$

Thus, we write constraint (15) as,

$$l\_i \min\_{E\_{ij} \in \mathcal{E}\_i} \{ E\_{ij} \} \le v\_i \le u\_i \min\_{E\_{ij} \in \mathcal{E}\_i} \{ E\_{ij} \}. \tag{20}$$

As we discussed in Section 4.1, constraint (20) is nonlinear. We linearize this constraint using additional variables. We address lower and upper bounds separately.

First, we focus on the upper bound constraints given by the inequality *vi* ≤ *ui* min*Eij*∈E*<sup>i</sup>* {*Eij*}. We linearize this part without introducing new variables as follows:

$$v\_i \le u\_i E\_{ij} \qquad \forall\ E\_{ij} \in \mathcal{E}\_i \tag{21}$$

The linearization of the lower bound constraints given by the inequality *li* min*Eij*∈E*<sup>i</sup>* {*Eij*} ≤ *vi* is more complicated. Analogous to the substitute enzyme case, we develop both binary and continuous methods presented in the following two sections.

**Binary method:** We have already assumed that *Fi* ∈ {0, 1}. We linearize the nonlinear constraint *li* min*Eij*∈E*<sup>i</sup>* {*Eij*} ≤ *vi* under this assumption as follows:

$$F\_i \le \frac{\sum\_j E\_{ij}}{n} \tag{22a}$$

$$F\_i > \frac{\sum\_j E\_{ij}}{n} - 1 \tag{22b}$$

$$F\_i \in \{0, 1\} \tag{22c}$$

**Continuous method:** For the continuous method, we replace the lower bound constraint *li* min*Eij*∈E*<sup>i</sup>* {*Eij*} ≤ *vi* with the following linear constraints:

$$F\_i \le E\_{ij} \tag{23a}$$

$$F\_i \ge \sum\_j E\_{ij} - (n - 1) \tag{23b}$$

$$F\_i \ge 0 \tag{23c}$$
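The collaborative case can be checked in the same spirit. The sketch below is a minimal Python illustration under stated assumptions: the function names are hypothetical, *n* is taken to be the number of collaborating enzymes, and the upper flux bound `u` is assumed nonnegative. It confirms by enumeration that (22a)-(22c) and (23a)-(23c) each force *Fi* = min{*Eij*}, and that the per-enzyme bounds in (21) are as tight as the original nonlinear upper bound.

```python
from itertools import product


def binary_min_values(E):
    """Values of F in {0, 1} allowed by constraints (22a)-(22c)."""
    n = len(E)  # assumption: n is the number of collaborating enzymes
    return [F for F in (0, 1) if F <= sum(E) / n and F > sum(E) / n - 1]


def continuous_min_interval(E):
    """Interval [lo, hi] allowed for a real-valued F by (23a)-(23c)."""
    n = len(E)
    hi = min(E)                    # (23a): F <= E_ij for every j
    lo = max(sum(E) - (n - 1), 0)  # (23b) and (23c)
    return lo, hi


def flux_upper_bound(u, E):
    """Largest flux v allowed by constraint (21), v <= u * E_ij for all j."""
    return min(u * e for e in E)


def check_min_linearization(n, u=10.0):
    """Enumerate all 0/1 enzyme states and compare against min(E)."""
    for E in product((0, 1), repeat=n):
        assert binary_min_values(E) == [min(E)]
        assert continuous_min_interval(E) == (min(E), min(E))
        assert flux_upper_bound(u, E) == u * min(E)
    return True
```

Knocking out any single collaborating enzyme drives the allowed flux to zero, which is the behavior the min function is meant to capture.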

#### **4.3 MILP solution in the presence of complex association of enzymes**

In this subsection, we generalize the methods described in the previous two subsections in order to allow associations with arbitrary forms. We consider the case when the reaction can be catalyzed by a set of enzymes such that some of them can substitute for each other and others need to work collaboratively.

For example, assume that the *i*th reaction can be catalyzed by two alternative enzyme complexes that can substitute for each other. Also assume that the first and the second of these complexes are formed from two and three enzymes, respectively. The enzymes within each complex collaborate with each other. We formulate this relationship as *Fi* = max { min {*Ei*1, *Ei*2}, min {*Ei*3, *Ei*4, *Ei*5} }.

Using standard rules from Boolean algebra, any Boolean equation can be written in disjunctive or conjunctive normal form. Thus, we transform the equation for each reaction into the following form:

$$F\_i = \max\_{\mathcal{E}\_i^k} \{ \min\_{E\_{ij} \in \mathcal{E}\_i^k} \{ E\_{ij} \} \}. \tag{24}$$

In this equation, E*<sub>i</sub><sup>k</sup>* denotes the *k*th set of collaborative enzymes required by the *i*th reaction. Thus, we have ⋃*<sub>k</sub>* E*<sub>i</sub><sup>k</sup>* = E*<sub>i</sub>*. We define a new binary variable *Z<sub>i</sub><sup>k</sup>* ∈ {0, 1} corresponding to each E*<sub>i</sub><sup>k</sup>* and rewrite Equation (24) as,

$$F\_i = \max\_{\mathcal{E}\_i^k} Z\_i^k. \tag{25}$$

where,


$$Z\_i^k = \min\_{E\_{ij} \in \mathcal{E}\_i^k} \{ E\_{ij} \}. \tag{26}$$

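The methods in Section 4.1 and Section 4.2 are used for constraints (25) and (26), respectively, to linearize the constraint (15): Equation (25) is a max over the *Z* variables (the substitute case), while Equation (26) is a min over the enzymes in each collaborative set.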
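The two-level construction can be illustrated on the example of two alternative complexes with two and three enzymes. The sketch below is an assumption-laden Python illustration, not the chapter's implementation: the function names and the 0-based enzyme indices in `groups` are hypothetical. It applies the binary method of Section 4.2 to each *Z* variable and the binary method of Section 4.1 to *Fi*, then checks by enumeration that the combined constraints admit a single solution equal to the max-of-min expression.

```python
from itertools import product


def z_feasible(E_group, Z):
    """Binary min linearization (Section 4.2) applied to Equation (26)."""
    n = len(E_group)
    s = sum(E_group)
    return Z in (0, 1) and Z <= s / n and Z > s / n - 1


def f_feasible(Zs, F):
    """Binary max linearization (Section 4.1) applied to Equation (25)."""
    m = len(Zs)
    return F in (0, 1) and F >= sum(Zs) / m and F <= sum(Zs)


def feasible_assignments(E, groups):
    """All (Z, F) pairs satisfying the combined linear constraints."""
    found = []
    for Zs in product((0, 1), repeat=len(groups)):
        if not all(z_feasible([E[j] for j in g], Z) for g, Z in zip(groups, Zs)):
            continue
        for F in (0, 1):
            if f_feasible(Zs, F):
                found.append((Zs, F))
    return found


def check_example():
    """Two substitute complexes: enzymes {0, 1} and enzymes {2, 3, 4}."""
    groups = [(0, 1), (2, 3, 4)]
    for E in product((0, 1), repeat=5):
        expected = max(min(E[j] for j in g) for g in groups)
        solutions = feasible_assignments(E, groups)
        assert len(solutions) == 1 and solutions[0][1] == expected
    return True
```

In this toy setting, knocking out one enzyme from each complex is enough to switch the reaction off, whereas knocking out any number of enzymes from a single complex is not.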

#### **5. Experiments**

In this section, we evaluate the performance and the limitations of our methods on real and artificially generated metabolic networks. The synthetic datasets provide us with a controlled simulation environment that allows us to determine the impact of different characteristics of the network on the performance of our algorithms. We evaluate the performance of our methods quantitatively in terms of their execution time (in seconds).


#### **5.1 Datasets**

In our experiments, we used the following real and synthetic datasets.

**- Synthetic datasets:** We randomly generated ten networks of different sizes (given by the number of compounds and the number of reactions). In order to simulate the real networks accurately, we generated these networks so that the number of reactions that involve a compound is distributed according to the power law distribution (Voit, 2000). In other words, the probability that a compound participates in a given number of reactions falls off as a power of that number.

To evaluate the impact of multiple enzymes per reaction on the performance of the algorithms, we generated two types of datasets:

**Single enzyme dataset:** In this dataset, each reaction is catalyzed by only one enzyme. Thus, the number of enzymes is equal to the number of reactions.

**Multiple enzyme dataset:** In this dataset, all the reactions are catalyzed by at least one enzyme. The number of enzymes attached to a reaction is based on the power law distribution: *the probability that a reaction is catalyzed by k enzymes decreases exponentially with k*. Roughly, 40% of the reactions are catalyzed by at least two enzymes; 30% of the reactions are catalyzed by at least three enzymes; 23.5% of reactions are catalyzed by at least four enzymes; 18.5% of reactions are catalyzed by at least five enzymes and 5% of reactions are catalyzed by at least nine enzymes. Based on these probabilities, we build ten synthetic networks for each network size. Section 5.2.1 describes the results for the synthetic datasets.

**- Real dataset:** We use the metabolic pathways of *Homo sapiens* (*H. sapiens*) from KEGG (Kanehisa & Goto, 2000). The entire *H. sapiens* metabolism consists of 640 enzymes, 1176 reactions and 1067 compounds. Section 5.2.2 provides the results for these real datasets.
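The enzyme-count assignment for the synthetic multiple enzyme dataset can be sketched as follows. This is only an illustration in Python: the chapter does not state the exponent or the maximum number of enzymes per reaction, so `alpha`, `k_max`, and the function names are assumptions, and the sketch is not tuned to reproduce the percentages quoted above.

```python
import random


def power_law_weights(k_max, alpha):
    """Unnormalized weights proportional to k ** (-alpha) for k = 1..k_max."""
    return [k ** (-alpha) for k in range(1, k_max + 1)]


def sample_enzyme_counts(num_reactions, k_max=10, alpha=1.5, seed=0):
    """Draw the number of enzymes per reaction from a truncated power law.

    k_max and alpha are hypothetical parameters chosen for illustration.
    Every reaction receives at least one enzyme.
    """
    rng = random.Random(seed)
    ks = list(range(1, k_max + 1))
    return rng.choices(ks, weights=power_law_weights(k_max, alpha), k=num_reactions)


def tail_fraction(counts, k):
    """Fraction of reactions catalyzed by at least k enzymes."""
    return sum(c >= k for c in counts) / len(counts)
```

Comparing `tail_fraction(counts, k)` for several values of `k` against the target percentages is one way to calibrate `alpha` for a given network size.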

**Experiment platform:** We implemented our algorithms in C++ and used ILOG CPLEX 11.2 to find the integer linear programming solutions. We ran our experiments on a Linux system with two 3.2 GHz Pentium 4 processors (1M cache) and 6 gigabytes of RAM.

#### **5.2 Results**

In this section, we evaluate the performance of our algorithms on the synthetic (Section 5.2.1) and real datasets (Section 5.2.2).

#### **5.2.1 Evaluation on synthetic datasets**

Our goal in this section is to evaluate the performance of our algorithm for a variety of network parameters using synthetic datasets. These experiments can be decomposed into two sets as described in the previous subsection, namely, single enzyme dataset and multiple enzyme dataset. For an effective comparison, we use identical topology of reactions and compounds for both the multiple and single enzyme sets. We consider two cases for the multiple enzyme set: a) all multiple enzymes substitute for each other; b) all multiple enzymes collaborate with each other.

**Performance analysis on single enzyme set:** Section 3 proves that finding the enzyme knockout strategy using MILP is NP-Hard. Consider the case when only one enzyme catalyzes a reaction. We conduct our experiments using the MILP formulation for two different settings. In the first setting, the number of compounds is 25% of that of the reactions, while for the second setting, it is 50%. Figure 2 plots the average execution times for networks with different number of reactions.

Fig. 2. The average execution time (in seconds) for the networks on single enzyme set. #R denotes the number of reactions and #C denotes the number of compounds in the network. The execution time grows exponentially as the number of reactions increases for both the cases and can be prohibitive even for a few hundred reactions.

The execution time grows exponentially as the number of reactions increases in both cases and can be prohibitive even for a few hundred reactions. This cost motivates heuristic methods for large networks. We also observe a steep increase in execution time for larger numbers of compounds: for the same number of reactions, doubling the number of compounds increases the overall time by several orders of magnitude. This suggests that heuristic methods that reduce the number of compounds in the constraint set have the potential to improve the execution time of the MILP solutions.

**Performance analysis for multiple enzymes set:** The results in the previous section (along with the NP-hardness of the problem) show that the MILP solution has exponential execution time in terms of the network size. We now study the performance of our two solutions with multiple enzymes per reaction. In this experiment, we measure the running time in the presence of multiple substitute and collaborative enzymes and compare it to that of the single enzyme case. Note that this comparison favors the single enzyme dataset, as it has fewer variables; its execution time should therefore serve as a lower bound for the multiple enzyme cases. We summarize the results as follows:

1. Binary method: Figure 3 depicts the results of our binary method for varying numbers of compounds and reactions. The results demonstrate that the presence of multiple enzymes increases the execution time significantly compared to the case when only a single enzyme catalyzes a reaction. This increase holds for both substitute and collaborative enzymes. The running time for multiple enzymes is two to 16 times that of the single enzyme case. In most of the test cases, collaborative enzymes resulted in a higher increase in execution time.

2. Continuous method: Figure 4 shows the execution time of the multiple enzymes set using the continuous method and that of the single enzyme set. Similar to the binary method, the multiple enzymes set requires much more time than the single enzyme set. As the network size increases, the gap between the execution time of the multiple enzymes set and that of the single enzyme set increases exponentially. This suggests that the presence of multiple enzymes necessitates heuristic solutions for large networks. Also, collaboration among enzymes requires relatively higher execution time than substitution between enzymes in the majority of the experiments.

3. Comparison of the two methods: Recall that the binary method introduces additional binary variables to linearize the constraints. The continuous method only generates additional continuous variables. However, it requires additional constraints. Our experiments (see Figures 3 and 4) demonstrate that the binary method runs at least twice as fast as the continuous method when all multiple enzyme cases are substitutions. When the multiple enzymes collaborate with each other, the gap between the running times of the binary and continuous methods widens further. Therefore, for large networks, the binary method is the preferred choice.

Fig. 3. The average execution time (in seconds) for the networks with single and multiple enzymes. All multiple enzymes cases are either all substitutions or all collaborations. For multiple enzymes set, we use binary method. The results demonstrate that the presence of multiple enzymes increases the execution time significantly as compared to the case when only a single enzyme catalyzes a reaction. (Axes: test cases (#C, #R) versus execution time (sec). Series: single enzyme set; multiple enzyme set - substitute; multiple enzyme set - collaborate.)

Fig. 4. The average execution time (in seconds) for the networks with single and multiple enzymes. All multiple enzymes cases are either all substitutions or all collaborations. For multiple enzymes set, we use continuous method. As the network size increases, the gap between the execution times of the multiple enzymes set and the single enzyme set increases exponentially. (Axes: test cases (#C, #R) versus execution time (sec). Series: single enzyme set; multiple enzyme set - substitute; multiple enzyme set - collaborate.)

#### **5.2.2 Evaluation on the real dataset**

In this section, we evaluate the performance of our algorithm on real metabolic networks taken from the KEGG database. We use the metabolism of *H. sapiens*. Given the superior performance of the binary method over the continuous method (as described in the previous subsection), we limit ourselves to the binary method on the real dataset. We execute the binary method for purine metabolism, metabolism of cofactors and vitamins, amino acid metabolism and the entire metabolism. However, the KEGG database does not provide the details of enzyme association information. Thus, we consider two alternative cases: a) all the enzymes are collaborations, b) all the enzymes are substitutions. Table 1 reports the running time using the binary method. These results show that our method requires less than one second of execution time and hence is scalable to practical network sizes for both cases. Even for the entire metabolism of *H. sapiens*, the execution time is less than half a second. This makes our methods of great practical importance.

It is worth mentioning that the execution times on the real datasets are substantially lower than those on the synthetic datasets. This is because the topology of the real networks is much sparser than the ones we used for our synthetic experiments. Therefore, less time is required to find the flux distribution on the real networks.

Table 1. Execution time in seconds of our binary method for the metabolisms of *H. sapiens* from KEGG. #E, #R and #C denote the number of enzymes, reactions and compounds respectively in the metabolism. The results demonstrate that our method requires less than one second of execution time. Hence it is scalable to practical network sizes for both the cases.

#### **6. Conclusions**

Given a metabolic network and a goal, such as maximizing or minimizing the production of a set of compounds, we considered the problem of computationally determining the optimal enzyme knockouts to modify the production of compounds using the Flux Balance Analysis (FBA) model. We proved that the problem of finding the optimal enzyme set to knock out is NP-hard even when only one enzyme catalyzes a reaction.

We developed two strategies to identify the enzymes to knock out when multiple enzymes catalyze a single reaction, allowing both substitute and collaborative enzymes. The proposed solutions thus remove the single-enzyme limitation. Our first solution uses a small number of binary variables in the underlying MILP formulation. The second method increases the number of binary variables but requires a smaller number of constraints.

Our experiments using synthetic and real datasets demonstrated that adding extra binary variables is significantly superior to adding additional constraints in terms of execution time. For the metabolism consisting of all the pathways of *H. sapiens*, our binary method requires less than one second. This makes our methods of great practical importance.

We believe that the approach presented in this chapter is not limited to MILP based strategies. It should also be applicable to other linear constraint strategies, e.g. quadratic programming, where the objective function is non-linear but the constraints are linear.

#### **7. Acknowledgment**

This work was supported partially by NSF under grants CCF-0829867 and IIS-0845439.

#### **8. References**

Alper, H., Jin, Y., Moxley, J. & Stephanopoulos, G. (2005). Identifying gene targets for the metabolic engineering of lycopene biosynthesis in E. coli, *Metab. Eng.* 7(3).

Bonarius, H. P. J., Schmid, G. & Tramper, J. (1997). Flux analysis of underdetermined metabolic networks: The quest for the missing constraints, *Trends Biotechnology* 15.

16 Will-be-set-by-IN-TECH

sparser than the ones we used for our synthetic experiments. Therefore, less time is required

Metabolism of cofactors and vitamins 90 132 122 0.04 0.05

Table 1. Execution time in seconds of our binary method for the metabolisms of *H. sapiens* from KEGG. #E, #R and #C denote the number of enzymes, reactions and compounds respectively in the metabolism. The results demonstrate that our method requires less than one second of execution time. Hence it is scalable to practical network sizes for both the

Given a metabolic network and a goal, such as maximizing or minimizing the production of a set of compounds, we considered the problem of computationally determining the optimal enzyme knockouts to modify the production of compounds using the Flux Balance Analysis (FBA) model. We proved that the problem of finding the optimal enzyme set to knockout is

We developed two strategies to identify the enzymes to knockout, when multiple enzymes catalyze a single reaction. We allowed multiple substitute and collaborative enzymes. In the proposed solutions, we eliminate this limitation of single enzyme. Our first solution uses a small number of binary variables in the underlying MILP formulation. The second method increases the number of binary variables but requires a smaller number of constraints. Our experiments using synthetic and real datasets demonstrated that adding extra binary variables is significantly superior to adding additional constraints in terms of execution time. For the metabolism consisting of all the pathways of H. sapiens, our binary method requires

We believe that the approach presented in this chapter is not limited to MILP based strategies. It should also be applicable to other linear constraint strategies, e.g. quadratic programming,

This work was supported partially by NSF under grants CCF-0829867 and IIS-0845439.

Alper, H., Jin, Y., Moxley, J. & Stephanopoulos, G. (2005). Identifying gene targets for the metabolic engineering of lycopene biosynthesis in E. coli, *Metab. Eng.* 7(3). Bonarius, H. P. J., Schmid, G. & Tramper, J. (1997). Flux analysis of underdetermined metabolic networks: The quest for the missing constraints, *Trends Biotechnology* 15.

less than one second. This makes our methods of great practical importance.

where the objective function is non-linear but the constraints are linear.

Pathway #E #R #C Collaborative Substitute

Purine metabolism 52 92 65 0.07 0.13

Amino acid metabolism 195 317 305 0.05 0.06 the entire metabolism 640 1176 1067 0.38 0.28

to find the flux distribution on the real networks.

NP-hard even when only one enzyme catalyzes a reaction.

cases.

**6. Conclusions**

**7. Acknowledgment**

**8. References**


**Part 6** 

**Genome Analysis** 



18 Will-be-set-by-IN-TECH

370 Bioinformatics – Trends and Methodologies

Sridhar, P., Kahveci, T. & Ranka, S. (2007). An iterative algorithm for metabolic network-based

Sridhar, P., Song, B., Kahveci, T. & Ranka, S. (2008). OPMET: A metabolic network-based

Voit, E. O. (2000). *Computational analysis of biochemical systems: a practical guide for biochemists*

algorithm for optimal drug target identification, *Pacific Symposium on Biocomputing* .

drug target identification, *Pacific Symposium on Biocomputing* .

*and molecular biologists*, Cambridge University Press.

**18** 


### **Using Bacterial Artificial Chromosomes to Refine Genome Assemblies and to Build Virtual Genomes**

Abhirami Ratnakumar1,2, Wesley Barris1,3, Sean McWilliam1 and Brian P. Dalrymple1

*1CSIRO Livestock Industries, St. Lucia, QLD, Australia; 2now at the Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden; 3now at Cobb-Vantress, Siloam Springs, Arkansas, USA*

#### **1. Introduction**

Recent years have seen an explosion in the sequencing of genomes, including those of ruminants. A number of assemblies of the sequence of the bovine genome are now available (Elsik, et al., 2009; Zimin, et al., 2009). Although the sheep genome sequence is not such a high priority, the International Sheep Genomics Consortium (ISGC\_website) has a long-term strategy to develop a number of tools for the application of genomics in sheep research and breeding (Archibald, et al., 2010). We have demonstrated recently how comparative genomics and Bacterial Artificial Chromosome (BAC) libraries can be used to construct detailed virtual genomes as a framework for genome assemblies of related species (Dalrymple, et al., 2007). As new and improved assemblies of the genomes contributing to an initial virtual genome assembly are produced, the virtual genomes will need to be regularly updated to incorporate the latest available information.

In the original analysis, three genomes (bovine, dog and human) with various levels of coverage and stages of assembly were used (Dalrymple, et al., 2007). With the availability of increasing numbers of assemblies, neither the benefit of using more than three genomes nor the most appropriate evolutionary distances of the genomes is immediately clear. Here we describe the construction of a modified version of the bovine Btau3.1 assembly using cattle and sheep BACs, and the use of this assembly in the construction of an updated virtual sheep genome, combining information from the original sheep virtual genome (vsg 1.2) and the horse (Wade, et al., 2009) and dog (Lindblad-Toh, et al., 2005) genomes. The impact of the inclusion of additional genome sequences is analysed. The approach described here for sheep can be applied more broadly to genomes of any source, for example the fish species tilapia (Soler, et al., 2010) and catfish (Liu, et al., 2009). Indeed, the same principles also apply to the detection of differences between different individuals of the same species.


#### **2. Materials and methods**

#### **2.1 Data sources and sequence search parameters**

All genome sequences, except Btau3.#x versions, were downloaded from the UCSC comparative genomics website (UCSC; Fujita, et al., 2011). The full set of BAC-end sequences (BESs) from the CHORI-243 sheep BAC library, deposited in GenBank with the accession numbers CL632218-CL639051, CZ920079-CZ926973 and DU169919-DU532729 (Dalrymple, et al., 2007), was filtered to remove duplicate sequences and to identify the set of high confidence BACs (Ratnakumar, et al., 2010a). The filtered set of sheep BESs was aligned to the lower case masked versions of the bovine genome assembly (Btau3.1) and the revised bovine genome assembly (Btau3.5x) using MegaBLASTn with the following optimised parameters: -r 1 -q -1 -X 40 -W 8, as previously described (Ratnakumar, et al., 2010b). The filtered set of sheep BESs was also aligned to the lower case masked versions of the dog genome sequence assembly (canFam2), the horse genome sequence assembly (equCab1) and the human genome sequence assembly (hg17) using BLASTn with the following parameters: -W 7 -r 17 -q -21 -G 29 -E 22 -X 240 -e 1 -f 280 -F m -U T and -z 3076781887 (human) and -z 2531657226 (dog), as previously described (Dalrymple, et al., 2007). No cut-offs were applied to the BLAST output, except that for each sheep BES only the best hit from each of the genomes was used for the next steps. If a BES had two hits with equal scores to the same genome assembly, the hit on the same chromosome as the best hit for the BES from the other end of the BAC was retained. If more than two hits with equal scores were obtained, the BES hit was discarded.
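The best-hit and tie-breaking rule described above can be sketched as follows. This is a minimal illustration, not the authors' Perl code: the `Hit` record and the `pick_hit` function are hypothetical names, and the behaviour when two tied hits cannot be separated by the mate's chromosome is not stated in the text, so discarding the BES in that case is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Hit:
    """One alignment of a BES to a genome (hypothetical record layout)."""
    chrom: str
    start: int
    score: float


def pick_hit(hits: Sequence[Hit], mate_chrom: Optional[str]) -> Optional[Hit]:
    """Return the hit to keep for one BES on one genome, or None to discard it.

    mate_chrom is the chromosome of the best hit of the BES from the other
    end of the same BAC (None if that end has no hit).
    """
    if not hits:
        return None
    top = max(h.score for h in hits)
    best = [h for h in hits if h.score == top]
    if len(best) == 1:
        return best[0]
    if len(best) == 2:
        # Two equal-scoring hits: keep the one on the mate's chromosome.
        same = [h for h in best if h.chrom == mate_chrom]
        # Assumption: if the mate does not resolve the tie, discard the BES.
        return same[0] if len(same) == 1 else None
    # More than two equal-scoring hits: the BES hit is discarded.
    return None
```

Passing the mate's chromosome as an argument keeps the rule independent of how the two ends of a BAC are paired up in the surrounding pipeline.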

The BESs from the CHORI-240 cattle BAC library, GenBank accession numbers BZ830806-BZ891831, BZ896446-BZ956676, CC447354-CC447937, CC466118-CC470858, CC470880-CC596504, CC761663-CC775995, CG917936-CG918393, CG976420-CG992944, CL603252-CL610093, CW848133-CW848163 and CZ012846-CZ027312 (Snelling, et al., 2007), were aligned to the cattle and virtual sheep genome sequences using BLAT (Kent, 2002).

Bovine genome assembly Btau3.1 sequence contigs were aligned to the human, dog and horse genomes using MegaBLASTn as described above.

#### **2.2 Genome coordinate conversion**

The coordinates from the mapping of the sheep BESs to the dog and human genomes were converted to the framework of the bovine genome assembly Btau3.1 using the LiftOver utility (LiftOver; Fujita, et al., 2011) and the canFam2 to Btau3.1 and hg17 to Btau3.1 coordinate conversion chain files respectively, also downloaded from the UCSC genome bioinformatics site (UCSC; Fujita, et al., 2011). If the initial application of LiftOver was not successful for a region of the genome, regions of 100 bases on either side of the BAC-end sequence were taken and positioned using LiftOver (pseudoliftOver). If this was again unsuccessful, the process was repeated in steps of 100 bases until the LiftOver utility was applied successfully to a region, or a distance of 10 kb was reached (Dalrymple, et al., 2007).
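The pseudoliftOver fallback can be read as a search over flanking windows that move outwards from the BES. The sketch below is one interpretation under stated assumptions: `lift` is a hypothetical caller-supplied wrapper around the LiftOver utility that returns `None` on failure, coordinates are 0-based half-open, and the left flank is tried before the right flank at each distance (the text does not specify an order).

```python
from typing import Callable, Optional, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open
# Hypothetical wrapper around LiftOver: returns the converted region or None.
Lift = Callable[[Region], Optional[Region]]


def pseudo_liftover(
    region: Region, lift: Lift, step: int = 100, max_dist: int = 10_000
) -> Optional[Tuple[Region, int]]:
    """Convert a region, falling back to flanking windows of `step` bases.

    Returns (converted_region, distance_used), where distance_used is 0 for
    a direct conversion, or None if nothing converts within max_dist.
    """
    direct = lift(region)
    if direct is not None:
        return direct, 0
    chrom, start, end = region
    for dist in range(step, max_dist + 1, step):
        # Windows of `step` bases ending `dist - step` bases from the BES.
        left = (chrom, max(0, start - dist), max(0, start - dist + step))
        right = (chrom, end + dist - step, end + dist)
        for flank in (left, right):
            if flank[1] < flank[2]:  # skip windows clipped to zero length
                mapped = lift(flank)
                if mapped is not None:
                    return mapped, dist
    return None
```

Returning the distance alongside the converted region lets downstream steps record how approximate each converted position is.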

Coordinate conversion (chain) files able to be read by the LiftOver utility to convert bovine genome assembly Btau3.1 coordinates to bovine genome assembly Btau3.#x version coordinates were built based on the revised order of Btau3.1 contigs and scaffolds in Btau3.#x version. Similarly a coordinate conversion file to convert Btau3.5x coordinates to virtual sheep genome assembly coordinates was built based on the order of Btau3.5x scaffolds in the virtual sheep genome.

#### **2.3 Assigning BACs to groups and building BAC contigs**

Using a series of Perl scripts, BACs were assigned to groups ("tail-to-tail", "tail-to-head" etc.) on the basis of the relative orientations of the two BESs from each BAC on the relevant genome assembly and the distance between the BESs. "Outsize" BACs were those with the two BESs mapped to the same chromosome in the relevant genome assembly and less than 10 kb, or more than 200 kb, apart. BACs with both BESs mapped to the genome, but to two different chromosomes, were assigned to the "breaks" group. BACs with only one BES mapped to the genome were assigned to the "unpaired" group.
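The grouping rules can be expressed as a small decision function. The following is a sketch only: the `BesHit` record is hypothetical, the "unmapped" label for BACs with neither end placed is an addition not mentioned in the text, and the mapping from strand combinations to the "tail-to-tail", "tail-to-head" and "head-to-head" names is an assumption (here, BESs that read towards each other are treated as tail-to-tail), since the chapter does not define the orientations explicitly.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SPAN = 10_000    # BESs closer than this are "outsize"
MAX_SPAN = 200_000   # BESs further apart than this are "outsize"


@dataclass(frozen=True)
class BesHit:
    """Mapped position of one BAC-end sequence (hypothetical layout)."""
    chrom: str
    pos: int
    strand: str  # "+" or "-"


def classify_bac(a: Optional[BesHit], b: Optional[BesHit]) -> str:
    """Assign a BAC to a group from the mapping of its two BESs."""
    if a is None and b is None:
        return "unmapped"          # not a group named in the text
    if a is None or b is None:
        return "unpaired"
    if a.chrom != b.chrom:
        return "breaks"
    if not MIN_SPAN <= abs(a.pos - b.pos) <= MAX_SPAN:
        return "outsize"
    left, right = sorted((a, b), key=lambda h: h.pos)
    # Assumed orientation naming; adjust to the convention actually used.
    if left.strand == "+" and right.strand == "-":
        return "tail-to-tail"
    if left.strand == "-" and right.strand == "+":
        return "head-to-head"
    return "tail-to-head"
```

Checking chromosome and distance before orientation mirrors the order in which the groups are described, so every BAC receives exactly one label.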

BAC-comparative genomic contigs (BAC-CGCs) were constructed for the BACs from each species mapped to each genome assembly using Perl scripts to process the data (Dalrymple, et al., 2007). Starting from the beginning of each chromosome, the first BAC that overlapped with a second BAC was identified, and the BAC-CGC was then extended until no further overlapping BACs were identified. This process was repeated along the chromosome until the last BAC mapped on the chromosome was reached, and then for each chromosome in the genome assembly.
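In computational terms, the contig construction amounts to merging overlapping intervals along a chromosome. A minimal sketch is given below, assuming each BAC has already been reduced to a (name, start, end) interval on one chromosome; the function name, the tuple layout and the `min_bacs` parameter are illustrative choices, and requiring at least two BACs per contig is an interpretation of "the first BAC that overlapped with a second BAC".

```python
from typing import Iterable, List, Tuple

Bac = Tuple[str, int, int]               # (name, start, end) on one chromosome
Contig = Tuple[int, int, List[str]]      # (start, end, member BAC names)


def build_contigs(bacs: Iterable[Bac], min_bacs: int = 2) -> List[Contig]:
    """Merge overlapping BACs on one chromosome into contigs."""
    contigs: List[Contig] = []
    current = None  # [start, end, names] of the contig being extended
    for name, start, end in sorted(bacs, key=lambda b: (b[1], b[2])):
        if current is not None and start <= current[1]:
            # Overlaps the growing contig: extend it.
            current[1] = max(current[1], end)
            current[2].append(name)
        else:
            # Gap reached: close the previous contig and start a new one.
            if current is not None and len(current[2]) >= min_bacs:
                contigs.append((current[0], current[1], current[2]))
            current = [start, end, [name]]
    if current is not None and len(current[2]) >= min_bacs:
        contigs.append((current[0], current[1], current[2]))
    return contigs
```

Calling the function once per chromosome reproduces the chromosome-by-chromosome loop described in the text; setting `min_bacs=1` would also report isolated BACs.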

#### **2.4 Construction of Btau3.5x**


Using Perl scripts and the data set of the mapping of the bovine BESs to the scaffolds of the Btau3.1 genome assembly, an initial minimization of the number of non-tail-to-tail BACs was undertaken. The scripts started with the first scaffold on chromosome 1 of the assembly and, by testing the number of BAC links between this scaffold and all other scaffolds in the assembly, identified the most likely adjacent scaffold and its orientation based on maximising the number of tail-to-tail BACs. Two or more linking tail-to-tail BACs without overlapping BES mapping coordinates on both scaffolds were required to continue the chain. Only high confidence bovine BACs (Ratnakumar, et al., 2009) were used in the assembly.

Adjacent scaffolds assigned to the same chromosome in the Btau3.1 assembly were preferred over a more highly linked scaffold assigned to another chromosome, if the preferred scaffold on the original chromosome was itself linked to an adjacent scaffold on the original chromosome. If no scaffold assigned to the same chromosome as the rest of the chain was linked into the chain by BACs, or the less strongly linked scaffold from the same chromosome terminated the chain, the most highly linked scaffold from another chromosome was incorporated. If the newly added scaffold was linked back to the original chromosome at the next step of scaffold incorporation it was retained in the chain; otherwise the chain was terminated and the chromosome-changing scaffold was also removed from the scaffold chain. For each scaffold in the chain, BAC links from both ends of the scaffold were assessed to enable the inclusion of scaffolds preceding the initiating scaffold, or located between two scaffolds in a chain, but which were only linked to an adjacent following scaffold.

The scaffold chain building process was continued until it was terminated with a scaffold not linked by two or more BACs to another scaffold. The penultimate scaffold in the chain was then tested for BAC links to a second scaffold, which was incorporated if it met the criteria described above, and the chain building process was then continued. If no second linked scaffold could be identified, the scaffold chain building was terminated. The next unincorporated scaffold from the same chromosome of the Btau3.1 assembly was then used to initiate the next scaffold chain. When all scaffolds from the first chromosome had been tested, the first scaffold from the next chromosome was used, and the process was repeated until all scaffolds assigned to a chromosome of the Btau3.1 assembly had been tested.
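The core of the chain-building step, choosing the next scaffold from BAC link counts, can be sketched as a greedy extension. This is a deliberately simplified illustration and not the procedure used to build Btau3.5x: it ignores scaffold orientation, the check for overlapping BES coordinates, extension from both scaffold ends, the look-ahead test for chromosome-changing scaffolds and the retry from the penultimate scaffold. All function and variable names are hypothetical, and each input link is assumed to be one tail-to-tail BAC joining two scaffolds.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Set, Tuple


def count_links(links: Iterable[Tuple[str, str]]) -> Counter:
    """Count linking BACs per unordered scaffold pair."""
    counts: Counter = Counter()
    for a, b in links:
        if a != b:
            counts[frozenset((a, b))] += 1
    return counts


def next_scaffold(current: str, counts: Counter, chrom_of: Dict[str, str],
                  used: Set[str], min_links: int = 2) -> Optional[str]:
    """Pick the best-linked unused neighbour, preferring the same chromosome."""
    candidates = []
    for pair, n in counts.items():
        if current in pair and n >= min_links:
            (other,) = pair - {current}
            if other not in used:
                candidates.append((other, n))
    if not candidates:
        return None
    same = [c for c in candidates if chrom_of.get(c[0]) == chrom_of.get(current)]
    pool = same or candidates  # fall back to other chromosomes if needed
    # Most links first; scaffold name breaks ties deterministically.
    return min(pool, key=lambda c: (-c[1], c[0]))[0]


def build_chain(seed: str, counts: Counter, chrom_of: Dict[str, str],
                min_links: int = 2) -> List[str]:
    """Greedily extend a chain from a seed until no qualifying link remains."""
    chain, used = [seed], {seed}
    while True:
        nxt = next_scaffold(chain[-1], counts, chrom_of, used, min_links)
        if nxt is None:
            return chain
        chain.append(nxt)
        used.add(nxt)
```

The `min_links=2` default corresponds to the requirement of two or more linking BACs; the remaining rules from the text would be layered on top of `next_scaffold` in a fuller implementation.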

Once the scaffold chain assembly had been completed, the scaffolds not assigned to chromosomes in the Btau3.1 assembly (the UnChr) were linked into the scaffold chains in a similar, but separate, process. The resulting scaffold chains were then ordered and oriented using the consensus of the mapping of the order of the BACs in the physical bovine BAC map (Snelling, et al., 2007) to the BACs in the bovine scaffold chains. The initial data set Btau3.1x was then displayed as a browseable genome using Gbrowse (Stein, et al., 2002) to allow the integrity of the assembly of the scaffolds to be visually assessed. Genome contigs, scaffolds and bovine BAC mapping positions were displayed as separate groups. Clusters of non-congruent BACs (i.e. not tail-to-tail) identified regions with remaining assembly problems.

Using Perl scripts and the data set of the mapping of the sheep BESs to the Btau3.1 genome assembly, including BES mapping data integrated onto the Btau3.1 assembly from the horse, dog and vsg1.2 assemblies, sheep BACs were assigned to tail-to-tail etc. groups and displayed on the Btau3.1x genome browser in a series of tracks. Positions of the BESs mapped to the separate genomes were integrated on Btau3.1 as previously described (Dalrymple, et al., 2007).

The mappings of the bovine genome assembly Btau3.1 sequence contigs to the human, dog and horse genomes were displayed as separate tracks on the Btau3.1x genome browser using the UCSC chromosome colour scheme (Fujita, et al., 2011) to identify the chromosome of best match in the relevant species. Asymmetric symbols were used to represent the orientation of the mapping of the contigs to the human, dog and horse genomes relative to the bovine genome. The chromosomal coordinates of the mapping in the non-bovine genome were also readily accessible to the users of the browser using mouse-over and mouse-click display boxes. This information was used in the manual refinement of the assembly, in particular in the definition of scaffold split points for the insertion of other scaffolds and/or the inversion of small numbers of adjacent contigs within a scaffold, where extensive use was made of comparative genomics information at the level of the sequence contigs.

Subsequently, four major rounds of revision and refinement of the bovine genome assembly were undertaken manually, and decisions on the chromosomal assignment, order of scaffolds and orientation of scaffolds and of sequence contigs were made based on the cattle and sheep BAC mapping and the comparative genomics. Generally, in cases of ambiguity, parsimony was applied. For the construction of each new version of the assembly, changes were recorded in an Excel spreadsheet and Perl scripts were used to convert the Excel spreadsheet into a genome assembly agp file (AGP\_file\_specification). The agp file was used to generate the sequence of the genome assembly, the coordinate conversion chain file (for use by the LiftOver utility) and the contig and scaffold tracks for the genome browser version for the new assembly. For each successive version of the revised assembly of the bovine genome, the manual revision was undertaken interactively using the tracks on the genome browser to make decisions.

#### **2.5 Construction of the virtual sheep genome vsg2.0**

To generate the virtual sheep genome assembly, the midpoint between each pair of BAC-CGCs built using sheep BACs on the bovine Btau3.5x genome assembly was identified. If the midpoint was located in a gene (NCBI human RefSeq mRNAs (NCBI\_RefSeq) were used to define the extent of a gene), the position closest to the midpoint and not in a gene was identified. The flanking BAC-CGCs were then extended to this point, or, in the case of the first and last BAC-CGCs on a chromosome, to the start or end coordinate of the chromosome. Thus all nucleotides in the bovine genome sequence were included in a block, and therefore the virtual sheep genome sequence is exactly the same length as the bovine Btau3.5x genome sequence.

The order and orientation of the bovine genome assembly Btau3.5x-based sheep BAC-CGCs in the vsg2.0 were determined from the location and organisation of the sheep linkage map markers (Maddox, et al., 2001) mapped to the Btau3.1 genome and converted to the Btau3.5x assembly using the Btau3.1 to Btau3.5x coordinate conversion chain file and the LiftOver utility. Using Perl scripts, the agp file (AGP\_file\_specification) was built and used to generate the sequence of the virtual sheep genome assembly, the coordinate conversion chain file (for use by the LiftOver utility), and the contig and scaffold tracks for the virtual sheep genome browser (VSG).

Using the LiftOver utility and the Btau3.5x to virtual sheep genome coordinate conversion chain file, the BES and BAC-CGC mapping coordinates, and any other features mapped to the Btau3.5x bovine genome, were converted to the virtual sheep genome coordinates. Features were also transferred from the Btau3.1 genome assembly by first converting to Btau3.5x coordinates using the Btau3.1 to Btau3.5x coordinate conversion file and the LiftOver utility, and then converting from Btau3.5x to vsg2.0 coordinates. Other features were mapped directly onto the virtual sheep genome using sequence alignment programs such as BLAST and BLAT with the vsg2.0 DNA sequence.

#### **3. Results and discussion**

#### **3.1 Identification of problems with the Btau3.1 assembly of the bovine genome**

The cow is the most closely related organism to sheep for which a genome assembly is available. When this project was commenced, an early draft of the bovine genome assembly Btau3.1 (Elsik, et al., 2009) was in the public domain. Since the sheep genome assembly would be built comparatively on the bovine genome, and sheep sequence contigs from the low-coverage sequencing of six animals, at approximately 0.5-fold coverage each, were expected to be very small, the accuracy of the bovine genome assembly would determine the accuracy of the sheep assembly at all levels above that of the individual sequence contigs.

To assess the validity of this strategy, the sheep BESs from the CHORI-243 library were mapped to the Btau3.1 genome assembly to identify the extent of segments of conserved synteny between the two genomes. The reader should keep in mind that the only BACs counted as being in the same organisation in the comparison genome as in the source genome (i.e. congruent) are the tail-to-tail BACs less than 200 kb in length. Unexpectedly large numbers of sheep BACs, more than 17% of the BACs with both ends mapped, had both BESs positioned on the bovine Btau3.1 genome assembly within 200 kb of each other, but not in the expected tail-to-tail organisation, i.e. many BACs had their two BESs mapped in the tail-to-head and head-to-head organisations (Table 1). In addition, large numbers of BACs had both BESs positioned on the same chromosome, but more than 200 kb apart, the outsize groups (Table 1). The average insert size of the BACs in the sheep BAC library is 184 kb (Dalrymple, et al., 2007).

Such a result would normally suggest a substantial number of intra-chromosomal rearrangements between the sheep and cattle genomes. However, almost as many, more than 14%, of bovine BACs were also not positioned as tail-to-tail BACs on the bovine Btau3.1 genome assembly (Table 1). The organisation of sheep BACs at the locations of these apparent rearrangements between the two genomes was compared with the organisation of

To generate the virtual sheep genome assembly the mid point between each pair of BAC-CGCs built using sheep BACs on the bovine Btau3.5x genome assembly was identified. If the mid point was located in a gene (NCBI human RefSeq mRNAs (NCBI\_RefSeq) were used to define the extent of a gene) the position closest to the midpoint and not in a gene was identified. The flanking BAC-CGCs were then extended to this point, or in the case of the first and last BAC-CGCs on a chromosome to the start or end coordinate of the

comparative genomics information at the level of the sequence contigs.

assembly problems.

(Dalrymple, et al., 2007).

genome browser to make decisions.

**2.5 Construction of the virtual sheep genome vsg2.0** 

chromosome. Thus all nucleotides in the bovine genome sequence were included in a block and therefore the virtual sheep genome sequence is exactly the same length as the bovine Btau3.5x genome sequence.
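The block construction described above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names `boundary` and `build_blocks`, the inclusive 1-based coordinates and the assumption that gene intervals do not overlap one another are all choices made for this example.

```python
def boundary(left_end, right_start, genes):
    """Split point between two adjacent BAC-CGCs (hypothetical helper).

    genes: list of (start, end) inclusive intervals, assumed non-overlapping.
    Returns the midpoint, or the nearest position outside the gene that
    contains the midpoint.
    """
    mid = (left_end + right_start) // 2
    for g_start, g_end in genes:
        if g_start <= mid <= g_end:
            before, after = g_start - 1, g_end + 1
            # choose whichever side of the gene is closer to the midpoint
            return before if mid - before <= after - mid else after
    return mid


def build_blocks(cgcs, genes, chrom_len):
    """Extend sorted BAC-CGCs (start, end) into blocks covering the chromosome."""
    cuts = [boundary(a[1], b[0], genes) for a, b in zip(cgcs, cgcs[1:])]
    starts = [1] + [c + 1 for c in cuts]   # first block starts at the chromosome start
    ends = cuts + [chrom_len]              # last block ends at the chromosome end
    return list(zip(starts, ends))


blocks = build_blocks([(100, 200), (400, 500), (800, 900)], [(290, 320)], 1000)
print(blocks)
```

Because every cut closes one block and opens the next, the block lengths sum to the chromosome length, which is the property that makes the virtual genome the same length as the reference.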

The order and orientation of the bovine genome assembly Btau3.5x-based sheep BAC-CGCs in the vsg2.0 were determined from the location and organisation of the sheep linkage map markers (Maddox, et al., 2001) mapped to the Btau3.1 genome and converted to the Btau3.5x assembly using the Btau3.1 to Btau3.5x coordinate conversion chain file and the LiftOver utility. Using Perl scripts, the agp file (AGP\_file\_specification) was built and used to generate the sequence of the virtual sheep genome assembly, the coordinate conversion chain file (for use by the LiftOver utility), and the contig and scaffold tracks for the virtual sheep genome browser (VSG).
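As an illustration of the kind of agp file referred to above, the following Python sketch writes AGP 2.0-style lines for an ordered list of components. It is a simplified stand-in for the Perl scripts, which are not shown in this chapter; the input layout (component name, length, orientation), the fixed 100 bp gap, the gap type `contig` and the function name `rows_to_agp` are assumptions made for the example.

```python
def rows_to_agp(chrom, rows, gap_len=100):
    """Build tab-separated AGP lines for one chromosome.

    rows: list of (component_id, length, orientation) in chromosome order.
    A fixed-length gap line is placed between consecutive components
    (an assumption of this sketch).
    """
    lines = []
    pos = 1    # next free 1-based coordinate on the object
    part = 1   # AGP part number
    for i, (comp_id, length, orient) in enumerate(rows):
        if i > 0:
            gap_end = pos + gap_len - 1
            fields = [chrom, pos, gap_end, part, "N", gap_len, "contig", "no", "na"]
            lines.append("\t".join(map(str, fields)))
            pos, part = gap_end + 1, part + 1
        end = pos + length - 1
        fields = [chrom, pos, end, part, "W", comp_id, 1, length, orient]
        lines.append("\t".join(map(str, fields)))
        pos, part = end + 1, part + 1
    return lines


agp = rows_to_agp("chr1", [("BTA1.1", 1000, "+"), ("BTAUn.418", 500, "-")])
print("\n".join(agp))
```

Keeping the ordered component list as the single source of truth means that the sequence, the chain file and the browser tracks can all be regenerated after each manual revision.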

Using the LiftOver utility and the Btau3.5x to virtual sheep genome coordinate conversion chain file, the BES and BAC-CGC mapping coordinates, and any other features mapped to the Btau3.5x bovine genome, were converted to the virtual sheep genome coordinates. Features were also transferred from the Btau3.1 genome assembly by first converting to Btau3.5x coordinates using the Btau3.1 to Btau3.5x coordinate conversion file and the LiftOver utility and then converting from Btau3.5x to vsg2.0 coordinates. Other features were mapped directly onto the virtual sheep genome using sequence alignment programs such as BLAST and BLAT with the vsg2.0 DNA sequence.
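The two-step transfer of features (Btau3.1 to Btau3.5x, then Btau3.5x to vsg2.0) can be sketched as a composition of two block maps. This is a conceptual Python illustration only: it does not read the chain file format or call the LiftOver utility, and the block tuple layout and the function names `lift` and `lift_via` are assumptions made for the example.

```python
def lift(pos, blocks):
    """Convert one 1-based position using a list of aligned blocks.

    blocks: (src_start, src_end, dst_start, strand) tuples, where dst_start
    is the lowest destination coordinate of the block and strand is "+"
    or "-" (block inverted in the destination assembly).
    """
    for src_start, src_end, dst_start, strand in blocks:
        if src_start <= pos <= src_end:
            offset = pos - src_start if strand == "+" else src_end - pos
            return dst_start + offset
    return None  # position is not covered by any block


def lift_via(pos, first, second):
    """Apply two conversions in sequence, e.g. Btau3.1 -> Btau3.5x -> vsg2.0."""
    mid = lift(pos, first)
    return None if mid is None else lift(mid, second)


# Hypothetical toy maps: two scaffolds swapped (one inverted), then a shift.
btau31_to_btau35x = [(1, 100, 201, "+"), (101, 200, 1, "-")]
btau35x_to_vsg = [(1, 300, 1001, "+")]
print(lift_via(10, btau31_to_btau35x, btau35x_to_vsg))
```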

#### **3. Results and discussion**

#### **3.1 Identification of problems with the Btau3.1 assembly of the bovine genome**

The cow is the most closely related organism to sheep for which a genome assembly is available. When this project was commenced, an early draft of the bovine genome assembly Btau3.1 (Elsik, et al., 2009) was in the public domain. The sheep genome assembly would be built comparatively on the bovine genome, and the sheep sequence contigs, derived from six animals each sequenced at approximately 0.5 fold coverage, were expected to be very small. The accuracy of the bovine genome assembly would therefore determine the accuracy of the sheep assembly at all levels above that of the individual sequence contigs.

To assess the validity of this strategy, the sheep BESs from the CHORI-243 library were mapped to the Btau3.1 genome assembly to identify the extent of segments of conserved synteny between the two genomes. The reader should keep in mind that the only BACs counted as being in the same organisation in the comparison genome as in the source genome (i.e. congruent) are the tail-to-tail BACs less than 200kb in length. Unexpectedly large numbers of sheep BACs, more than 17% of the BACs with both ends mapped, had both BESs positioned on the bovine Btau3.1 genome assembly within 200kb of each other, but not in the expected tail-to-tail organisation, i.e. many BACs had their two BESs mapped in the tail-to-head and head-to-head organisations (Table 1). In addition, large numbers of BACs had both BESs positioned on the same chromosome, but more than 200kb apart (the outsize groups, Table 1). The average insert size of the BACs in the sheep BAC library is 184kb (Dalrymple, et al., 2007).

Such a result would normally suggest a substantial number of intra-chromosomal rearrangements between the sheep and cattle genomes. However, almost as many, more than 14%, of bovine BACs were also not positioned as tail-to-tail BACs on the bovine Btau3.1 genome assembly (Table 1). The organisation of sheep BACs at the locations of these apparent rearrangements between the two genomes was compared with the organisation of bovine BACs at the same locations in the genomes. Frequently, clusters of tail-to-head sheep BACs overlapped with clusters of tail-to-head bovine BACs (Fig 1), suggesting that many such occurrences were in fact due to an incorrect assembly of the bovine genome, not true differences in the structures of the two genomes themselves. However, many clusters of tail-to-head sheep BACs that did not overlap with tail-to-head bovine BACs were also observed (Fig 1). These BACs probably represent rearrangements in the sheep genome relative to the bovine genome.

Fig. 1. Segment of chromosome one of the bovine Btau3.1 genome assembly showing the positions and orientations of sheep and cattle BACs. BCM genome assembly contigs are coloured based on the human chromosome to which they have the highest scoring match. The circles identify regions of likely inversion in the bovine and/or sheep genomes relative to the Btau3.1 genome assembly.

| Genome assembly | Btau3.1 | Btau3.1 | Btau3.5x | vsg2.0 | vsg1.2 |
|---|---|---|---|---|---|
| BAC origin | cattle | sheep | cattle | sheep | sheep |
| tail-to-tail (%) | 86.7% | 82.7% | 95.63% | 94.0% | 89.6% |
| tail-to-tail outsize | 2.0% | 2.6% | 0.4% | 0.8% | 1.7% |
| tail-to-head | 3.5% | 4.4% | 2.6% | 2.1% | 2.6% |
| tail-to-head outsize | 5.2% | 7.1% | 0.9% | 2.1% | 4.4% |
| head-to-head | 0.4% | 0.4% | 0.4% | 0.2% | 0.3% |
| head-to-head outsize | 2.1% | 2.8% | 0.1% | 0.6% | 1.4% |
| tail-to-tail (number) | 67,352 | 47,818 | 82,765 | 95,757 | 84,624 |

breaks (reported for three of the five columns): 13,192; 19,151; 27,829. unpaired (reported for three of the five columns): 79,172; 50,142; 52,663.

Table 1. Mapping of cattle and sheep BACs to assemblies of the cattle and virtual sheep genomes.
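The organisation classes tabulated in Table 1 can be assigned with a few lines of Python. The sketch below is illustrative only. In particular, the strand convention is an assumption of this example: a pair is called tail-to-tail when the leftmost end maps to the "+" strand and the rightmost end to the "-" strand, head-to-head when the reverse is true, and tail-to-head when both ends map to the same strand. The 200kb limit is taken from the text; the function name `classify` is hypothetical.

```python
MAX_SPAN = 200_000  # pairs further apart than this are "outsize"


def classify(chrom_a, pos_a, strand_a, chrom_b, pos_b, strand_b, max_span=MAX_SPAN):
    """Classify the two mapped end sequences (BESs) of one BAC."""
    if chrom_a != chrom_b:
        return "different chromosomes"
    # order the two ends by position along the chromosome
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos_a, strand_a), (pos_b, strand_b)]
    )
    if left_strand == "+" and right_strand == "-":
        kind = "tail-to-tail"      # assumed convention: ends point towards each other
    elif left_strand == "-" and right_strand == "+":
        kind = "head-to-head"      # assumed convention: ends point away from each other
    else:
        kind = "tail-to-head"      # both ends on the same strand
    if right_pos - left_pos > max_span:
        kind += " outsize"
    return kind


print(classify("1", 1_000, "+", "1", 185_000, "-"))
```

Only pairs returned as plain `tail-to-tail` would count as congruent under the definition used in this section.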

#### **3.2 Using cattle and sheep BACs to reorganise the Btau3.1 assembly of the bovine genome**

The first step in the generation of the virtual sheep genome was therefore to construct the best approximation to the correct order of the bovine sequence contigs and scaffolds in the bovine genome using the bovine and sheep BACs and comparative genomics. Initially, the scaffolds in the bovine genome assembly (Btau3.1) were kept intact and scaffolds were reordered and reoriented within bovine chromosomes to minimize the number of both cattle and sheep BACs that were not in the tail-to-tail organisation. Then scaffolds apparently assigned to the wrong chromosomes on the basis of the BAC-based links to other scaffolds in the assembly were moved, including being inserted into gaps in other scaffolds guided by the mapping of the BESs. Generally these moves were also supported by the mapping of the sequence contigs to the human, dog and horse genomes (Fig. 2). In addition, scaffolds not assigned to chromosomes in Btau3.1 were included in the assembly where BACs provided unambiguous links. Finally, reordering and reorienting of contigs within the new set of ordered and reoriented scaffolds was undertaken.

Given the size of the BACs and the variation in the length of the genomic DNA contained within the BACs, the correct position to insert many segments of the bovine assembly was ambiguous based solely on the BAC-end data. Throughout this process, which was mainly undertaken manually, the alignment of the bovine genome assembly contigs to the human, dog and horse genome assemblies was used in making the final decision about where exactly to insert or break scaffolds. In other words, a breakpoint between sequence contigs in an assembly scaffold was chosen that was consistent with the cattle and sheep BES data and the organisation of the human, dog and horse genomes (Fig. 2). Where conflicts between the comparative genome assemblies occurred, two out of three consistent organisations were required. The integrity of sequence contigs was maintained throughout the process, although evidence for chimeric sequence contigs was also identified during the course of the analysis (data not shown).
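The two-out-of-three rule for the comparative genomes can be expressed as a simple vote. In the study these decisions were mainly made manually, so the Python sketch below only illustrates the rule; the function name `consensus`, the string labels for arrangements and the use of `None` for an uninformative genome are assumptions made for the example.

```python
from collections import Counter


def consensus(calls, required=2):
    """Return the arrangement supported by at least `required` genomes.

    calls: dict mapping species name -> proposed arrangement, or None
    when that genome gives no information at this location.
    """
    votes = Counter(v for v in calls.values() if v is not None)
    if not votes:
        return None
    arrangement, count = votes.most_common(1)[0]
    return arrangement if count >= required else None


print(consensus({"human": "A-B", "dog": "A-B", "horse": "B-A"}))
```

A result of `None` corresponds to a location where the comparative data do not resolve the ambiguity.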

To avoid ovinising the bovine genome, at least one bovine BAC was required to support all reorganisations, except reordering and reorienting scaffolds within chromosomes in cases where the bovine BAC fingerprint map (Snelling, et al., 2007) also supported the reorganisation. This process was undertaken iteratively to resolve any errors introduced or new links identified as the chromosome structures approached the most likely structure of the bovine genome. This revised assembly of the bovine genome based on Btau3.1 was named Btau3.5x.

Fig. 2. Segment of chromosome one of the bovine Btau3.5x genome assembly showing the positions and orientations of sheep and cattle BACs. BCM genome assembly contigs are coloured and orientated based on the relevant species chromosome to which they have the highest scoring match. The UCSC chromosome colour scheme was used (Fujita, et al., 2011).

In Btau3.5x the number of bovine assembly scaffolds was reduced from 3053 scaffolds assigned to chromosomes in Btau3.1 to 537 super-scaffolds linked by the cattle and sheep BACs. Of the chromosomally assigned scaffolds in Btau3.1, 974 scaffolds were inverted, and 683 scaffolds were split into 1720 pieces, of which 710 were inverted. 14 scaffolds were moved to a different chromosome and 2192 scaffolds previously not assigned to chromosomes were incorporated into the assembly. 104 of these scaffolds were split into 233 pieces. Coverage of the genome with scaffolds assigned to chromosomes increased from 2.4 Gb to 2.77 Gb. Even after this process, it is likely that there remained a number of segments of the bovine genome assembly which may not have been correctly assembled.

| chromosome | scaffold | orientation | integrity |
|---|---|---|---|
| 1 | BTAUn.418 | inv | |
| 1 | BTAUn.728 | inv | |
| 1 | BTAUn.1364 | inv | |
| 1 | BTAUn.2125 | inv | |
| 1 | BTAUn.1438 | inv | |
| 1 | BTAUn.5341 | inv | |
| 1 | BTAUn.208 | | |
| 1 | BTAUn.1381 | | |
| 1 | BTAUn.3041 | | |
| 1 | BTA1.8 | | |
| 1 | BTA1.2 | | |
| 1 | BTA1.3 | | |
| 1 | BTA1.4 | | |
| 1 | BTA1.1 | | split |
| 1 | BTA1.6 | | split |
| 1 | BTA1.6 | | split |
| 1 | BTA1.5 | | split |
| 1 | BTA1.5 | | split |
| 1 | BTA1.7 | | split |
| 1 | BTA1.7 | | split |

Table 2. The first twenty scaffolds of the bovine Btau3.5x assembly. Scaffolds numbered BTA1.\* were assigned in numerical order to chromosome 1 of the bovine genome assembly Btau3.1 build. Scaffolds numbered BTAUn.\* were not assigned to a chromosome in the bovine Btau3.1 build. "inv" indicates scaffolds inverted in the Btau3.5x genome build relative to the Btau3.1 build, and "split" indicates scaffolds split in the Btau3.5x build relative to the Btau3.1 build.

#### **3.3 Integration of the positions of sheep BESs on the Btau3.5x, dog, horse and vsg1.2 genome assemblies**

We then used the virtual genome strategy (Dalrymple, et al., 2007), integrating the separate mapping of the sheep BESs to the original virtual sheep genome (v1.2), the dog and the horse genome assemblies, to maximise the positioning of sheep BESs on Btau3.5x. There was little change in the human genome assembly over the course of the work, and the mapping of the sheep BACs to the human genome was captured by using the virtual sheep genome v1.2. Thus the virtual sheep genome version 2 was built on top of v1.2, rather than being a completely *de novo* version. This approach, which uses much lower specificity BLAST parameters, increased the number of sheep BACs able to be positioned on the bovine genome substantially, from 47,818 (in the initial alignments) to 95,757 in the virtual sheep genome, effectively doubling the coverage of the genome (Table 1). The number of sheep BACs able to be positioned in the tail-to-tail organisation in a genome is a complex function of the sequence coverage, assembly stage and evolutionary distance from the bovine genome. The greater distance of the dog genome appears to be partially compensated for by the more advanced state of the assembly used in this analysis. Very similar numbers of BACs were mapped in the tail-to-tail organisation to the two genomes (Table 3) with similar numbers of unique BACs (Table 4).

| | bovine | horse | dog | vsg1.2 |
|---|---|---|---|---|
| bovine | 77,320 | | | |
| horse | 49,225 | 60,971 | | |
| dog | 46,355 | 47,142 | 57,192 | |
| vsg1.2 | 62,889 | 56,636 | 54,503 | 84,624 |

Table 3. Tail-to-tail BACs within each dataset generated by independently mapping the sheep BESs to each genome and in the intersections between each of the datasets.

| genome | including vsg1.2 | excluding vsg1.2 |
|---|---|---|
| bovine | 10,550 | 20,171 |
| horse | 701 | 3,035 |
| dog | 211 | 2,063 |
| vsg1.2 | 9,204 | not applicable |

Table 4. Tail-to-tail BACs unique to each dataset.

The high coverage and quality of the human genome assembly and the use of the integration strategy presumably contributed to the large number of unique BACs in the tail-to-tail organisation present in vsg1.2 (Tables 3 and 4). Over and above the newer assembly of the bovine genome, the inclusion of the mapping of the sheep BACs to the horse genome assembly has the biggest impact on the number of BACs assigned and on the number of BAC contigs, where fewer is better (Table 5). This is not surprising since, of the genomes used, the horse is the most closely related species to the two ruminants.

Adding the horse mapping of the sheep BESs positions to the bovine mapping of the sheep BESs positions increased the number of BACs mapped by 13,940, mainly by generating BACs with one end directly mapped to the bovine genome and the other end mapped to the bovine genome via the horse genome (Table 6). Subsequent addition of the BESs mapped via the dog genome added many fewer BACs than adding the BESs mapped via the horse genome (Table 6). The subsequent addition of the human genome data, incorporated in the vsg1.2, added slightly more BACs than the addition of the dog genome (Table 4). Thus including the dog genome had only a small impact on the improvement in the coverage of the virtual sheep genome, whereas the more distant, but better assembled and higher coverage, human genome was a useful addition to the virtual genome construction. Not unexpectedly, the biggest contributions came from well assembled genomes of closely related species.

| | bovine | horse | dog | vsg1.2 | total |
|---|---|---|---|---|---|
| bovine | 76,251 | | | | 76,251 |
| horse | 13,231 | 709 | | | 13,940 |
| dog | 1,911 | 112 | 45 | | 2,068 |
| vsg1.2 | 3,131 | 284 | 68 | 55 | 3,538 |
| total | | | | | 95,797 |

Table 5. Genomes providing mapping information for the sheep BACs mapped tail-to-tail in the vsg2.0. Datasets were added in the order, bovine, horse, dog and vsg1.2.

| Genome BACs positioned by | number |
|---|---|
| Both BESs vsg1.2 | 80,146 |
| Both BESs bovine | 13,196 |
| One BES bovine or vsg1.2, other BES horse or dog | 2,431 |
| Both BESs horse | 16 |
| One BES horse, other BES dog | 5 |
| Both BESs dog | 3 |
| Total | 95,797 |

Table 6. Genomes used to position the BACs on the virtual sheep genome. Datasets were added in the order, vsg1.2, bovine, horse, and dog.

In other words, building on top of vsg1.2 and the use of a higher quality assembly of the bovine genome contributed a large number of new BACs with both ends positioned on the bovine assembly (Table 5). A large group of BACs were positioned with one end using the bovine or vsg1.2 position and the other using horse or dog. Very few BACs were positioned solely using horse and/or dog positions (Table 6). On this basis further improvement of the vsg would appear to be difficult and most likely to come from filling of gaps in the bovine genome sequence itself.
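The effect of adding datasets in a fixed order can be sketched as a priority lookup: each BES takes its position from the first dataset, in the stated order, that places it. The Python below is an illustration under assumptions and not the authors' procedure. It assumes that all positions have already been converted to bovine coordinates, that the two ends of a BAC are named with hypothetical `.F` and `.R` suffixes, and that the names `place_bes` and `place_bac` are free to choose.

```python
PRIORITY = ["vsg1.2", "bovine", "horse", "dog"]  # order given for Table 6


def place_bes(bes_id, datasets, priority=PRIORITY):
    """Return (source, position) from the first dataset that places this BES."""
    for name in priority:
        hit = datasets.get(name, {}).get(bes_id)
        if hit is not None:
            return name, hit
    return None


def place_bac(bac_id, datasets, priority=PRIORITY):
    """Place both ends of a BAC; return None unless both ends are placed."""
    ends = [place_bes(f"{bac_id}.{end}", datasets, priority) for end in ("F", "R")]
    if any(e is None for e in ends):
        return None
    return ends


# Hypothetical toy data: one end placed directly, the other only via horse.
datasets = {
    "bovine": {"B1.F": ("chr1", 100)},
    "horse": {"B1.F": ("chr1", 120), "B1.R": ("chr1", 180_000)},
}
print(place_bac("B1", datasets))
```

Recording the source of each end in this way is what allows BACs to be tallied by the combination of genomes that positioned them.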

Based on the mapping of the sheep BACs to the reorganised bovine genome assembly, 943 blocks of conserved synteny, defined by overlapping sheep BACs, were identified between the sheep and cattle genomes (Table 7). Assuming a genome size of 3Gb, the blocks had an average length of just over 3Mb. Although this was initially disappointing, even in the bovine genome assembly 537 BAC-based super-scaffolds were required to cover the complete genome. The comparison of the number of blocks of conserved synteny identified across the different combinations of datasets demonstrates that the inclusion of additional species beyond the horse has a much greater impact on the reduction in the number of blocks of conserved synteny than it has on the total number of BACs positioned tail-to-tail. There was only a 25% increase in the number of BACs, but a 56% decrease in the number of blocks of conserved synteny, i.e. on average every block of conserved synteny defined based on the mapping of BACs to the bovine genome has been extended to include one adjacent block of conserved synteny.

| genomes | Sheep BAC contigs |
|---|---|
| dog | 2,146 |
| horse | 1,470 |
| bovine | 1,411 |
| bovine + dog + vsg1.2 | 1,299 |
| bovine + horse + dog + vsg1.2 | 943 |

Table 7. Building the virtual sheep genome.

#### **3.4 Remaining ambiguities in the build of the bovine genome**

Since there were many occasions on which there was no unambiguous basis on which to identify the correct break points in the bovine genome assembly, a large number of probable inversions identified by BACs remained in the final version of the bovine genome. Most of these inversions were also supported by sheep BACs (Fig 2). In addition, whilst potentially chimeric bovine genomic sequence contigs were identified during the reassembly process, their structure has not been changed in Btau3.5x.

#### **3.5 Construction of the virtual sheep genome (vsg2.0)**

The sheep markers (sheep map version 4.7) were used to reorganise the bovine genome assembly into the vsg. In the main this involved renumbering of the bovine chromosomes, with five inverted chromosomes (or segments of chromosomes), four chromosome fusions and a single chromosome breakage (Table 8). Reordering of the segments of the bovine genome defined by the BAC comparative genome contigs (CGCs) was undertaken on four chromosomes, 7, 12, 13 and X. Apart from the X chromosome, these were local changes and involved a small number of BAC CGCs covering a small region of the genome. Given the variation in the size of BACs, and the lack of comparative data from other genomes for species specific breaks, the boundaries of such breaks could not be unambiguously identified with the data currently available. Thus no attempt was made to resolve the small potential sheep specific rearrangements within chromosomes where the break points were ambiguous and there was not sufficient marker evidence to support a change in the organisation (Fig 3).

The vsg 2.0 has been used in a number of analyses of the genome organisation of sheep and in general a high level of congruence with maps determined using other approaches has been observed (Drogemuller, et al., 2008; Wu, et al., 2008; Goldammer, et al., 2009c; Wu, et al., 2009), although the vsg 2 X chromosome build appears to contain a number of significant discrepancies (Goldammer, et al., 2009a; Goldammer, et al., 2009b).

| Sheep chromosome | Cattle chromosome |
|---|---|
| OAR1 | BTA3 (inv) + BTA1 |
| OAR2 | BTA8 (inv) + BTA2 |
| OAR3 | BTA11 (inv) + BTA5 |
| OAR4 | BTA4 |
| OAR5 | BTA7 |
| OAR6 | BTA6 |
| OAR7 | BTA10 |
| OAR8 | BTA9 (part) |
| OAR9 | BTA9 (part, inv) + BTA14 |
| OAR10 | BTA12 |
| OAR11 | BTA19 |
| OAR12 | BTA16 |
| OAR13 | BTA13 |
| OAR14 | BTA18 |
| OAR15 | BTA15 |
| OAR16 | BTA20 |
| OAR17 | BTA17 |
| OAR18 | BTA21 |
| OAR19 | BTA22 |
| OAR20 | BTA23 |
| OAR21 | BTA29 |
| OAR22 | BTA26 |
| OAR23 | BTA24 |
| OAR24 | BTA25 |
| OAR25 | BTA28 |
| OAR26 | BTA27 |
| OARX | BTAX (inv) |

Table 8. High level comparison of the sheep and cattle genomes based on virtual sheep genome analysis.

#### **3.6 Construction of the virtual sheep genome (vsg2.0) genome browser**

The cattle and sheep BAC and BES locations are displayed on the chromosome overview track of the virtual sheep genome browser (VSG), allowing a quick assessment of the quality of the assembly to be made (Fig 3). In addition, the sheep virtual genome assembly was annotated with the locations of the sheep markers, SNPs on the 1536 pilot sheep SNP chip (Kijas, et al., 2009) and the Illumina Ovine SNP50 BeadChip, and human and bovine mRNA RefSeqs downloaded from the NCBI (NCBI\_RefSeq).

their structure has not been changed in Btau3.5x.

organisation (Fig 3).

**3.5 Construction of the virtual sheep genome (vsg2.0)** 

discrepancies (Goldammer, et al., 2009a; Goldammer, et al., 2009b).

**3.4 Remaining ambiguities in the build of the bovine genome** 


Table 8. High level comparison of the sheep and cattle genomes based on virtual sheep genome analysis.
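The kinds of operation summarised for the vsg build (renumbering, inversion, fusion and breakage of bovine chromosomes) can be thought of as a coordinate mapping from the bovine assembly onto virtual sheep chromosomes. The Python fragment below is only a minimal sketch of that idea. The `Segment` structure, the function `bovine_to_virtual` and every chromosome name and coordinate in `LAYOUT` are hypothetical placeholders chosen for illustration; they are not values from Table 8 and do not represent the authors' actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Segment:
    """One block of the source (bovine) assembly placed on a virtual chromosome."""
    src_chrom: str          # source chromosome name
    src_start: int          # 1-based inclusive start on the source chromosome
    src_end: int            # 1-based inclusive end on the source chromosome
    dst_chrom: str          # virtual chromosome that receives the block
    dst_offset: int         # 1-based position on dst_chrom where the block begins
    inverted: bool = False  # True if the block is placed in reverse orientation


def bovine_to_virtual(
    segments: List[Segment], chrom: str, pos: int
) -> Optional[Tuple[str, int]]:
    """Translate a source position into (virtual chromosome, position).

    Returns None when the position is not covered by any segment.
    """
    for seg in segments:
        if seg.src_chrom == chrom and seg.src_start <= pos <= seg.src_end:
            if seg.inverted:
                # Distance is measured from the far end of an inverted block.
                offset = seg.src_end - pos
            else:
                offset = pos - seg.src_start
            return seg.dst_chrom, seg.dst_offset + offset
    return None


# Invented layout (all names and sizes are made up for this sketch):
LAYOUT = [
    # simple renumbering: srcA becomes virt1
    Segment("srcA", 1, 1000, "virt1", 1),
    # fusion: srcB followed by an inverted srcC forms virt2
    Segment("srcB", 1, 500, "virt2", 1),
    Segment("srcC", 1, 300, "virt2", 501, inverted=True),
    # breakage: srcD is split between virt3 and virt4
    Segment("srcD", 1, 400, "virt3", 1),
    Segment("srcD", 401, 900, "virt4", 1),
]

if __name__ == "__main__":
    for query in [("srcA", 10), ("srcC", 1), ("srcD", 401), ("srcE", 5)]:
        print(query, "->", bovine_to_virtual(LAYOUT, *query))
```

Keeping the layout as a flat list of segments means that a fusion is simply two segments sharing a destination chromosome, and a breakage is two segments sharing a source chromosome, so no special cases are needed for either.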

#### **3.6 Construction of the virtual sheep genome (vsg2.0) genome browser**

The cattle and sheep BAC and BES locations are displayed on the chromosome overview track of the virtual sheep genome browser (VSG) allowing a quick assessment of the quality of the assembly to be made (Fig 3). In addition, the sheep virtual genome assembly was annotated with the locations of the sheep markers, SNPs on the 1536 pilot sheep SNP chip (Kijas, et al., 2009) and the Illumina Ovine SNP50 BeadChip, and human and bovine mRNA RefSeqs downloaded from the NCBI (NCBI\_RefSeq).
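The browser tracks group BACs by the relative orientation of their two end sequences and by whether the mapped span is unusually large or small ("Outsize"). The following Python sketch shows one way such a classification could be computed. It rests on several assumptions that are not stated in this section: that "Tail-Tail" denotes the expected arrangement in which the two BESs point towards each other, that "Head-Head" denotes BESs pointing away from each other, that "Tail-Head" denotes BESs on the same strand, and that the span limits `min_span` and `max_span` are arbitrary illustrative values. The names `BesHit` and `classify_bac` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BesHit:
    """Mapped position of one BAC end sequence (BES)."""
    chrom: str
    start: int   # 1-based inclusive start of the alignment
    end: int     # 1-based inclusive end of the alignment
    strand: str  # "+" or "-"


def classify_bac(a: BesHit, b: BesHit,
                 min_span: int = 50_000, max_span: int = 350_000) -> str:
    """Classify a BAC from the mapped positions of its two end sequences.

    Orientation labels follow the ASSUMED definitions given in the lead-in;
    the span thresholds are illustrative only.
    """
    if a.chrom != b.chrom:
        return "different chromosomes"

    left, right = sorted((a, b), key=lambda hit: hit.start)
    if left.strand == "+" and right.strand == "-":
        orientation = "Tail-Tail"   # ends point towards each other
    elif left.strand == "-" and right.strand == "+":
        orientation = "Head-Head"   # ends point away from each other
    else:
        orientation = "Tail-Head"   # both ends on the same strand

    span = max(a.end, b.end) - min(a.start, b.start) + 1
    if min_span <= span <= max_span:
        return orientation
    return orientation + " Outsize"


if __name__ == "__main__":
    fwd = BesHit("chr23", 1_000, 1_500, "+")
    rev_near = BesHit("chr23", 150_000, 150_600, "-")
    rev_far = BesHit("chr23", 900_000, 900_600, "-")
    print(classify_bac(fwd, rev_near))
    print(classify_bac(fwd, rev_far))
```

In practice the span limits would be derived from the insert size distribution of the BAC library rather than fixed by hand.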




Fig. 3. Overview of chromosome OAR23 from the vsg v2 browser, displaying ovine and bovine BAC mapping, sheep linkage map markers and Pilot SNP Chip SNPs.

Track legend: Ovine Tail-Head BACs; Ovine Head-Head BACs; Ovine Tail-Tail Outsize BACs; Ovine Tail-Head-Outsize BACs; Ovine Head-Head-Outsize BACs; Ovine unpaired BESs; Bovine Tail-Head BACs; Bovine Head-Head BACs; Bovine Tail-Tail Outsize BACs; Bovine Tail-Head-Outsize BACs; Bovine Head-Head-Outsize BACs; Sheep Markers; 1536 Pilot SNPs.

Fig. 4. A 5 Mb segment of the vsg 2 genome assembly of chromosome 23 showing the positions and orientations of sheep BACs and other tracks.

#### **4. Conclusion**

The new vsg2.0 is a significant improvement over vsg1.2, which was built on the human genome framework. Clearly, using the genome of a closely related species and allowing the data from the species of interest to direct the process has an advantage over using a very well assembled, but more distant, genome. At low resolution, down to the level of the BACs, the sheep genome has a very high level of overall conserved synteny with the bovine genome structure. A number of regions of ambiguity remain, but many of these are in regions of ambiguity in the assembly of the bovine genome and therefore await further refinement of the bovine genome assembly, or a predominantly de novo assembly of the sheep genome. However, overall it is clear that the vsg 2 provides a robust framework for assembling the large number of short contigs expected from the sequencing of the sheep genome (Archibald, et al., 2010).

Two assembled genomes from closely related species probably provide the optimal balance between analysis complexity and benefit, with inclusion of a more distant, but much better assembled, genome if the genomes of closely related species are not well assembled. Thus the methods that we have described are very broadly applicable.

#### **5. Acknowledgement**

The authors would like to thank the members of the International Sheep Genomics Consortium (ISGC\_website), in particular Jill Maddox, John McEwan and James Kijas, for useful discussions. The authors also gratefully acknowledge the early pre-publication access under the Fort Lauderdale conventions to the draft equine genome sequence provided by the Broad Institute and to the draft bovine genome sequence provided by the Baylor College of Medicine Human Genome Sequencing Center and the Bovine Genome Sequencing Project Consortium. This work was partly funded by SheepGenomics (a joint venture of Meat and Livestock Australia and Australian Wool Innovation). The work was undertaken as part of the development of sheep genomics tools by the ISGC.

#### **6. References**

AGP File Specification (v. 1.1), Available from http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP\_Specification.shtml

Archibald, A. L., Cockett, N. E., Dalrymple, B. P., Faraut, T., Kijas, J. W., Maddox, J. F., McEwan, J. C., Hutton Oddy, V., Raadsma, H. W., Wade, C., Wang, J., Wang, W. & Xun, X. (2010). The sheep genome reference sequence: a work in progress. *Anim Genet*, Vol. 41, No. 5, pp. 449-453.

Dalrymple, B. P., Kirkness, E. F., Nefedov, M., McWilliam, S., Ratnakumar, A., Barris, W., Zhao, S., Shetty, J., Maddox, J. F., O'Grady, M., Nicholas, F., Crawford, A. M., Smith, T., de Jong, P. J., McEwan, J., Oddy, V. H. & Cockett, N. E. (2007). Using comparative genomics to reorder the human genome sequence into a virtual sheep genome. *Genome Biol*, Vol. 8, No. 7, pp. R152.

Drogemuller, M., Tetens, J., Dalrymple, B., Goldammer, T., Wu, C. H., Cockett, N. E., Leeb, T. & Drogemuller, C. (2008). A comparative radiation hybrid map of sheep chromosome 10. *Cytogenet Genome Res*, Vol. 121, No. 1, pp. 35-40.

Elsik, C. G., Tellam, R. L., Worley, K. C., Gibbs, R. A., Muzny, D. M., Weinstock, G. M., Adelson, D. L., Eichler, E. E., Elnitski, L., Guigo, R., Hamernik, D. L., Kappes, S. M., Lewin, H. A., Lynn, D. J., Nicholas, F. W., Reymond, A., Rijnkels, M., Skow, L. C., Zdobnov, E. M., Schook, L., Womack, J., Alioto, T., Antonarakis, S. E., Astashyn, A., Chapple, C. E., Chen, H. C., Chrast, J., Camara, F., Ermolaeva, O., Henrichsen, C. N., Hlavina, W., Kapustin, Y., Kiryutin, B., Kitts, P., Kokocinski, F., Landrum, M., Maglott, D., Pruitt, K., Sapojnikov, V., Searle, S. M., Solovyev, V., Souvorov, A., Ucla, C., Wyss, C., Anzola, J. M., Gerlach, D., Elhaik, E., Graur, D., Reese, J. T., Edgar, R. C., McEwan, J. C., Payne, G. M., Raison, J. M., Junier, T., Kriventseva, E. V., Eyras, E., Plass, M., Donthu, R., Larkin, D. M., Reecy, J., Yang, M. Q., Chen, L., Cheng, Z., Chitko-McKown, C. G., Liu, G. E., Matukumalli, L. K., Song, J., Zhu, B., Bradley, D. G., Brinkman, F. S., Lau, L. P., Whiteside, M. D., Walker, A., Wheeler, T. T., Casey, T., German, J. B., Lemay, D. G., Maqbool, N. J., Molenaar, A. J., Seo, S., Stothard, P., Baldwin, C. L., Baxter, R., Brinkmeyer-Langford, C. L., Brown, W. C., Childers, C. P., Connelley, T., Ellis, S. A., Fritz, K., Glass, E. J., Herzig, C. T., Iivanainen, A., Lahmers, K. K., Bennett, A. K., Dickens, C. M., Gilbert, J. G., Hagen, D. E., Salih, H., Aerts, J., Caetano, A. R., Dalrymple, B., Garcia, J. F., Gill, C. A., Hiendleder, S. G., Memili, E., Spurlock, D., Williams, J. L., Alexander, L., Brownstein, M. J., Guan, L., Holt, R. A., Jones, S. J., Marra, M. A., Moore, R., Moore, S. S., Roberts, A., Taniguchi, M., Waterman, R. C., Chacko, J., Chandrabose, M. M., Cree, A., Dao, M. D., Dinh, H. H., Gabisi, R. A., Hines, S., Hume, J., Jhangiani, S. N., Joshi, V., Kovar, C. L., Lewis, L. R., Liu, Y. S., Lopez, J., Morgan, M. B., Nguyen, N. B., Okwuonu, G. O., Ruiz, S. J., Santibanez, J., Wright, R. A., Buhay, C., Ding, Y., Dugan-Rocha, S., Herdandez, J., Holder, M., Sabo, A., Egan, A., Goodell, J., Wilczek-Boney, K., Fowler, G. R., Hitchens, M. E., Lozado, R. J., Moen, C., Steffen, D., Warren, J. T., Zhang, J., Chiu, R., Schein, J. E., Durbin, K. J., Havlak, P., Jiang, H., Liu, Y., Qin, X., Ren, Y., Shen, Y., Song, H., Bell, S. N., Davis, C., Johnson, A. J., Lee, S., Nazareth, L. V., Patel, B. M., Pu, L. L., Vattathil, S., Williams, R. L., Jr., Curry, S., Hamilton, C., Sodergren, E., Wheeler, D. A., Barris, W., Bennett, G. L., Eggen, A., Green, R. D., Harhay, G. P., Hobbs, M., Jann, O., Keele, J. W., Kent, M. P., Lien, S., McKay, S. D., McWilliam, S., Ratnakumar, A., Schnabel, R. D., Smith, T., Snelling, W. M., Sonstegard, T. S., Stone, R. T., Sugimoto, Y., Takasuga, A., Taylor, J. F., Van Tassell, C. P., Macneil, M. D., et al. (2009). The genome sequence of taurine cattle: a window to ruminant biology and evolution. *Science*, Vol. 324, No. 5926, pp. 522-528.

Fujita, P. A., Rhead, B., Zweig, A. S., Hinrichs, A. S., Karolchik, D., Cline, M. S., Goldman, M., Barber, G. P., Clawson, H., Coelho, A., Diekhans, M., Dreszer, T. R., Giardine, B. M., Harte, R. A., Hillman-Jackson, J., Hsu, F., Kirkup, V., Kuhn, R. M., Learned, K., Li, C. H., Meyer, L. R., Pohl, A., Raney, B. J., Rosenbloom, K. R., Smith, K. E., Haussler, D. & Kent, W. J. (2011). The UCSC Genome Browser database: update 2011. *Nucleic Acids Res*, Vol. 39, No. Database issue, pp. D876-882.

Goldammer, T., Brunner, R. M., Rebl, A., Wu, C. H., Nomura, K., Hadfield, T., Gill, C., Dalrymple, B. P., Womack, J. E. & Cockett, N. E. (2009a). A high-resolution radiation hybrid map of sheep chromosome X and comparison with human and cattle. *Cytogenet Genome Res*, Vol. 125, No. 1, pp. 40-45.

Goldammer, T., Brunner, R. M., Rebl, A., Wu, C. H., Nomura, K., Hadfield, T., Maddox, J. F. & Cockett, N. E. (2009b). Cytogenetic anchoring of radiation hybrid and virtual maps of sheep chromosome X and comparison of X chromosomes in sheep, cattle, and human. *Chromosome Res*, Vol. 17, No. 4, pp. 497-506.

Goldammer, T., Di Meo, G. P., Luhken, G., Drogemuller, C., Wu, C. H., Kijas, J., Dalrymple, B. P., Nicholas, F. W., Maddox, J. F., Iannuzzi, L. & Cockett, N. E. (2009c). Molecular cytogenetics and gene mapping in sheep (Ovis aries, 2n = 54). *Cytogenet Genome Res*, Vol. 126, No. 1-2, pp. 63-76.

ISGC website, Available from http://www.sheephapmap.org

Kent, W. J. (2002). BLAT--the BLAST-like alignment tool. *Genome Res*, Vol. 12, No. 4, pp. 656-664.

Kijas, J. W., Townley, D., Dalrymple, B. P., Heaton, M. P., Maddox, J. F., McGrath, A., Wilson, P., Ingersoll, R. G., McCulloch, R., McWilliam, S., Tang, D., McEwan, J., Cockett, N., Oddy, V. H., Nicholas, F. W. & Raadsma, H. (2009). A genome wide survey of SNP variation reveals the genetic structure of sheep breeds. *PLoS One*, Vol. 4, No. 3, pp. e4668.

Liftover UCSC genome coordinate conversion files, Available from http://genome.ucsc.edu/cgi-bin/hgLiftOver

Lindblad-Toh, K., Wade, C. M., Mikkelsen, T. S., Karlsson, E. K., Jaffe, D. B., Kamal, M., Clamp, M., Chang, J. L., Kulbokas, E. J., 3rd, Zody, M. C., Mauceli, E., Xie, X., Breen, M., Wayne, R. K., Ostrander, E. A., Ponting, C. P., Galibert, F., Smith, D. R., DeJong, P. J., Kirkness, E., Alvarez, P., Biagi, T., Brockman, W., Butler, J., Chin, C. W., Cook, A., Cuff, J., Daly, M. J., DeCaprio, D., Gnerre, S., Grabherr, M., Kellis, M., Kleber, M., Bardeleben, C., Goodstadt, L., Heger, A., Hitte, C., Kim, L., Koepfli, K. P., Parker, H. G., Pollinger, J. P., Searle, S. M., Sutter, N. B., Thomas, R., Webber, C., Baldwin, J., Abebe, A., Abouelleil, A., Aftuck, L., Ait-Zahra, M., Aldredge, T., Allen, N., An, P., Anderson, S., Antoine, C., Arachchi, H., Aslam, A., Ayotte, L., Bachantsang, P., Barry, A., Bayul, T., Benamara, M., Berlin, A., Bessette, D., Blitshteyn, B., Bloom, T., Blye, J., Boguslavskiy, L., Bonnet, C., Boukhgalter, B., Brown, A., Cahill, P., Calixte, N., Camarata, J., Cheshatsang, Y., Chu, J., Citroen, M., Collymore, A., Cooke, P., Dawoe, T., Daza, R., Decktor, K., DeGray, S., Dhargay, N., Dooley, K., Dorje, P., Dorjee, K., Dorris, L., Duffey, N., Dupes, A., Egbiremolen, O., Elong, R., Falk, J., Farina, A., Faro, S., Ferguson, D., Ferreira, P., Fisher, S., FitzGerald, M., Foley, K., Foley, C., Franke, A., Friedrich, D., Gage, D., Garber, M., Gearin, G., Giannoukos, G., Goode, T., Goyette, A., Graham, J., Grandbois, E., Gyaltsen, K., Hafez, N., Hagopian, D., Hagos, B., Hall, J., Healy, C., Hegarty, R., Honan, T., Horn, A., Houde, N., Hughes, L., Hunnicutt, L., Husby, M., Jester, B., Jones, C., Kamat, A., Kanga, B., Kells, C., Khazanovich, D., Kieu, A. C., Kisner, P., Kumar, M., Lance, K., Landers, T., Lara, M., Lee, W., Leger, J. P., Lennon, N., Leuper, L., LeVine, S., Liu, J., Liu, X., Lokyitsang, Y., Lokyitsang, T., Lui, A., Macdonald, J., Major, J., Marabella, R., Maru, K., Matthews, C., McDonough, S., Mehta, T., Meldrim, J., Melnikov, A., Meneus, L., Mihalev, A., Mihova, T., Miller, K., Mittelman, R., Mlenga, V., Mulrain, L., Munson, G., Navidi, A., Naylor, J., Nguyen, T., Nguyen, N., Nguyen, C., Nicol, R., Norbu, N., Norbu, C., Novod, N., Nyima, T., Olandt, P., O'Neill, B., O'Neill, K., Osman, S., Oyono, L., Patti, C., Perrin, D., Phunkhang, P., Pierre, F., Priest, M., Rachupka, A., Raghuraman, S., Rameau, R., Ray, V., Raymond, C., Rege, F., Rise, C., Rogers, J., Rogov, P., Sahalie, J., Settipalli, S., Sharpe, T., Shea, T., Sheehan, M., Sherpa, N., Shi, J., Shih, D., et al. (2005). Genome sequence, comparative analysis and haplotype structure of the domestic dog. *Nature*, Vol. 438, No. 7069, pp. 803-819.

Liu, H., Jiang, Y., Wang, S., Ninwichian, P., Somridhivej, B., Xu, P., Abernathy, J., Kucuktas, H. & Liu, Z. (2009). Comparative analysis of catfish BAC end sequences with the zebrafish genome. *BMC Genomics*, Vol. 10, No. 592.

Maddox, J. F., Davies, K. P., Crawford, A. M., Hulme, D. J., Vaiman, D., Cribiu, E. P., Freking, B. A., Beh, K. J., Cockett, N. E., Kang, N., Riffkin, C. D., Drinkwater, R., Moore, S. S., Dodds, K. G., Lumsden, J. M., van Stijn, T. C., Phua, S. H., Adelson, D. L., Burkin, H. R., Broom, J. E., Buitkamp, J., Cambridge, L., Cushwa, W. T., Gerard, E., Galloway, S. M., Harrison, B., Hawken, R. J., Hiendleder, S., Henry, H. M., Medrano, J. F., Paterson, K. A., Schibler, L., Stone, R. T. & van Hest, B. (2001). An enhanced linkage map of the sheep genome comprising more than 1000 loci. *Genome Res*, Vol. 11, No. 7, pp. 1275-1289.

NCBI RefSeq collection, Available from http://www.ncbi.nlm.nih.gov/RefSeq/index.html

Ratnakumar, A., Barris, W., McWilliam, S., Brauning, R., McEwan, J. C., Snelling, W. M. & Dalrymple, B. P. (2009). A multiway analysis for identifying high integrity bovine BACs. *BMC Genomics*, Vol. 10, No. 46.

Ratnakumar, A., Kirkness, E. F. & Dalrymple, B. P. (2010a). Quality control of the sheep bacterial artificial chromosome library, CHORI-243. *BMC Res Notes*, Vol. 3, No. 334.

Ratnakumar, A., McWilliam, S., Barris, W. & Dalrymple, B. P. (2010b). Using paired-end sequences to optimise parameters for alignment of sequence reads against related genomes. *BMC Genomics*, Vol. 11, No. 458.

Snelling, W. M., Chiu, R., Schein, J. E., Hobbs, M., Abbey, C. A., Adelson, D. L., Aerts, J., Bennett, G. L., Bosdet, I. E., Boussaha, M., Brauning, R., Caetano, A. R., Costa, M. M., Crawford, A. M., Dalrymple, B. P., Eggen, A., Everts-van der Wind, A., Floriot, S., Gautier, M., Gill, C. A., Green, R. D., Holt, R., Jann, O., Jones, S. J., Kappes, S. M., Keele, J. W., de Jong, P. J., Larkin, D. M., Lewin, H. A., McEwan, J. C., McKay, S., Marra, M. A., Mathewson, C. A., Matukumalli, L. K., Moore, S. S., Murdoch, B., Nicholas, F. W., Osoegawa, K., Roy, A., Salih, H., Schibler, L., Schnabel, R. D., Silveri, L., Skow, L. C., Smith, T. P., Sonstegard, T. S., Taylor, J. F., Tellam, R., Van Tassell, C. P., Williams, J. L., Womack, J. E., Wye, N. H., Yang, G. & Zhao, S. (2007). A physical map of the bovine genome. *Genome Biol*, Vol. 8, No. 8, pp. R165.

Soler, L., Conte, M. A., Katagiri, T., Howe, A. E., Lee, B. Y., Amemiya, C., Stuart, A., Dossat, C., Poulain, J., Johnson, J., Di Palma, F., Lindblad-Toh, K., Baroiller, J. F., D'Cotta, H., Ozouf-Costaz, C. & Kocher, T. D. (2010). Comparative physical maps derived from BAC end sequences of tilapia (Oreochromis niloticus). *BMC Genomics*, Vol. 11, No. 636.

Stein, L. D., Mungall, C., Shu, S., Caudy, M., Mangone, M., Day, A., Nickerson, E., Stajich, J. E., Harris, T. W., Arva, A. & Lewis, S. (2002). The generic genome browser: a building block for a model organism system database. *Genome Res*, Vol. 12, No. 10, pp. 1599-1610.

UCSC Genome Bioinformatics Website, Available from http://genome.ucsc.edu/

Virtual sheep genome browser, Available from http://www.livestockgenomics.csiro.au/perl/gbrowse.cgi/vsheep2/

Wade, C. M., Giulotto, E., Sigurdsson, S., Zoli, M., Gnerre, S., Imsland, F., Lear, T. L., Adelson, D. L., Bailey, E., Bellone, R. R., Blocker, H., Distl, O., Edgar, R. C., Garber, M., Leeb, T., Mauceli, E., MacLeod, J. N., Penedo, M. C., Raison, J. M., Sharpe, T., Vogel, J., Andersson, L., Antczak, D. F., Biagi, T., Binns, M. M., Chowdhary, B. P., Coleman, S. J., Della Valle, G., Fryc, S., Guerin, G., Hasegawa, T., Hill, E. W., Jurka, J., Kiialainen, A., Lindgren, G., Liu, J., Magnani, E., Mickelson, J. R., Murray, J., Nergadze, S. G., Onofrio, R., Pedroni, S., Piras, M. F., Raudsepp, T., Rocchi, M., Roed, K. H., Ryder, O. A., Searle, S., Skow, L., Swinburne, J. E., Syvanen, A. C., Tozaki, T., Valberg, S. J., Vaudin, M., White, J. R., Zody, M. C., Lander, E. S. & Lindblad-Toh, K. (2009). Genome sequence, comparative analysis, and population genetics of the domestic horse. *Science*, Vol. 326, No. 5954, pp. 865-867.

Wu, C. H., Nomura, K., Goldammer, T., Hadfield, T., Dalrymple, B. P., McWilliam, S., Maddox, J. F., Womack, J. E. & Cockett, N. E. (2008). A high-resolution comparative radiation hybrid map of ovine chromosomal regions that are homologous to human chromosome 6 (HSA6). *Anim Genet*, Vol. 39, No. 5, pp. 459-467.

Wu, C. H., Jin, W., Nomura, K., Goldammer, T., Hadfield, T., Dalrymple, B. P., McWilliam, S., Maddox, J. F. & Cockett, N. E. (2009). A radiation hybrid comparative map of ovine chromosome 1 aligned to the virtual sheep genome. *Anim Genet*, Vol. 40, No. 4, pp. 435-455.

Zimin, A. V., Delcher, A. L., Florea, L., Kelley, D. R., Schatz, M. C., Puiu, D., Hanrahan, F., Pertea, G., Van Tassell, C. P., Sonstegard, T. S., Marcais, G., Roberts, M., Subramanian, P., Yorke, J. A. & Salzberg, S. L. (2009). A whole-genome assembly of the domestic cow, Bos taurus. *Genome Biol*, Vol. 10, No. 4, pp. R42.

**19**

**Basidiomycetes Telomeres – A Bioinformatics Approach**

Lucía Ramírez, Gúmer Pérez, Raúl Castanera, Francisco Santoyo and Antonio G. Pisabarro *Genetics and Microbiology Research Group, Public University of Navarre, Pamplona, Spain*

#### **1. Introduction**

#### **1.1 The telomere: A complex nucleoprotein complex with a broad range of functions**

Telomeres (from the Greek *télos*, end, and *meros*, part) are the genetic structures found at the physical ends of linear chromosomes. They are nucleoprotein complexes composed of DNA repeats and a myriad of telomere and non-telomere associated proteins that protect the ends of eukaryotic chromosomes from being recognized as double-strand breaks and prevent chromosome end degradation by nucleases and non-canonical chromosome end fusions. Thus, telomeres are essential for chromosome integrity (Hande, 2004; Paeschke et al., 2010; Zakian, 1995). The story of telomere biology began with the pioneering work of Elizabeth H. Blackburn, who discovered that *Tetrahymena* telomeres consist of a short DNA sequence motif repeated several times at the chromosomal end (Blackburn & Gall, 1978). This pattern is conserved in lower eukaryotes and in mammalian cells (Greider, 1998). Notable exceptions are *Drosophila* and some other dipterans, which instead possess tandem arrays of retrotransposons at their chromosome ends (Abad et al., 2004).
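Because telomeres are arrays of a short motif repeated at the chromosome end, a first bioinformatic check on an assembled sequence is simply to count tandem copies of the motif near each end. The Python sketch below illustrates this, taking the hexamer TTAGGG as the repeat unit. It is a minimal illustration under stated assumptions: the function names `revcomp`, `terminal_repeat_count` and `scan_ends`, the 200-base search window and the toy sequence are all hypothetical, and only exact, uninterrupted copies are counted. On the forward strand the motif is expected at the right-hand end and its reverse complement (CCCTAA) at the left-hand end, which is why the left end is examined on the reverse-complemented sequence.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an A/C/G/T sequence (case-insensitive input)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def terminal_repeat_count(seq: str, motif: str = "TTAGGG", window: int = 200) -> int:
    """Number of copies in the longest exact tandem run of `motif`
    found within the last `window` bases of `seq`."""
    tail = seq[-window:].upper()
    run = re.compile("(?:%s)+" % re.escape(motif))
    longest = max((m.end() - m.start() for m in run.finditer(tail)), default=0)
    return longest // len(motif)


def scan_ends(seq: str, motif: str = "TTAGGG", window: int = 200):
    """Return (left_copies, right_copies) of the telomere motif at both ends."""
    right = terminal_repeat_count(seq, motif, window)
    # The left end carries the motif on the opposite strand.
    left = terminal_repeat_count(revcomp(seq), motif, window)
    return left, right


if __name__ == "__main__":
    # Toy sequence: 5 copies at the left end, 4 copies at the right end.
    toy = "CCCTAA" * 5 + "ACGTACGTGA" * 3 + "TTAGGG" * 4
    print(scan_ends(toy))
```

Real telomere arrays often contain degenerate copies, so a practical scan would normally tolerate mismatches; exact matching is used here only to keep the sketch short.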

Telomere DNA consists of tandem arrays of short repeated sequences forming a cap. Telomere length is species-specific, and small variations between cell types have been observed. For reviews, see Fisher & Zakian (2005) and Sanchez-Alonso & Guzman (2008). The basic telomere DNA repeat unit is the hexamer TTAGGG, in which the strand running 5′→3′ towards the chromosome end is usually guanine-rich. This G-rich strand extends beyond its complementary strand and folds back on itself to form a telomere DNA loop (T-loop) (Griffith et al., 1999), which protects the structure from being recognized as a double-strand break by sequestering the 3′-overhang into a higher-order DNA structure. The G-rich strand also serves as an anchor for a telomere-dedicated reverse transcriptase, called telomerase, that compensates for the inability of DNA polymerases to replicate the 5′ ends of linear chromosomes (Blackburn & Gall, 1978). Telomerase binds the G-rich strand by complementary pairing of the protruding DNA sequence with the telomerase RNA subunit and, as a result, elongates the overhang by adding telomere sequence repeats (Masutomi et al., 2003; Morin, 1989; Zhao et al., 2009). The T-loop structure is maintained by a complex of telomere and non-telomere proteins called shelterins, which repress the DNA repair machinery at telomeres and regulate telomere length (for review see de Lange, 2005; Palm & de Lange, 2008; Rhodes et al., 2002; Vega et al., 2003; Zhao et al., 2009). The shelterin

Telomeres were long considered to be regions where the transcription of structural genes is repressed (Mondoux & Zakian, 2005), although it has recently been reported that telomere repeats and subtelomere regions can be transcribed (Azzalin et al., 2007; Luke & Lingner, 2009; Luke et al., 2008; Schoeftner & Blasco, 2008). The transcript, called TERRA (telomere repeat-containing RNA), forms an integral component of telomere heterochromatin and consists of non-coding G-rich RNAs transcribed from the telomere C-rich strand in mammals and fungi (Azzalin et al., 2007; Luke et al., 2008; Sanchez-Alonso & Guzman, 2008; Schoeftner & Blasco, 2008). TERRA transcription occurs at most or all chromosome ends and is regulated by RNA surveillance factors and in response to changes in telomere length. The accumulation of TERRA at telomeres can also interfere with telomere replication (Azzalin & Lingner, 2007; Luke & Lingner, 2009; Schoeftner & Blasco,

The particular sequence organization of telomeres makes their genetic mapping and sequencing difficult, because they are recalcitrant to cloning and underrepresented in genetic maps and in final genome assemblies. Consequently, their cloning and characterization must be carried out using dedicated molecular and bioinformatics strategies (Perez et al., 2009; Sanchez-Alonso &

Human subtelomere chromosome regions contain complex and dynamic stretches of DNA which, together with their associated proteins, are essential for genome stability and proper chromosome replication (Riethman et al., 2005). Subtelomere DNA repeats are a complex region of variable size segmentally duplicated containing low copy DNA repetitive tracts adjacent to the telomere. These duplications could be found only at the subtelomere regions, although it is common to find them also at pericentromeric and interstitial chromosomal loci

In humans, subtelomere DNA regions are operationally defined as the terminal 500 Kbp of each euchromatic chromosome arm. These regions contain subtelomere repeats (Srpts), segmental duplications, satellite sequences, and internal (TTAGGG)n-like sequences (Riethman et al., 2004). The organization of the subtelomere region is structurally conserved across eukaryotes (Anderson et al., 2008; Brown et al., 1990a; Brown et al., 1990b; Chan & Tye, 1983a; 1983b; Flint et al., 1997a; Flint et al., 1997b; Karpen & Spradling, 1992; Levis, 1993; Louis, 1995; Louis & Borts, 1995; Mefford & Trask, 2002; Pryde et al., 1997; Pryde & Louis, 1997; Walter et al., 1995; Wilkie et al., 1991). This region, susceptible to hypermethylation has been recently shown to have a central function in mammalian telomere-length homeostasis (Blasco, 2007). Subtelomere repeats are characterized by their high level of polymorphism among different chromosome ends and among individuals of the same species. This polymorphism, possibly indicative of a quick and dynamic sequence turnover, leads to a lack of relationship among subtelomere repeats across species. Nonhomologous or ectopic exchange between subtelomere regions of different chromosomes has been reported as a possible reason of polymorphisms in both yeast and humans (Linardopoulou et al., 2005; Louis & Haber, 1990; Mefford & Trask, 2002). In humans, there are more re-arrangements at the sub-telomere regions than in the rest of the genome. This is also true in some lower eukaryotes such as *Plasmodium falciparum* (Freitas-Junior et al., 2000), *Magnaporthe oryzae* (Rehmeyer et al., 2006) and *Neurospora crassa* (Wu et al., 2009)*,* among others. In all these cases genes involved in niche adaptation (species-

2009).

Guzman, 2008).

(Riethman et al., 2001).

**1.2 Subtelomere chromosome regions** 

complex is formed by a core of six proteins including the Myb-type homeodomain TRF proteins in mammals, Rap1 in *Saccharomyces cerevisiae* and Taz1 in *Schizosaccharomyces pombe* which bind the duplex form of the telomere repeats, the OB-fold containing protein POT1 in mammals and *S. pombe* and Cdc13 in *S. cerevisiae* which bind the single-stranded telomere 3'overhang and by other proteins associated via protein–protein interactions with them (Rhodes et al., 2002; Vega et al., 2003; Zhao et al., 2009). This shelterin complex is evolutionary conserved although some differences between species appear in protein numbers and in its higher order structure (Linger & Price, 2009). The G-rich single stranded telomere tail is also able to form a secondary DNA order structure resulting from intra and intermolecular G-quadruplex (Fry, 2007; Maizels, 2006). G-quadruplexes are stacked associations of G-quartets, which are themselves planar assemblies of four Hoogsteenbonded guanines, with the guanines derived from one or more nucleic acid strands (De Cian et al., 2008; Johnson et al., 2008). These structures have been observed in lower eukaryotes (Paeschke et al., 2008; Paeschke et al., 2005; Schaffitzel et al., 2001) and have the potential to regulate telomerase activity (Oganesian et al., 2006; Zahler et al., 1991).

Secondary DNA structures such as G-quadruplexes and the T-loop may contribute to telomere function, but they pose an obstacle to semi-conservative and telomerase-mediated replication that must be resolved to avoid telomere shortening. Telomeres shorten during every cell cycle because of incomplete replication of the lagging strand (the so-called "end replication problem"), resulting in cumulative telomere attrition during aging. In addition, telomere DNA is lost through post-replicative degradation of the 5´ strand, which generates long 3´ G-rich overhangs (Wellinger et al., 1996; Wellinger et al., 1993). In most species, the loss of telomere DNA is counteracted by the action of telomerase, which carries its own RNA template coding for the telomere repeat sequence (Chan & Blackburn, 2004). The complementary C-rich strand is then synthesized by conventional RNA-primed DNA replication (Gilson & Geli, 2007; Verdun & Karlseder, 2007). Following replication, the telomeres created by leading-strand synthesis are either blunt-ended or left carrying a small 50 bp overhang, whereas those created by lagging-strand synthesis have a 3´ overhang whose length is determined by the position of the outermost RNA primer (de Lange, 2009). This underscores the importance of telomerase activity for genome integrity. When telomeres reach a critical minimal length they become uncapped. This leads to a permanent cell cycle arrest (termed cellular senescence) or to apoptosis, depending on the cellular context in which the uncapping occurs (Aubert & Lansdorp, 2008; Blasco, 2005). Extreme telomere shortening leads to chromosome instability, end-to-end fusions, and checkpoint-mediated cell cycle arrest and/or apoptosis (reviewed in Aubert & Lansdorp, 2008, and in Shore & Bianchi, 2009).
Together, these processes are related in mammals not only to aging but also to several age-associated diseases such as tumorigenesis, coronary artery disease, and heart failure (Donate & Blasco, 2011; Ogami et al., 2004; Sherr & McCormick, 2002; Starr et al., 2007). In addition to the role of telomerase in maintaining telomere length, homologous recombination (HR) has been described as an alternative mechanism (ALT, "alternative lengthening of telomeres") for maintaining telomere DNA in telomerase-deficient cells, whose telomeres are highly heterogeneous in length. This mechanism was described in *S. cerevisiae* and consists of two pathways that depend on different recombination proteins and use different telomere sequences as substrates for recombination. Cancer and immortalized cells can utilize the ALT mechanism to maintain telomere length (Lundblad & Blackburn, 1993; Teng et al., 2000; Teng & Zakian, 1999).

Telomeres were long considered regions where the transcription of structural genes is repressed (Mondoux & Zakian, 2005), although it has recently been reported that telomere repeats and subtelomere regions can be transcribed (Azzalin et al., 2007; Luke & Lingner, 2009; Luke et al., 2008; Schoeftner & Blasco, 2008). These transcripts, called TERRA (telomere repeat-containing RNA), are non-coding G-rich RNAs transcribed from the telomere C-rich strand in mammals and fungi, and they form an integral component of telomere heterochromatin (Azzalin et al., 2007; Luke et al., 2008; Sanchez-Alonso & Guzman, 2008; Schoeftner & Blasco, 2008). TERRA transcription occurs at most or all chromosome ends and is regulated by RNA surveillance factors and in response to changes in telomere length. The accumulation of TERRA at telomeres can also interfere with telomere replication (Azzalin & Lingner, 2007; Luke & Lingner, 2009; Schoeftner & Blasco, 2009).

The particular sequence organization of telomeres makes their genetic mapping and sequencing difficult, because they are recalcitrant to cloning and underrepresented in maps and in final genome assemblies. Consequently, their cloning and characterization require dedicated molecular and bioinformatics strategies (Perez et al., 2009; Sanchez-Alonso & Guzman, 2008).

#### **1.2 Subtelomere chromosome regions**

Human subtelomere chromosome regions contain complex and dynamic stretches of DNA which, together with their associated proteins, are essential for genome stability and proper chromosome replication (Riethman et al., 2005). Subtelomere DNA repeats form a complex, segmentally duplicated region of variable size that contains low-copy repetitive DNA tracts adjacent to the telomere. These duplications may be restricted to the subtelomere regions, although they are also commonly found at pericentromeric and interstitial chromosomal loci (Riethman et al., 2001).

In humans, subtelomere DNA regions are operationally defined as the terminal 500 kbp of each euchromatic chromosome arm. These regions contain subtelomere repeats (Srpts), segmental duplications, satellite sequences, and internal (TTAGGG)n-like sequences (Riethman et al., 2004). The organization of the subtelomere region is structurally conserved across eukaryotes (Anderson et al., 2008; Brown et al., 1990a; Brown et al., 1990b; Chan & Tye, 1983a; 1983b; Flint et al., 1997a; Flint et al., 1997b; Karpen & Spradling, 1992; Levis, 1993; Louis, 1995; Louis & Borts, 1995; Mefford & Trask, 2002; Pryde et al., 1997; Pryde & Louis, 1997; Walter et al., 1995; Wilkie et al., 1991). This region, which is susceptible to hypermethylation, has recently been shown to have a central function in mammalian telomere-length homeostasis (Blasco, 2007). Subtelomere repeats are characterized by a high level of polymorphism among different chromosome ends and among individuals of the same species. This polymorphism, possibly indicative of a quick and dynamic sequence turnover, results in little sequence relationship among the subtelomere repeats of different species. Nonhomologous or ectopic exchange between subtelomere regions of different chromosomes has been reported as a possible cause of polymorphism in both yeast and humans (Linardopoulou et al., 2005; Louis & Haber, 1990; Mefford & Trask, 2002). In humans, there are more rearrangements at the subtelomere regions than in the rest of the genome. This is also true in some lower eukaryotes such as *Plasmodium falciparum* (Freitas-Junior et al., 2000), *Magnaporthe oryzae* (Rehmeyer et al., 2006) and *Neurospora crassa* (Wu et al., 2009), among others. In all these cases, genes involved in niche adaptation (species-specific genes) were found in the subtelomeric regions. It could be that these organisms exploit the high evolutionary potential of the subtelomeric regions to create variability that helps them avoid detection by the host. In fungi, the genes most frequently found in subtelomere regions are transposons, telomere-linked RecQ helicases, secondary-metabolite clusters, cytochrome oxidases, hydrolases, molecular transporters, and genes encoding secreted proteins (Perez et al., 2009). These genes could undergo transcriptional silencing (subtelomere silencing) owing to their close proximity to telomeres.
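The operational definition of human subtelomeres given above (the terminal 500 kbp of each chromosome arm) translates directly into coordinate windows. The following is only a minimal sketch: the function name, the `Window` type and the rule that caps each window at half the chromosome length are illustrative assumptions, and the sketch ignores the telomere repeat tract and the euchromatin boundary.

```python
from typing import NamedTuple, Tuple

SUBTELOMERE_BP = 500_000  # operational size quoted in the text


class Window(NamedTuple):
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def subtelomere_windows(chrom_length: int,
                        size: int = SUBTELOMERE_BP) -> Tuple[Window, Window]:
    """Return the (left, right) terminal windows of a linear chromosome."""
    if chrom_length <= 0:
        raise ValueError("chromosome length must be positive")
    # Assumption: on short sequences the two windows are capped at half the
    # length so that they never overlap.
    span = min(size, chrom_length // 2)
    left = Window(0, span)
    right = Window(chrom_length - span, chrom_length)
    return left, right
```

Any feature whose coordinates fall inside one of the two returned windows would count as subtelomeric under this simplified definition.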

#### **1.3 Interstitial telomere repeats**

The various chemical modifications occurring at the amino-terminal end of the histones affect the structure of chromatin and help to establish the functional and structural domains known as euchromatin and heterochromatin. Euchromatin is an open form of chromatin that allows transcription factors to access and transcriptionally activate their target genes. It is largely occupied by housekeeping genes, and is condensed during metaphase and decondensed during interphase. Heterochromatin, in contrast, remains condensed during interphase.

Heterochromatin has been often said to be "poor in genes" and mainly constituted by repetitive DNA sequences. Moreover, since it is highly condensed and inaccessible to transcription factors, heterochromatin is generally transcriptionally silent (Hernandez-Rivas et al., 2010). Heterochromatin appears as blocks spread over the chromosomes when they are stained with Giemsa dye. The molecular analysis of heterochromatic blocks reveals sequences similar to the telomere repeats that are called in this case interstitial telomere repeat sequences or ITRs. These sequences include those repeats located close to the centromere and those found at interstitial sites, i.e., between the centromere and the telomeres (Meyne et al., 1990; Slijepcevic et al., 1996). ITRs were described in plants, animals and humans (Bolzan & Bianchi, 2006; Uchida et al., 2002; Welchen & Gonzalez, 2005). At the chromosome level, ITRs can be detected either by using the Fluorescence *in situ* hybridization (FISH) technique with a DNA or a peptide nucleic acid (PNA) pan-telomere probe (i.e., a probe that identifies simultaneously all of the telomeres in a metaphase cell), or by the primed *in situ* labeling (PRINS) reaction using an oligonucleotide primer complementary to the telomere DNA repeated sequence (Bolzan & Bianchi, 2006).

The length and location of the heterochromatic blocks in chromosomes are variable (Azzalin et al., 2001; Faravelli et al., 1998; Weber et al., 1990), as is their origin. The presence of ITRs in the heterochromatic blocks is nevertheless interpreted as the result either of tandem telomere–telomere fusions during evolution (Hastie & Allshire, 1989; Holmquist & Dancis, 1979; Meyne et al., 1990) or of the insertion of telomere DNA into unstable genome sites (recombination hotspots) during the repair of double-strand DNA breaks (DSBs) (Azzalin et al., 2001). The presence of some relatively small ITRs flanked by unstable AT-rich DNA sequences could support the latter hypothesis (Faravelli et al., 2002). On the other hand, telomere associations and fusions are common cytogenetic findings that have been implicated in the initiation of chromosome instability and tumorigenesis (Callen & Surralles, 2004; Murnane & Sabatier, 2004; Soler et al., 2005). Telomere fusions result from telomere dysfunction due to attrition of chromosome ends (Maser & DePinho, 2004). They are usually found in repair- and/or telomerase-deficient cells (Bailey et al., 1999; Blasco et al., 1997; Hande, 2004; Hande et al., 1999; Lo et al., 2002; Samper et al., 2000) with a variety of mutations affecting telomere function, including those occurring in proteins of the shelterin complex (Bailey et al., 1999; Goytisolo et al., 2001). Mammalian cells show a high frequency of end-to-end telomere fusions and chromosome instability (Bailey et al., 2004; Bailey et al., 1999; Espejel et al., 2002; Hsu et al., 2000; Smogorzewska et al., 2002; Takai et al., 2003; van Steensel et al., 1998). The FISH technique allows the identification of metacentric-submetacentric and acrocentric-telocentric chromosome telomere fusions, also known as Robertsonian-like configurations (Al-Wahiby et al., 2005; Hande, 2004). The occurrence of telomere–telomere associations has been suggested to play a role in nuclear organization (Nagele et al., 2001). In fact, telomere associations were seen in metaphases of human cells with shortened telomeres, suggesting that a minimal telomere length is required for proper chromosome function during mitosis.

#### **2. Fungal telomeres - A bioinformatics approach**

The basic and conserved telomere unit sequence in most filamentous fungi is TTAGGG. This sequence has been described in *Aspergillus nidulans* (Bhattacharyya & Blackburn, 1997), *Beauveria bassiana* (Padmavathi et al., 2003), *Botrytis cinerea* (Levis et al., 1997), *Cladosporium fulvum* (Coleman et al., 1993), *Fusarium oxysporum* (Inglis et al., 2000), *Glomus intraradices* (Hijri et al., 2007), *Magnaporthe grisea* (Gao et al., 2002), *Metarrhizium anisopliae* (Inglis et al., 2005), *N. crassa* (Wu et al., 2009), *Pestalotiopsis microspora* (Long et al., 1998), *Pleurotus ostreatus* (Perez et al., 2009), *Pneumocystis carinii* (Keely et al., 2001), and *Ustilago maydis* (Sanchez-Alonso & Guzman, 2008). However, variations of this sequence can be found in other fungi such as *A. oryzae*, which has dodecanucleotide telomere repeats (Kusumoto et al., 2003). Incomplete and imperfect telomere units have been reported in *A. oryzae*, *Candida albicans*, *Kluyveromyces lactis*, *S. cerevisiae* and *S. pombe*.
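Because the TTAGGG unit is shared by so many filamentous fungi, a first pass over a draft assembly can simply look for short perfect runs of the unit, or of its reverse complement CCCTAA, near contig ends. The sketch below is an illustration only: the function name, the three-copy minimum and the 200 bp end window are assumptions chosen for the example, and perfect-match scanning will miss the incomplete or imperfect units mentioned above.

```python
import re

# TTAGGG is the G-rich unit; CCCTAA is its reverse complement, which is what
# appears when the same repeat is read from the opposite strand.
_PATTERNS = {
    "TTAGGG": re.compile(r"(?:TTAGGG){3,}"),
    "CCCTAA": re.compile(r"(?:CCCTAA){3,}"),
}


def find_terminal_repeats(seq, window=200):
    """Return (unit, start, end, copies) for each run of at least three
    perfect units that begins or ends within `window` bp of a contig end."""
    seq = seq.upper()
    hits = []
    for unit, pattern in _PATTERNS.items():
        for match in pattern.finditer(seq):
            near_start = match.start() < window
            near_end = match.end() > len(seq) - window
            if near_start or near_end:
                copies = (match.end() - match.start()) // len(unit)
                hits.append((unit, match.start(), match.end(), copies))
    return sorted(hits, key=lambda hit: hit[1])
```

Coordinates are 0-based with an exclusive end, so `seq[start:end]` recovers the matched repeat tract.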

Two DNA sequence domains can be found adjacent to the telomere repeats. One of them, distal, lies next to the telomere and contains tandem repeat motifs. The other, proximal, is interstitial, contains fewer repeated sequences and carries clusters of related genes (Pryde et al., 1997). In several fungi, an increased number of proteins involved in interactions with the environment has been found to be encoded by genes mapping close to the telomeres. These genes, called "contingency genes", are dispensable for survival and highly variable in populations. The accumulation of these genes near the telomeres is a strategy that allows fungi to adapt to new environments. In fact, it has been observed in *S. cerevisiae* that the genes located near the telomeres vary in amplification and/or expression depending on the niche in which the yeast grows. Some of these genes belong to the PAU family, the largest gene family in *S. cerevisiae* (23 members), whose regulation depends on the environmental growth conditions (anaerobiosis) (Rachidi et al., 2000). Other telomere-associated genes in *S. cerevisiae* are MAL and MEL, which participate in the maltose and melibiose fermentation used in the baking and brewing industries (Gibson et al., 1997; Teunissen & Steensma, 1995), and FLO, which encodes cell-wall glycoproteins that participate in the regulation of cellular adhesion (Gibson et al., 1997; Halme et al., 2004; Teunissen & Steensma, 1995). Similarly, the TLO family, unique to *C. albicans*, consists of 15 members present on every chromosome, 14 of which are located at chromosome ends. Genome comparisons between *C. albicans* and *Candida dubliniensis* showed that the principal disparity in gene content between the two species resides in the lack of the TLO genes in the latter. CdTLO1 null strains show a major reduction in hyphal formation in response to serum, which can be reversed by complementation with either of two *C. albicans* TLO genes (Jackson et al., 2009).

approach allowed our group to map 19 out of the 22 chromosome ends expected in its linkage map (Perez et al., 2009), as well as to study the telomere-adjacent regions. Similar strategies have been described by different authors (Rehmeyer et al., 2006; Sanchez-Alonso & Guzman, 2008; Wu et al., 2009). The search for telomere regions was performed in draft assemblies of the whole genome sequence. These preliminary genome sequence versions were incomplete and contained truncated genes and misassemblies. The rationale for using them is that final genome assembly strategies very often alter or eliminate the repetitive telomere and subtelomere sequences from the fully assembled genomes. We used the open-access Tandem Repeats Finder program (TRFp, http://tandem.bu.edu/trf/trf.html) to screen for repetitive telomere sequences in more than 6,200 contigs of the 4X coverage draft sequence assembly of *P. ostreatus* PC15 produced by the Joint Genome Institute (http://genome.jgi-psf.org/PleosPC15\_1/PleosPC15\_1.home.html). The TRFp locates and displays tandem repeats in a DNA sequence file submitted in FASTA format, without the need to specify the repetitive pattern, its size or any other parameter. The TRFp output consists of two files: a repeat table and an alignment. The repeat table contains information about each repeat: its size, copy number, nucleotide content and location. Clicking on the location indices of the table entries opens a second browser window that shows an alignment of the copies against a consensus pattern. TRFp is a very fast program that permits the analysis of sequences of up to 5 Mb. Repeats with a pattern size from one to 2000 bases can be detected (Benson, 1999). In each scaffold containing telomere sequence, we identified the filtered gene models, annotated automatically and manually, that lie within 50 kbp of the telomere repetitive sequence.
This was done by manually inspecting each predicted gene and using it as a query against the non-redundant NCBI database with the BlastX program (Altschul et al., 1997). BlastX results were considered significant if their expected value (E-value) was below 1e-20. The identified protein sequences were also used to query the online Pfam database using default parameters (Bateman et al., 2004). A similar approach has also been used by Wu et al. (2009) to search for gene-containing subtelomere regions in *N. crassa*. This type of multiple analysis (genetic, molecular, and bioinformatic) allowed the characterization of most of the *P. ostreatus* telomeres as well as several subtelomere regions that show high nucleotide similarity. The highly polymorphic subtelomere region of *P. ostreatus* chromosome six contains, apart from genes similar to those described in other eukaryotic organisms (RecQ helicases), a species-specific laccase gene cluster (six of the 12 genes annotated in the genome) (Perez et al., 2009).

**2.1 Analysis of the basidiomycetes' telomere regions** 

different genomes.

In conclusion, the assemblage of telomere regions by bioinformatics strategies is a powerful tool to determine the arrangement of genomes in putative linkage groups in species with no genetic maps available, to establish synteny among different basidiomycete genomes, and to determine the presence of genes and gene clusters conserved in the subtelomere regions of

In the following sections we will review the composition and structure of the telomere and subtelomere regions of the different basidiomycetes using the bioinformatics approach described above. We will use the genome sequence data that are publicly available at the Joint Genome Institute (DOE-JGI, http://www.jgi.doe.gov/). This institute has been developing an intensive sequencing effort on different fungi related with the biological lignocellulose degradation. Lignin is a complex recalcitrant macromolecule that hinders the access of enzymes to cellulose. The enzymatic removal of lignin will permit the access to this

In human parasites, several families of well-characterized or putative virulence factors have been described at chromosome ends. In *Candida glabrata*, nearly 24 adhesin-encoding genes were located at telomere regions (De Las Penas et al., 2003), and in *P. carinii*, clusters of major surface antigen genes have been bioinformatically predicted at every chromosome end (Keely et al., 2005). A similar situation was reported in *P. falciparum* and *Trypanosoma brucei*, where families of surface antigen proteins that induce immune responses are encoded in telomere regions. This arrangement has been hypothesized to be part of the pathogen's mechanism to amplify and diversify surface antigen genes and thus avoid host recognition (Barry et al., 2003).

To determine whether plant pathogenic fungi use a similar mechanism to avoid host defenses, Farman's group analyzed and characterized the telomere organization in the rice blast pathogen *M. oryzae* (Rehmeyer et al., 2006) and in *N. crassa* (Wu et al., 2009). The molecular and bioinformatics approach allowed them to identify the 14 chromosome ends in both species. In *M. oryzae*, the analysis of these sequences reveals the presence of a clearly defined distal subtelomere domain that contains a telomere-linked helicase (TLH) gene. No gene duplication near the chromosome termini is observed, so a proximal subtelomere domain cannot be detected. The sequenced *N. crassa* genome (Galagan et al., 2003) contains very little intact, duplicated DNA due to the repeat-induced point mutation (RIP) process (Selker, 1990). This made it unlikely that *N. crassa* would possess intact subtelomere domains or terminal gene duplications. The search for tandem repeats at the ends of those chromosomes whose sequences extend beyond the telomere repeats shows a lack of conserved subtelomere tandem repeats. A similar situation is observed when the sequences immediately adjacent to the TTAGGG repeats are compared between strains. Consistent with the absence of distinct subtelomere elements, *N. crassa* lacks the TLH genes that are present in the subtelomere regions of diverse fungi (Gao et al., 2002; Inglis et al., 2005; Louis & Haber, 1992; Mandell et al., 2005; Perez et al., 2009; Rehmeyer et al., 2006; Sanchez-Alonso & Guzman, 1998). As in other fungi, the terminal regions of *N. crassa* chromosomes ferry genes related to secondary metabolism, such as a monooxygenase (CYP450), a FAD-binding domain containing protein, a second CYP450, an O-methyltransferase, a polyketide synthase, a major facilitator superfamily efflux pump, a putative transcription factor, and an oxidoreductase. Apart from the clusters of secondary metabolite genes, no genes were overrepresented at the chromosome ends, except those predicted to code for enzymes related to plant cell-wall degradation.

In basidiomycetes, although the genomes of an ever-growing number of genera involved in lignin degradation, enzyme production, and biopulping have been sequenced and annotated, the information on the characteristics of their telomere and subtelomere regions is rather limited. As discussed above, this is due to the particular characteristics of telomeres, which make them refractory to cloning and difficult to sequence and assemble in whole genome sequencing projects. A consequence of the difficulty in cloning telomeres is that, for most organisms, there is limited information on the organization of chromosome ends. Data about basidiomycete telomeres are available for *U. maydis* (Sanchez-Alonso & Guzman, 1998), *M. anisopliae* (Inglis et al., 2005) and *P. ostreatus* (Perez et al., 2009). In the three cases, the conserved telomere repetitive unit is TTAGGG, and the number of tandem repetitions varies among species: 37 in *U. maydis*, from 18 to 26 in *M. anisopliae*, and from 25 to 150 in *P. ostreatus*. In all these cases, genes coding for RecQ helicases have been found adjacent to the telomere regions.


In *P. ostreatus*, a lignin-degrading edible mushroom, the analysis of telomere organization was carried out with a combination of genetic, molecular, and bioinformatics tools. This approach allowed our group to map 19 out of the 22 chromosome ends expected in its linkage map (Perez et al., 2009), as well as to study the telomere-adjacent regions. Similar strategies have been described by different authors (Rehmeyer et al., 2006; Sanchez-Alonso & Guzman, 2008; Wu et al., 2009). The search for telomere regions was performed in whole genome sequence draft assemblies. These preliminary genome sequence versions were incomplete, contained some truncated genes and were partly misassembled. The rationale for using them is that the final genome assembly strategies very often alter or eliminate the telomere and subtelomere repetitive sequences of the fully assembled genomes. We used the open-access Tandem Repeats Finder program (TRFp, http://tandem.bu.edu/trf/trf.html) to screen for repetitive telomere sequences in more than 6,200 contigs of the 4X coverage draft sequence assembly of *P. ostreatus* PC15 produced by the Joint Genome Institute (http://genome.jgi-psf.org/PleosPC15\_1/PleosPC15\_1.home.html). TRFp locates and displays tandem repeats in a DNA sequence file submitted in FASTA format without the need to specify the repetitive pattern, its size or any other parameter. The TRFp output consists of two files: a repeat table and an alignment. The repeat table contains information about each repeat: its size, copy number, nucleotide content and location. Clicking on the location indices of the table's entries opens a second browser window that shows an alignment of the copies against a consensus pattern. TRFp is a very fast program that permits the analysis of sequences of up to 5 Mb. Repeats with a pattern size ranging from one to 2000 bases can be detected (Benson, 1999). In each scaffold containing a telomere sequence, we identified the filtered gene models, computationally and manually annotated, that lie within 50 Kbp of the telomere repetitive sequence. This was done by manual inspection of each of the predicted genes and by using them as queries against the non-redundant NCBI database with the BlastX program (Altschul et al., 1997). The BlastX results were considered significant if their expected value (e-value) was < 1e-20. The identified protein sequences were also used to query the online Pfam database using default parameters (Bateman et al., 2004). A similar approach was used by Wu et al. (2009) to search for gene-containing subtelomere regions in *N. crassa*. This type of multiple analysis (genetic, molecular, and bioinformatics) allowed the characterization of most of the *P. ostreatus* telomeres as well as several subtelomere regions that show high nucleotide similarity. The highly polymorphic subtelomere region of *P. ostreatus* chromosome six contains genes similar to those described in other eukaryotic organisms (RecQ helicases), as well as a species-specific laccase gene cluster (six out of the 12 genes annotated in the genome; Perez et al., 2009).
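The screening step described above can be illustrated with a minimal sketch. It is not the Tandem Repeats Finder algorithm, which also detects imperfect repeats without a predefined pattern; here a plain regular expression looks for perfect tandem copies of a known motif and of its reverse complement. The function names, the `min_copies` threshold and the toy contig are assumptions made for illustration only.

```python
import re


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_telomere_runs(seq, motif="TTAGGG", min_copies=5):
    """Return (start, end, copies) for each perfect tandem run of `motif`.

    Coordinates are 1-based and inclusive. `min_copies` is a hypothetical
    threshold, not a value taken from the chapter.
    """
    pattern = re.compile(r"(?:%s){%d,}" % (re.escape(motif), min_copies))
    runs = []
    for match in pattern.finditer(seq.upper()):
        copies = (match.end() - match.start()) // len(motif)
        runs.append((match.start() + 1, match.end(), copies))
    return runs


# Toy contig: a CCCTAA array at the 5' end and a TTAGGG array at the 3' end.
contig = "CCCTAA" * 20 + "ACGTTGCA" * 10 + "TTAGGG" * 7

forward = find_telomere_runs(contig, "TTAGGG")
reverse = find_telomere_runs(contig, reverse_complement("TTAGGG"))
print("TTAGGG runs:", forward)
print("CCCTAA runs:", reverse)
```

In a real screen the same function would be applied to every contig read from a FASTA file, and imperfect copies would require a dedicated tool such as TRFp.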

In conclusion, the assemblage of telomere regions by bioinformatics strategies is a powerful tool to determine the arrangement of genomes in putative linkage groups in species with no genetic maps available, to establish synteny among different basidiomycete genomes, and to determine the presence of genes and gene clusters conserved in the subtelomere regions of different genomes.

#### **2.1 Analysis of the basidiomycetes' telomere regions**

In the following sections we will review the composition and structure of the telomere and subtelomere regions of different basidiomycetes using the bioinformatics approach described above. We will use the genome sequence data that are publicly available at the Joint Genome Institute (DOE-JGI, http://www.jgi.doe.gov/). This institute has been developing an intensive sequencing effort on different fungi involved in biological lignocellulose degradation. Lignin is a complex recalcitrant macromolecule that hinders the access of enzymes to cellulose. The enzymatic removal of lignin will permit access to this large carbon reservoir for its use in different energy-related applications. There are two types of fungi according to their strategy for making cellulose accessible: white rot fungi degrade lignin, and brown rot fungi minimally modify the lignin and attack the cellulose using a different chemical approach (Lundell et al., 2010; Martinez et al., 2005; Ruiz-Duenas & Martinez, 2009). Among all the sequenced basidiomycetes, we will concentrate here on the white rot degraders *Ceriporiopsis subvermispora, Phanerochaete chrysosporium* and *P. ostreatus*; the brown rot *Postia placenta* and the tree pathogen *Heterobasidion annosum*.

#### **2.1.1 Analysis of the telomere regions of** *Ceriporiopsis subvermispora* **B**

*C. subvermispora* is a white rot basidiomycete that rapidly depolymerizes lignin with relatively little cellulose degradation when growing on wood (Martinez et al., 2005). The chromosome number of this species has not been conclusively determined. The JGI has sequenced strain B. Two types of sequence data (unassembled reads and unmasked assembled scaffolds) were screened for the presence of telomere sequences in the genome of *C. subvermispora*: 297,269 unassembled reads and 740 assembled scaffolds. In 207 unassembled reads, between five and 23 tandem repeats of the telomere motif TTAGGG were found. In most cases (82% of these reads) between 17 and 19 repeats of the motif were found, and the mean repeat number was 18.7. On the other hand, 187 unassembled reads carried between seven and 22 repeats of the telomere complementary sequence CCCTAA. In this case, the modal repetition number varied between 17 and 19 (84% of the reads) and the mean number was 19.2. Taken together, these data suggest that the average number of telomere repeats in *C. subvermispora* is 19.0.
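The text does not state how the two means were combined into the reported average. As a small worked check, under the assumption that only the group sizes (207 and 187) and group means (18.7 and 19.2) given above are available, the sketch below compares the unweighted mean of the two values with the mean weighted by group size. Both round to approximately 19 repeats.

```python
def pooled_mean(groups):
    """Mean over all observations, given (count, mean) for each group."""
    total = sum(count for count, _ in groups)
    return sum(count * mean for count, mean in groups) / total


# (number of sequences, mean repeat number) for TTAGGG and CCCTAA
groups = [(207, 18.7), (187, 19.2)]

unweighted = sum(mean for _, mean in groups) / len(groups)
weighted = pooled_mean(groups)

print("unweighted mean of the two groups:", round(unweighted, 2))
print("mean weighted by group size:", round(weighted, 2))
```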

The unmasked analysis of the 740 assembled scaffolds revealed that 42 of them contained telomere sequences: 22 harbored the TTAGGG sequence and 20 the complementary CCCTAA. The telomere region TTAGGG was placed at the bottom (3' telomere) end of the chromosome in 21 out of the 22 scaffolds. The exception was scaffold 3, where the sequence was located at an interstitial position. The telomere region CCCTAA was placed at the upper (5' telomere) end of the chromosome in 19 out of 20 scaffolds; here too the exception was scaffold 3, where the sequence was placed at an interstitial location. This suggests a misassembly of scaffold 3, because both the direct (TTAGGG)21 (scaffold 3: 1205371-1205497) and the reverse (CCCTAA)21 (scaffold 3: 1206763-1206881) telomere sequences were found flanking a 1265 bp gap (Fig. 1).
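The distinction between terminal and interstitial telomere blocks can be expressed as a simple coordinate test. The sketch below is an illustration only: the function name and the `max_offset` tolerance are hypothetical, while the scaffold 3 length (2650498 bp) and block coordinates are those reported for Fig. 1.

```python
def classify_block(start, end, scaffold_length, max_offset=100):
    """Label a repeat block by its position on a scaffold.

    start and end are 1-based inclusive coordinates. max_offset is a
    hypothetical tolerance (in bp) for still calling a block terminal.
    """
    if start - 1 <= max_offset:
        return "5' terminal"
    if scaffold_length - end <= max_offset:
        return "3' terminal"
    return "interstitial"


SCAFFOLD_3_LENGTH = 2650498
scaffold_3_blocks = {
    "TTAGGG": (1205371, 1205497),
    "CCCTAA": (1206763, 1206881),
}

for motif, (start, end) in scaffold_3_blocks.items():
    print(motif, classify_block(start, end, SCAFFOLD_3_LENGTH))
```

Both blocks of scaffold 3 fall far from either scaffold end, which is why they are reported as interstitial.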

**Scaffold 3: 2650498 bp** (5' to 3')

(5' TTAGGG 3')21 scaffold\_3:1205371-1205497

GAP (1264 bp) scaffold\_3:1205498-1206762

(5' CCCTAA 3')21 scaffold\_3:1206763-1206881

Fig. 1. Location of interstitial telomere sequences in scaffold 3 of the genome sequence of *C. subvermispora.*

The telomeres found in scaffolds 5 and 7 have a complex structure. The telomere of scaffold 5 contained 22 and 20 copies of the CCCTAA sequence separated by a gap of 193 bp. In the case of scaffold 7, 20 and 18 copies of this sequence appeared separated by a gap of 1680 bp (Fig. 2). We presume that these particular structures could reflect misassemblages of the telomere sequences in both scaffolds.

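**Scaffold 5: 2331173 bp** (5' to 3')

(5' CCCTAA 3')22 scaffold\_5:1-132

Interstitial (651 bp) scaffold\_5:133-184

GAP (193 bp) scaffold\_5:785-978

(5' CCCTAA 3')20 scaffold\_5:979-1098

**Scaffold 7: 2096541 bp** (5' to 3')

(5' CCCTAA 3')20 scaffold\_7:1-120

Interstitial (634 bp) scaffold\_7:121-755

GAP (1680 bp) scaffold\_7:756-2436

(5' CCCTAA 3')18 scaffold\_7:2437-2545

Fig. 2. Location of telomere regions in scaffolds 5 and 7 of the genome sequence of *C. subvermispora.*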

In summary, this analysis has allowed the identification of 42 telomere repeat-containing regions: 22 direct telomere sequences (TTAGGG) present in 22 scaffolds and 20 reverse telomere sequences (CCCTAA) present in 20 scaffolds. Taking into account that the sequenced *C. subvermispora* strain was a dikaryon, these results suggest that the *Ceriporiopsis* genome would consist of 11 chromosomes. Because scaffold 7 showed telomere repeats at both ends, the sequence contained in this scaffold would be the only fully assembled chromosome of *C. subvermispora.*
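The reasoning from telomere counts to chromosome number can be made explicit with a short calculation. The sketch below is a simplification and not necessarily the authors' procedure: it assumes that every telomere region marks a distinct chromosome end, that each chromosome contributes two ends per nucleus, and that the two nuclei of a dikaryon were assembled separately. The function name and the `nuclei` parameter are hypothetical.

```python
import math


def min_chromosomes(telomere_regions, nuclei=1):
    """Lower bound on the haploid chromosome number.

    Assumes two telomeres per chromosome and per nucleus; `nuclei` is 2
    for a dikaryon whose two nuclear genomes were assembled separately.
    """
    return math.ceil(telomere_regions / (2 * nuclei))


# 42 telomere-containing regions in a dikaryotic strain
print(min_chromosomes(42, nuclei=2))
```

With 42 regions in a dikaryon the bound is 11, which agrees with the estimate given above; missing or misassembled telomeres would make the true number higher than this bound.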

#### **2.1.2 Analysis of the telomere regions of** *Phanerochaete chrysosporium* **strain RP78**

*P. chrysosporium* is a model white rot basidiomycete that has been extensively used because of its interest as a lignin degrader (Kersten & Cullen, 2007; Tien, 1987). The draft genome of the homokaryotic strain *P. chrysosporium* RP78 was assembled into 232 scaffolds and contains 35.1 Mbp of non-redundant sequences (Martinez et al., 2004). 90% of the assembly was found in 21 scaffolds, while 50% was found in eight scaffolds larger than 1.9 Mbp.


The screening of the telomere sequences was performed as described above using the scaffolds of assembly v2.0. The basic telomere unit of *P. chrysosporium* is the heptamer TTTAGGG. Sixteen out of the 232 scaffolds contained telomere sequences, eight of them with the TTTAGGG motif. In four of these scaffolds (numbers 7, 9, 23 and 159) the repetitive unit was located at an interstitial position at about 3200 bp from the 3' end, flanked by a gap. This arrangement could be the result of a misassembly, as can be seen in Fig. 3.

**Scaffold 7: 2051558 bp** (5' to 3')

(5' TTTAGGG 3')22 scaffold\_7:2049678-2049832

GAP (726 bp) scaffold\_7:2049833-2050559

Terminal sequence (999 bp) scaffold\_7:2050560-2051558

**Scaffold 9: 1898532 bp** (5' to 3')

(5' TTTAGGG 3')24 scaffold\_9:1896232-1896400

GAP (71 bp) scaffold\_9:1896401-1896472

Terminal sequence (2060 bp) scaffold\_9:1896473-1898532

**Scaffold 23: 273793 bp** (5' to 3')

(5' TTTAGGG 3')26 scaffold\_23:267927-268109

GAP (233 bp) scaffold\_23:268110-268343

Interstitial (2560 bp) scaffold\_23:268344-270904

GAP (1119 bp) scaffold\_23:270905-272024

Terminal sequence (1768 bp) scaffold\_23:272025-273793

**Scaffold 159: 4260 bp** (5' to 3')

(5' TTTAGGG 3')16 scaffold\_159:3285-3397

GAP (50 bp) scaffold\_159:3398-3448

Terminal sequence (811 bp) scaffold\_159:3449-4260

Fig. 3. Location of telomere regions in scaffolds 7, 9, 23 and 159 of the genome sequence of *P. chrysosporium.*
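The copy numbers shown for the TTTAGGG blocks of Fig. 3 can be cross-checked against their coordinates. The sketch below assumes 1-based inclusive coordinates and simply divides each block length by the motif length; the dictionary layout and function name are illustrative choices, while the coordinates are those given for scaffolds 7, 9, 23 and 159.

```python
MOTIF = "TTTAGGG"

# Telomere block coordinates (start, end) as reported for Fig. 3
blocks = {
    "scaffold_7": (2049678, 2049832),
    "scaffold_9": (1896232, 1896400),
    "scaffold_23": (267927, 268109),
    "scaffold_159": (3285, 3397),
}


def whole_copies(start, end, motif_length):
    """Number of complete motif copies that fit in a 1-based inclusive span."""
    return (end - start + 1) // motif_length


copy_numbers = {
    name: whole_copies(start, end, len(MOTIF))
    for name, (start, end) in blocks.items()
}
for name, copies in copy_numbers.items():
    print(name, copies)
```

Each span accommodates the number of copies annotated in the figure (22, 24, 26 and 16), with one extra base per block.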

Another telomere-like interstitial region was found at about 300000 bp from the 3' end of scaffold 10. The position of the interstitial telomere block within the scaffold suggests that it could be the result of an ancestral intra-chromosomal rearrangement (inversions and/or fusions), of differential crossing-over or of the repair of a double-strand break during evolution (Lin & Yan, 2008) (Fig. 4).

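**Scaffold 10: 1493310 bp** (5' to 3')

GAP (1549 bp) scaffold\_10:1193253-1194801

Interstitial (640 bp) scaffold\_10:1194802-1195441

(5' TTTAGGG 3')21 scaffold\_10:1195442-1195589

GAP (273 bp) scaffold\_10:1195590-1195862

Interstitial (95584 bp) scaffold\_10:1195863-1291447

Fig. 4. Location of telomere regions in scaffold 10 of the genome sequence of *P. chrysosporium.*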

In the other eight of the 16 telomere-containing scaffolds identified, the repetitive unit was the complementary CCCTAAA. In five of them, the heptamer unit was placed at the 5' end. A telomere unit in an internal region at 8772 bp from the 5' end was present in scaffold 8, while scaffold 5 ferried a telomere motif at 2123492 bp of the 3' end and another one at 40552 bp of the 3' end (Fig. 5).

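**Scaffold 8: 1906386 bp** (5' to 3')

(5' CCCTAAA 3')33 scaffold\_9:8772-9003

**Scaffold 5: 2164014 bp** (5' to 3')

(5' TTTAGGG 3')22 scaffold\_5:2123492-2123646

Fig. 5. Location of telomere regions in scaffolds 5 and 8 of the genome sequence of *P. chrysosporium.*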

The analysis of scaffold 28 revealed two arrays of interstitial copies of the heptamer CCCCTAAA (23 and 18 repeats, respectively) placed at 11417 bp and 13853 bp from the 5'

scaffold\_37:497534-497679

Terminal sequence (4030 bp) scaffold\_37:497730-501760

Terminal sequence (1695 bp) scaffold\_167:165911-167606

GAP (50 bp) scaffold\_167:165861-165910

Interstitial (1501 bp) scaffold\_117:245546-247046

> Terminal sequence (1595 bp) scaffold\_117:252939-254534

GAP (5892 bp) scaffold\_117:247047-252938

**Scaffold 37: 501760 bp** (5' TTAGG 3')29

**5' 3'**

(5' TTAGG 3')50 scaffold\_117:245246-245495

GAP (1391 bp) scaffold\_37:493926-495316

> Interstitial (780 bp) scaffold\_117:244466-245245

**Scaffold 167: 167606 bp** (5' TTAGG 3')31

**Scaffold 117: 254534 bp**

GAP (754 bp) scaffold\_117:243712-244465

GAP (50 bp) scaffold\_37:497680-497729

**5' 3'**

scaffold\_167:165706- 165860<

**5' 3'**

Interstitial (1917 bp) scaffold\_167:163789-165705

Interstitial (2217 bp) scaffold\_37:495317-497533

Fig. 7. Location of telomere regions in scaffolds 37, 117 and 167 of the genome sequence of

GAP (50 bp) scaffold\_117:245496-245545

(5' TTAGG 3')44 scaffold\_178:154105-154324

5' 3'

GAP (1529 bp) scaffold\_178:154325-155853

(5' TTAGG 3')66 scaffold\_178:156698-157039

Interstitial (844 bp) scaffold\_178:155854-156697

Fig. 8. Location of telomere regions in scaffold 178 of genome sequence of *P. placenta.*

*P. placenta.* 

**Scaffold 178: 157039bp**

GAP (50 bp) scaffold\_167:163739-163788

end. An interstitial fragment of 1342 bp and a gap of 933 bp were found between them suggesting the occurrence of missassemblage (Fig. 6).

Fig. 6. Location of telomere regions in scaffold 28 of the genome sequence of *P. chrysosporium.*

In summary, the results presented here suggest that the genome of *P. chrysosporium* is arranged in at least eight linkage groups. Because scaffold 8 shows telomere repeats at both ends it is suggested that the sequence contained in this scaffold constitute the only fully assembled chromosome of *P. chrysosporium*.

#### **2.1.3 Analysis of the telomere regions of** *Pleurotus ostreatus* **PC15**

*P. ostreatus* PC15 is a monokarytic strain of an industrially-produced edible basidiomycete that has been also used as a model system for lignocellulose degradation. *P. ostreatus* differs from the other white rot model system (*P. chrysosporium*) in its enzymatic portfolio for lignin degradation. The structure of its genome was determined by linkage analysis (Larraya et al., 2000), and the complete genome of this strain has been sequenced. The assemblage v1.0 consists of 19 scaffolds of which 18 were larger than 2 Kbp. The screening of the genome for telomere sequences was carried out as described above and revealed that the elementary telomere unit of *P. ostreatus* is the hexamer TTAGGG. All scaffolds were screened for telomere regions and 19 telomere regions were recovered. In eight of them the motif TTAGGG was found, and the remaining 11 had the motif CCCTAA. The number of repetitions of the basic unit ranged from 19 to 38. As it was determined, scaffolds 1, 3, 4, 5, 6, 9, 10 and 11 show telomere repeats at both ends indicating that they are fully assembled.

#### **2.1.4 Analysis of the telomere regions of** *Postia placenta* **MAD-698**

*P. placenta* is a brown rot basidiomycete that rapidly depolymerizes the cellulose in wood without significant lignin removal. This type of decay differs sharply from white rot fungi such as *P. chrysosporium* and *P. ostreatus*. The genome of the dikaryotic strain of *P. placenta*  MAD-698 revealed a genome of 90.9 Mbp assembled in 1243 scaffolds (Martinez et al., 2009). All the scaffolds of the assemblage were screened for telomere sequences as above. The basic telomere unit in this fungus is the pentamer TTAGG. The analysis of 1243 scaffolds revealed the presence of 23 regions containing telomere sequences. 12 of them carried the TTAGG sequence: in eight of them the sequence was found at the scaffold's 3' end, three scaffolds ferried the sequence in an interstitial location (Fig. 7) and in one of them two regions with the pentamer sequence appeared at the end of the chromosome but separated by 2373 bp suggesting a missassemblage of scaffold 178 at its 3' end (Fig. 8).

Fig. 7. Location of telomere regions in scaffolds 37, 117 and 167 of the genome sequence of *P. placenta.*

Fig. 8. Location of telomere regions in scaffold 178 of the genome sequence of *P. placenta.*

The remaining 11 scaffolds carried the telomere unit CCTAA, placed at the 5' end in seven of them. Scaffold 99 (Fig. 9) has a complex structure with two interstitial regions containing the CCTAA motif. One of them, located towards the scaffold's 5' end, contains 40 copies of the telomere unit; the other, located 16533 bp downstream, contains 19 repetitions of the unit. A gap of about 50 bp at the 5' end of this second region suggests that this arrangement could be due to a misassembly.

**Scaffold 99: 233591 bp**, with (5' CCTAA 3')40 at scaffold\_99:1-202 and (5' CCTAA 3')19 at scaffold\_99:16756-16850, separated by alternating interstitial fragments and gaps.

Fig. 9. Location of telomere regions in scaffold 99 of genome sequence of *P. placenta.*

Scaffold 33 also showed two interstitial regions, with 40 and 30 repetitions of the CCTAA telomere motif. One of the regions was preceded by a gap of 701 bp, suggesting that it could be a misassembly. The other one (40 copies of the telomere unit) could well represent an ITS (Interstitial Telomere Sequence) of the kind produced by chromosome rearrangements, as described above.

**Scaffold 33: 499751 bp**, with (5' CCTAA 3')40 at scaffold\_33:279914-280114 and (5' CCTAA 3')30 at scaffold\_33:281476-281625, the latter preceded by a gap of 701 bp (scaffold\_33:280775-281475).

Fig. 10. Location of telomere regions in scaffold 33 of genome sequence of *P. placenta.*

Two other scaffolds with interstitial sequences were found. Scaffold 70 showed 26 repetitions of the CCTAA telomere unit at 38395 bp from the 5' end, preceded by a gap, and scaffold 144 showed 50 repetitions of the telomere unit at 181041 bp from the 5' end, preceded by another gap of 50 bp (Fig. 11). We suggest that these structures are the consequence of misassemblies.

**Scaffold 70: 365241 bp**, with (5' CCTAA 3')26 at scaffold\_70:38396-38530, preceded by a gap of 50 bp (scaffold\_70:38346-38395). **Scaffold 144: 208283 bp**, with (5' CCTAA 3')50 at scaffold\_144:181042-181291, preceded by a gap of 50 bp (scaffold\_144:180992-181041).

Fig. 11. Location of telomere regions in scaffolds 70 and 144 of genome sequence of *P. placenta.*

In summary, *P. placenta* has a pentameric (TTAGG) basic telomere repeat unit. The number of copies present in the assembled genome ranged from 19 to 70. The data suggest that this species has at least 12 linkage groups.

#### **2.1.5 Analysis of the telomere regions of** *Heterobasidion annosum*

*Heterobasidion annosum* is a root pathogen responsible for important losses in conifer plantations and natural forests throughout the northern hemisphere (Asiegbu et al., 2005). Genetic linkage analyses of this fungus produced maps with 19 large linkage groups and 20 smaller ones (Lind et al., 2005), but the precise chromosome number for this species has not been conclusively determined. Version 2.0 of the homokaryotic *H. annosum* genome assembly consists of 33.1 Mbp of sequence assembled into 15 scaffolds, at least 10 of which represent nearly complete chromosomes (http://genome.jgipsf.org/Hetan2/Hetan2.home.html).

Telomere sequences were screened as described above. The *H. annosum* telomere repeat is the pentamer TTAGG. Screening of the 15 scaffolds yielded 19 telomere regions. Six of them corresponded to the direct repeat sequence at the 3' end of the scaffolds, and the remaining 13 carried the reverse sequence CCTAA at the scaffold's 5' end. These results suggest that the genome of *H. annosum* is arranged in at least 13 linkage groups. Because scaffolds 5, 6, 9, 10, 11 and 12 contain telomere repeats at both ends, they probably correspond to fully assembled chromosomes.
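The reasoning used above, that a scaffold with telomere repeats at both ends probably corresponds to a complete chromosome, can be sketched as follows. The function names, the `tolerance` parameter (how close to a scaffold end a repeat must lie to count as terminal) and the toy coordinates are hypothetical and are not taken from the assemblies discussed in the text.

```python
def classify_hit(start: int, end: int, scaffold_length: int, tolerance: int = 100) -> str:
    """Label a telomere repeat region by its position on the scaffold.

    Coordinates are 1-based and inclusive. `tolerance` is an assumed cutoff.
    """
    if start <= 1 + tolerance:
        return "5'"
    if end >= scaffold_length - tolerance:
        return "3'"
    return "interstitial"

def scaffolds_with_both_ends(hits, lengths, tolerance: int = 100):
    """Return scaffolds that carry telomere repeats at both the 5' and 3' ends.

    `hits` is an iterable of (scaffold, start, end); `lengths` maps each
    scaffold name to its length in bp.
    """
    ends = {}
    for scaffold, start, end in hits:
        label = classify_hit(start, end, lengths[scaffold], tolerance)
        ends.setdefault(scaffold, set()).add(label)
    return sorted(s for s, labels in ends.items() if {"5'", "3'"} <= labels)

# Toy data: scaffold_A has repeats at both ends, scaffold_B has one
# terminal and one interstitial region.
lengths = {"scaffold_A": 1_000_000, "scaffold_B": 500_000}
hits = [
    ("scaffold_A", 1, 120),
    ("scaffold_A", 999_870, 1_000_000),
    ("scaffold_B", 1, 150),
    ("scaffold_B", 250_000, 250_100),
]
print(scaffolds_with_both_ends(hits, lengths))
```

Interstitial regions are kept as a separate label because, as discussed for *P. placenta*, they may point either to misassemblies or to genuine interstitial telomere sequences.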

#### **2.1.6 Summary of the telomere regions of different basidiomycetes**

A summary of the structural characteristics of the telomeres studied in this paper can be found in Table 1.

| Species | Genome length assembled | Scaffold number | Telomere repetition | Average number of telomere repeats | Minimum number of linkage groups analyzed |
|---|---|---|---|---|---|
| *C. subvermispora* | 39.0 Mb | 740 | TTAGGG | 19 copies | 11 |
| *P. chrysosporium* | 35.1 Mb | 232 | TTTAGGG | 22 copies | 8 |
| *P. ostreatus* | 34.3 Mb | 12 | TTAGGG | 24 copies | 11 |
| *P. placenta* | 90.0 Mb | 1243 | TTAGG | 25 copies | 12 |
| *H. annosum* | 33.1 Mb | 15 | TTAGG | 25 copies | 13 |

Table 1. Structural characteristics of the telomeres in the basidiomycetes analyzed.

#### **2.2 Analysis of the basidiomycetes' subtelomeric regions**

The analysis of the subtelomeric regions is aimed at answering two questions: are the genes located in the subtelomeric regions a representative sample of the genes of each species, or is there enrichment in subtelomere-specific genes? If the latter is the case, which are these telomere-enriched genes, and are they conserved across species? To address these questions, we recorded the genes automatically annotated in the 50 Kbp regions adjacent to the telomeres identified in each species, checked them manually, and recorded and classified the Gene Ontology (GO) terms related to the genes identified in these regions (Ashburner et al., 2000).

As the number of telomere-containing scaffolds differed among the species (from 12 in *P. ostreatus* to 1243 in *P. placenta*), the length of genomic sequence screened also varied, although to a much smaller degree (from 800 Kbp in *P. chrysosporium* to 2100 Kbp in *C. subvermispora*). The gene density of the analyzed regions was found to be related to the degree of finishing of the genome: those assembled as draft (*C. subvermispora*, *P. placenta* and *P. chrysosporium*) have gene densities lower than 0.20 genes per Kbp, whereas the gene density in the finished genomes is much higher (0.34 and 0.49 genes per Kbp). The gene density in the subtelomeric regions of the draft genomes seems to be significantly lower than the global gene density of the corresponding genomes. This may be due to deficiencies in the annotation of these draft genomes, since the global gene density is very similar in all the species analyzed (with the exception of *P. placenta*). In all cases, most of the genes automatically annotated at the subtelomeric regions had no homology with others in the gene databases (cutoff criterion e-value < e-20 for BlastX) (Table 2).

| Feature | *C. subvermispora* | *P. chrysosporium* | *P. ostreatus* | *P. placenta* | *H. annosum* |
|---|---|---|---|---|---|
| Total number of scaffolds | 740 | 232 | 12 | 1243 | 15 |
| Scaffolds with direct motif | 22 | 8 | 8 | 12 | 6 |
| Scaffolds with reverse motif | 20 | 8 | 11 | 11 | 13 |
| Number of subtelomeric regions analyzed | 42 | 16 | 19 | 23 | 19 |
| Total length analyzed (Kbp) | 2100 | 800 | 950 | 1150 | 950 |
| Total number filtered model genes | 283 | 134 | 319 | 177 | 470 |
| Predicted protein | 104 (36.7 %) | 57 (42.5 %) | 182 (57.1 %) | 79 (44.6 %) | 131 (27.9 %) |
| Known/annotated/putative genes | 134 (47.3 %) | 56 (41.8 %) | 87 (27.3 %) | 64 (36.2 %) | 279 (59.4 %) |
| Unknown genes (no homology) | 45 (15.9 %) | 21 (15.7 %) | 50 (15.7 %) | 34 (19.2 %) | 60 (12.8 %) |
| Subtelomere gene density (genes / Kbp) | 0.13 | 0.17 | 0.34 | 0.10 | 0.49 |
| Whole genome gene density (genes / Kbp) | 0.31 | 0.39 | 0.34 | 0.19 | 0.40 |

Table 2. Density and homology types in the subtelomere regions.

The Gene Ontology annotation is aimed at standardizing the representation of genes and gene products in such a way that they can be compared among databases. This project provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data (Ashburner et al., 2000). The consortium provides three classification categories: Biological Process (BP), Cellular Component (CC) and Molecular Function (MF). Each identified gene product is labeled with all the GO terms in each category that can define it. In this way, a list of GO terms provides a picture of the specific condition of the gene subset under study. We studied the three categories of GO annotation in the genes annotated in the 50 Kbp subtelomeric regions of the five genomes and recorded their general statistics (Table 3). Because of the low number of gene models identified in the subtelomeric regions, and because not all of them can be labeled with a GO term, the numbers of terms in the Biological Process and Cellular Component categories are rather low, whereas the number of Molecular Function terms is much higher and produces a clearer picture of what the subtelomeric regions code for (Table 3).

| GO category | *C. subvermispora* | *P. chrysosporium* | *P. ostreatus* | *P. placenta* | *H. annosum* |
|---|---|---|---|---|---|
| Biological Process (BP) | 46 | 22 | 24 | 20 | 48 |
| Cellular Component (CC) | 15 | 9 | 10 | 7 | 12 |
| Molecular Function (MF) | 127 | 88 | 161 | 65 | 186 |

Table 3. GO terms richness in the subtelomeric regions of the five basidiomycetes analyzed.
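The subtelomere gene densities reported in Table 2 can be checked with simple arithmetic: the number of filtered gene models divided by the total length analyzed. The sketch below uses the counts, lengths and tabulated densities from Table 2; the function names and the 0.01 tolerance are assumptions for illustration. For four species the ratio agrees with the tabulated value after rounding. For *P. placenta* the ratio 177/1150 is about 0.15, whereas the table lists 0.10; the text does not explain this difference, so the tabulated value may have been derived differently.

```python
# (gene models, Kbp analyzed, tabulated subtelomere density), from Table 2
TABLE_2 = {
    "C. subvermispora": (283, 2100, 0.13),
    "P. chrysosporium": (134, 800, 0.17),
    "P. ostreatus": (319, 950, 0.34),
    "P. placenta": (177, 1150, 0.10),
    "H. annosum": (470, 950, 0.49),
}

def gene_density(gene_models: int, length_kbp: float) -> float:
    """Gene density in genes per Kbp."""
    return gene_models / length_kbp

def deviating_species(table, tolerance: float = 0.01):
    """Species whose computed density differs from the tabulated one
    by more than `tolerance` (an assumed threshold)."""
    return [
        species
        for species, (genes, kbp, tabulated) in table.items()
        if abs(gene_density(genes, kbp) - tabulated) > tolerance
    ]

for species, (genes, kbp, tabulated) in TABLE_2.items():
    computed = gene_density(genes, kbp)
    print(f"{species}: {computed:.3f} genes/Kbp (tabulated {tabulated})")

print("Deviating:", deviating_species(TABLE_2))
```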

To determine whether the genes found at the subtelomeres constitute a representative sample of the genes of each species, a simple statistical analysis can be performed: the number of occurrences expected for each GO term in the subtelomeric regions is calculated using its frequency in the whole genome. This type of analysis shows that the distribution of subtelomeric GO terms in each category is not a representative sample of the total gene set of each species (data not shown); consequently, some sets of genes are found more frequently at the telomere regions. To identify these sets, we discuss the GO term distribution in each of the species.
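The text does not specify which statistical test was used. As one possible illustration, the sketch below computes the expected count of a GO term in the subtelomeric gene set from its whole-genome frequency and then a one-sided hypergeometric p-value for observing at least that many genes with the term. The choice of test, the function names and all numbers in the example are assumptions; they are not data from the study.

```python
from math import comb

def expected_count(genome_with_term: int, genome_total: int, region_total: int) -> float:
    """Count expected in the region if it were a random sample of the genome."""
    return region_total * genome_with_term / genome_total

def hypergeom_upper_tail(observed: int, region_total: int,
                         genome_with_term: int, genome_total: int) -> float:
    """P(X >= observed) when `region_total` genes are drawn without
    replacement from a genome in which `genome_with_term` genes carry the term."""
    upper = min(region_total, genome_with_term)
    favourable = sum(
        comb(genome_with_term, i)
        * comb(genome_total - genome_with_term, region_total - i)
        for i in range(observed, upper + 1)
    )
    return favourable / comb(genome_total, region_total)

# Hypothetical example: a genome of 10000 genes, 200 of them annotated with
# a given term; 283 subtelomeric genes, 13 of them carrying the term.
exp = expected_count(200, 10_000, 283)
p = hypergeom_upper_tail(13, 283, 200, 10_000)
print(f"expected {exp:.2f}, observed 13, one-sided p = {p:.4g}")
```

In practice, one such test would be run per GO term, so a correction for multiple testing would also be needed before calling any term enriched.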

#### **2.2.1 Analysis of the subtelomere regions of** *Ceriporiopsis subvermispora* **B**

The analysis of the 283 gene models annotated in the subtelomeric regions of *C. subvermispora* revealed 46 BP terms in which transport, protein amino acid phosphorylation,

represented ones in this category. Finally, out of the 186 MF terms, the more represented terms are those of zinc ion binding, oxidoreductase activity, ATP binding, binding and heme

If we consider the CC category, the terms nucleus and intracellular are present among the more represented ones in all the species studied, the terms membrane and integral to membrane are present in four of the species and the terms ribosome and cytoplasm are present in three of the five species. In this case the more different subtelomeric regions in terms of the CC-GO terms are those of *C. subvermispora* and *P. placenta*, being the other three

Finally, in the case of the MF terms, as their number is much higher, a deeper comparison can be made among the species (Table 4). The five species analyzed share the most frequent MF terms associated to the genes in the subtelomere regions supporting the idea of a

> *Pleurotus ostreatus*

2,60 2,70 2,69 2,48 3,14 5

2,60 2,02 2,48 3

*Postia placenta* *Heterobasidion annosum* 

Presence

*Phanerochaete chrysosporium*

zinc ion binding 4,76 3,60 2,02 4,96 4,72 5

ATP binding 5,63 2,70 5,05 6,61 2,83 5 nucleic acid binding 3,90 4,50 4,38 4,96 2,20 5 catalytic activity 2,16 1,80 1,35 2,48 2,20 5 DNA binding 1,30 0,90 1,68 2,48 1,57 5 transporter activity 3,03 1,80 1,35 ND 2,20 4 iron ion binding 1,35 1,65 1,57 3

Table 4. Frequency of Molecular Function GO terms in the subtelomere regions of the

preference for certain gene of gene families at these chromosome locations.

*Ceriporiopsis subvermispora*

**2.2.6 Comparative analysis of the subtelomere regions of the five basidiomycetes**  The record of the GO terms associated to genes found in the 50 Kbp adjacent to the telomere sequences in the five basidiomycetes analyzed reveals common patterns that permit to determine some telomere-enriched GO term families. As a preliminary study, we have taken into account the terms that are always present in among the most represented ones in each of the categories and we have extracted those of them that are present in all or most of the species analyzed. If we consider the BP category, the term electron transport is among the most represented in the five species studied and the terms transport, protein aminoacid phosphorylation, metabolic process and proteolysis are present in four of the five species. So, we can conclude that the subtelomeric regions are enriched in these processes. The species *C. subvermispora*, *H. annosum*, and *P. ostreatus* have subtelomeric regions where the more abundant BP-GO terms are highly similar whereas the subtelomeric regions in *P.* 

binding.

*chrysosporium* are the most dissimilar.

species in intermediate positions.

Molecular function

oxidoreductase

protein-tyrosine kinase activity

studied basidiomycetes.

activity

metabolic process, electron transport, carbohydrate metabolic process, and proteolysis are the more represented ones. The terms transport, protein amino acid phosphorylation and carbohydrate metabolism seem to be overrepresented in this region whereas the terms metabolic processes and electron transport seem to be underrepresented in comparison with the total genome. The analysis revealed 15 CC terms annotated in this region. Out of which, the terms intracellular, integral to membrane, membrane and nucleus are the most represented ones in this category. All of them, except the term nucleus, seem to be overrepresented in the subtelomeric region. Finally, out of the 127 MF terms being the more represented were ATP binding, zinc ion binding, nucleic acid binding protein, binding protein, kinase activity, protein serine/threonine kinase activity, transporter activity, oxidoreductase activity, and protein-tyrosine kinase activity. All of them seem to be overrepresented in the subtelomeric regions in comparison to the whole genome.

#### **2.2.2 Analysis of the subtelomere regions of** *Phanerochaete chrysosporium*

The analysis of the 134 gene models annotated in the subtelomeric regions of *P. chrysosporium* revealed 22 BP terms being the most represented terms proteolysis and peptidolysis, electron transport, metabolism, methionine biosynthesis, protein transport, small GTPase mediated signal transduction, and transport. The analysis revealed 9 CC terms out of which, the terms membrane, nucleus and integral to membrane are the most represented ones in this category. Finally, out of the 88 MF terms, the more represented terms are those of aspartic-type endopeptidase activity, nucleic acid binding, zinc ion binding, ATP binding, and oxidoreductase activity

#### **2.2.3 Analysis of the subtelomere regions of** *Pleurotus ostreatus*

The analysis of the 319 gene models annotated in the subtelomeric regions of *P. ostreatus* revealed 24 BP terms being the most represented terms protein amino acid phosphorylation, proteolysis, metabolic process, electron transport, and transport. The analysis revealed 10 CC terms out of which, the terms intracellular, integral to membrane, nucleus, cell wall and ribosome are the most represented ones in this category. Finally, out of the 161 MF terms, the more represented terms are those of ATP binding, nucleic acid binding, oxidoreductase activity, protein-tyrosine kinase activity and zinc ion binding.

#### **2.2.4 Analysis of the subtelomere regions of** *Postia placenta*

The analysis of the 177 gene models annotated in the subtelomeric regions of *P. placenta* revealed 20 BP terms being the most represented terms proteolysis, electron transport, protein amino acid phosphorylation and regulation of transcription DNA-dependent. The analysis revealed 7 CC terms out of which, the terms intracellular, membrane, and nucleus are the most represented ones in this category. Finally, out of the 65 MF terms, the more represented terms are those of ATP binding, nucleic acid binding and zinc ion binding.

#### **2.2.5 Analysis of the subtelomere regions of** *Heterobasidion annosum*

The analysis of the 470 gene models annotated in the subtelomeric regions of *H. annosum*  revealed 48 BP terms being the most represented terms metabolic process, transport, electron transport, regulation of transcription DNA-dependent, carbohydrate metabolic process, and proteolysis. The analysis revealed 12 CC terms out of which, the terms membrane, integral to membrane, intracellular, nucleus and cytoplasm are the most

metabolic process, electron transport, carbohydrate metabolic process, and proteolysis are the more represented ones. The terms transport, protein amino acid phosphorylation and carbohydrate metabolism seem to be overrepresented in this region whereas the terms metabolic processes and electron transport seem to be underrepresented in comparison with the total genome. The analysis revealed 15 CC terms annotated in this region. Out of which, the terms intracellular, integral to membrane, membrane and nucleus are the most represented ones in this category. All of them, except the term nucleus, seem to be overrepresented in the subtelomeric region. Finally, out of the 127 MF terms being the more represented were ATP binding, zinc ion binding, nucleic acid binding protein, binding protein, kinase activity, protein serine/threonine kinase activity, transporter activity, oxidoreductase activity, and protein-tyrosine kinase activity. All of them seem to be

overrepresented in the subtelomeric regions in comparison to the whole genome.

**2.2.2 Analysis of the subtelomere regions of** *Phanerochaete chrysosporium*

binding, ATP binding, and oxidoreductase activity

**2.2.3 Analysis of the subtelomere regions of** *Pleurotus ostreatus*

activity, protein-tyrosine kinase activity and zinc ion binding.

**2.2.4 Analysis of the subtelomere regions of** *Postia placenta*

**2.2.5 Analysis of the subtelomere regions of** *Heterobasidion annosum*

The analysis of the 134 gene models annotated in the subtelomeric regions of *P. chrysosporium* revealed 22 BP terms being the most represented terms proteolysis and peptidolysis, electron transport, metabolism, methionine biosynthesis, protein transport, small GTPase mediated signal transduction, and transport. The analysis revealed 9 CC terms out of which, the terms membrane, nucleus and integral to membrane are the most represented ones in this category. Finally, out of the 88 MF terms, the more represented terms are those of aspartic-type endopeptidase activity, nucleic acid binding, zinc ion

The analysis of the 319 gene models annotated in the subtelomeric regions of *P. ostreatus* revealed 24 BP terms being the most represented terms protein amino acid phosphorylation, proteolysis, metabolic process, electron transport, and transport. The analysis revealed 10 CC terms out of which, the terms intracellular, integral to membrane, nucleus, cell wall and ribosome are the most represented ones in this category. Finally, out of the 161 MF terms, the more represented terms are those of ATP binding, nucleic acid binding, oxidoreductase

The analysis of the 177 gene models annotated in the subtelomeric regions of *P. placenta* revealed 20 BP terms being the most represented terms proteolysis, electron transport, protein amino acid phosphorylation and regulation of transcription DNA-dependent. The analysis revealed 7 CC terms out of which, the terms intracellular, membrane, and nucleus are the most represented ones in this category. Finally, out of the 65 MF terms, the more represented terms are those of ATP binding, nucleic acid binding and zinc ion binding.

The analysis of the 470 gene models annotated in the subtelomeric regions of *H. annosum*  revealed 48 BP terms being the most represented terms metabolic process, transport, electron transport, regulation of transcription DNA-dependent, carbohydrate metabolic process, and proteolysis. The analysis revealed 12 CC terms out of which, the terms membrane, integral to membrane, intracellular, nucleus and cytoplasm are the most represented ones in this category. Finally, out of the 186 MF terms, the more represented terms are those of zinc ion binding, oxidoreductase activity, ATP binding, binding and heme binding.

#### **2.2.6 Comparative analysis of the subtelomere regions of the five basidiomycetes**

The GO terms associated with genes found in the 50 Kbp adjacent to the telomere sequences of the five basidiomycetes analyzed reveal common patterns that allow some telomere-enriched GO term families to be identified. As a preliminary study, we considered the terms that are among the most represented in each category and extracted those present in all or most of the species analyzed. In the BP category, the term electron transport is among the most represented in all five species, and the terms transport, protein amino acid phosphorylation, metabolic process and proteolysis are present in four of the five. We therefore conclude that the subtelomeric regions are enriched in these processes. The most abundant BP-GO terms are highly similar in the subtelomeric regions of *C. subvermispora*, *H. annosum* and *P. ostreatus*, whereas those of *P. chrysosporium* are the most dissimilar.
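The comparison described above amounts to counting, for each GO term, the number of species in which it appears among the most represented terms. The following is a minimal sketch of that tally, not the procedure actually used in this study. It uses only the BP terms listed in the four per-species paragraphs above (*C. subvermispora* is left out because its list is not reproduced here), and it assumes that terms are matched as exact strings, so "proteolysis and peptidolysis" is counted separately from "proteolysis". The names `top_bp_terms` and `term_sharing` are illustrative.

```python
from collections import Counter

# Most represented BP terms per species, transcribed from the paragraphs above.
# C. subvermispora is omitted because its term list is not given in this part.
top_bp_terms = {
    "P. chrysosporium": {
        "proteolysis and peptidolysis", "electron transport", "metabolism",
        "methionine biosynthesis", "protein transport",
        "small GTPase mediated signal transduction", "transport",
    },
    "P. ostreatus": {
        "protein amino acid phosphorylation", "proteolysis",
        "metabolic process", "electron transport", "transport",
    },
    "P. placenta": {
        "proteolysis", "electron transport",
        "protein amino acid phosphorylation",
        "regulation of transcription DNA-dependent",
    },
    "H. annosum": {
        "metabolic process", "transport", "electron transport",
        "regulation of transcription DNA-dependent",
        "carbohydrate metabolic process", "proteolysis",
    },
}


def term_sharing(terms_by_species):
    """Count in how many species each term is among the most represented."""
    counts = Counter()
    for terms in terms_by_species.values():
        counts.update(set(terms))  # each species contributes at most once per term
    return counts


counts = term_sharing(top_bp_terms)
for term, n in counts.most_common():
    print(f"{term}: {n} of {len(top_bp_terms)} species")
```

With these four lists, electron transport is the only term shared by every species, which is consistent with the observation made in the text for the full set of five.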

In the CC category, the terms nucleus and intracellular are among the most represented in all the species studied, the terms membrane and integral to membrane in four of them, and the terms ribosome and cytoplasm in three of the five. Here the subtelomeric regions that differ most in their CC-GO terms are those of *C. subvermispora* and *P. placenta*, with the other three species in intermediate positions.

Finally, as the number of MF terms is much higher, a deeper comparison can be made among the species (Table 4). The five species analyzed share the most frequent MF terms associated with the genes in the subtelomere regions, supporting the idea of a preference for certain genes or gene families at these chromosome locations.


Table 4. Frequency of Molecular Function GO terms in the subtelomere regions of the studied basidiomycetes.

#### **3. Synteny**

Synteny can be defined as the conservation of the relative positions and order of genes on different chromosomes. This definition implies that the conserved genes are related by descent from a common ancestral gene (homologous genes). There are two types of homology: orthology and paralogy. Two genes belonging to different species are called orthologous if they descend from a single gene present in the last common ancestor of the two species. Two genes are called paralogous if they derive from a gene duplication event that occurred in a given species. Orthology requires that speciation has occurred, whereas paralogy does not, since it arises within a single species. As the evolutionary histories of different species may differ, a group of paralogous genes in one species can be orthologous to a single gene in another species. The preserved colocalization of genomic regions on chromosomes of different species is called shared synteny. It may reflect relationships between the genes within the syntenic regions, such as combinations of alleles that are advantageous when inherited together, or shared regulatory mechanisms.

The problem of identifying syntenic regions in different genomes has been addressed using different strategies, including the use of FASTA (Lipman & Pearson, 1985) and BLAST (Altschul et al., 1997), and other bioinformatics approaches (Catchen et al., 2009; Grabherr et al., 2010; Tang et al., 2011). We used a method based on the identification of synteny regions at the chromosome ends by means of a BLASTP analysis of the genes in the two genomes with an E-value cut-off of 1e-20. Later, the Vista Synteny Viewer (http://genome.lbl.gov/vista/index.shtml) integrated into the JGI Genome Portal was used for the preliminary ortholog search in each species. This tool enables pair-wise comparative analysis of genome assemblies at three levels of resolution. Synteny software is of particular interest for revealing the changes undergone by the subtelomeric regions during evolution (Housworth & Postlethwait, 2002). For instance, chromosome 3 of *H. annosum* maintains synteny with *P. ostreatus* chromosome 3, but the *H. annosum* subtelomeric region aligns with a central region of *P. ostreatus* chromosome 4, suggesting that a translocation event occurred after the divergence of the two species.

The different basidiomycetes were used as reference genomes and *P. ostreatus* PC15 v2.0 was the query genome. Focusing on the distal 50 Kbp of each chromosome, we identified the putative orthologous genes in the subtelomeric regions of each basidiomycete. We then used those gene sequences as queries in a BLASTP search against the *P. ostreatus* filtered gene models as subject. Two models were considered orthologous if their alignment had an E-value lower than 1e-20 and they shared at least 60% identity.

The synteny between the subtelomeric regions was analyzed using *P. ostreatus* chromosomes as a reference. Seven *P. ostreatus* chromosomes (chromosomes 1, 3, 4, 5, 7, 8 and 10) harbored sequences homologous to subtelomere regions of the other basidiomycetes analyzed in this study (Table 5). Forty-seven synteny regions were uncovered when the subtelomeric regions of *C. subvermispora* were compared to those of *P. ostreatus*. The highest number (16) corresponded to regions placed on *P. ostreatus* chromosome 7. However, it should be mentioned that 12 *C. subvermispora* gene models were found within a 30 Kbp region of the *P. ostreatus* genome (data not shown). The lowest number of synteny regions was found when the subtelomeric regions of *P. placenta* were used as query. It should be noted that 10 out of 15 regions were placed on chromosome 5 of *P. ostreatus*. A similar situation was observed when synteny was analyzed between *H. annosum* and *P. ostreatus*: 10 out of 25 synteny regions of *H. annosum* mapped to chromosome 10 of *P. ostreatus*.

| *P. ostreatus* chromosome | Chr 1 | Chr 3 | Chr 4 | Chr 5 | Chr 7 | Chr 8 | Chr 10 | Total |
|---|---|---|---|---|---|---|---|---|
| *C. subvermispora* | 6 | 6 | 7 | ― | 16 | 12 | ― | 47 |
| *P. chrysosporium* | 4 | ― | 4 | ― | 4 | ― | ― | 12 |
| *P. placenta* | ― | ― | 5 | 10 | ― | ― | ― | 15 |
| *H. annosum* | ― | ― | 6 | 9 | ― | ― | 10 | 25 |
| Total | 10 | 6 | 22 | 19 | 20 | 12 | 10 | 101 |


Table 5. Number of subtelomeric synteny regions in different basidiomycetes using *P. ostreatus* as reference.
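The orthology criterion used in this section (an E-value below 1e-20 together with at least 60% identity) can be applied to BLASTP results with a short filter. The sketch below is an illustration only, not the authors' pipeline. It assumes the hits are available in BLAST tabular format (`-outfmt 6`) with the default twelve columns, where the third column is the percent identity and the eleventh is the E-value. The function name `putative_orthologs` and the three example hits are hypothetical.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-20  # threshold stated in the text
MIN_IDENTITY = 60.0     # percent identity, threshold stated in the text


def putative_orthologs(blast_tab_handle):
    """Yield (query, subject, identity, evalue) for hits passing both thresholds.

    Assumes BLAST tabular output with the default columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    for row in csv.reader(blast_tab_handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        identity = float(row[2])
        evalue = float(row[10])
        if evalue < E_VALUE_CUTOFF and identity >= MIN_IDENTITY:
            yield query, subject, identity, evalue


# Hypothetical hits, for illustration only (not data from this study).
example = (
    "Csub_A\tPost_1\t72.5\t310\t80\t2\t1\t310\t5\t314\t3e-45\t160\n"
    "Csub_B\tPost_2\t48.0\t200\t100\t3\t1\t200\t1\t200\t1e-30\t110\n"
    "Csub_C\tPost_3\t65.0\t90\t30\t1\t1\t90\t1\t90\t2e-10\t55\n"
)
hits = list(putative_orthologs(io.StringIO(example)))
for query, subject, identity, evalue in hits:
    print(query, subject, identity, evalue)
```

In this example only the first hit is retained: the second fails the identity threshold and the third fails the E-value threshold. Requiring both conditions at once is what keeps short or weakly similar alignments out of the ortholog set.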

Chromosome 4 of *P. ostreatus* can be described as a mosaic of subtelomere modules from the other basidiomycetes studied in this paper. The following gene models, which map to subtelomeric regions in these basidiomycetes, were found at interstitial positions in *P. ostreatus* chromosome 4: from *C. subvermispora*, a cell cycle checkpoint protein, a membrane transporter, a histone deacetylase, a histidine acid phosphatase, the ribosomal protein L1 and an ABC transporter; from *P. chrysosporium*, a haloacid dehalogenase-like hydrolase and a glycoside hydrolase; from *P. placenta*, an inositol polyphosphate phosphatase, a metal-dependent phosphohydrolase, and a monooxygenase; from *H. annosum*, a mitochondrial carrier transporter, a Golgi transporter, and a zinc finger transcription factor (Fig. 12).

Eleven of the 12 gene models of *C. subvermispora* were syntenic to a 20 Kbp region of *P. ostreatus* chromosome 8. These gene models corresponded to a nucleic acid binding protein, citrate synthase, a methyltransferase, an exoribonuclease, a phosphoribosyltransferase, a prenyltransferase, a homeobox transcription factor, a mitochondrial inner membrane protein importer, a DNA-J type heat shock protein, an RNA splicing protein, and a cytochrome c oxidase.
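An observation such as "11 of 12 gene models fall within a 20 Kbp region" can be checked by sliding a fixed-width window along the sorted positions of the hits on one chromosome. The following sketch shows one way to do this. It is not the method used in this study, the function name `densest_window` is illustrative, and the coordinates are invented for the example.

```python
def densest_window(positions, window_bp):
    """Return (count, start) for the window of width window_bp holding the most positions.

    Two-pointer scan over the sorted coordinates: for each right-hand
    position, advance the left pointer until the span fits in the window.
    """
    pts = sorted(positions)
    best_count, best_start = 0, None
    left = 0
    for right, pos in enumerate(pts):
        while pos - pts[left] > window_bp:
            left += 1
        count = right - left + 1
        if count > best_count:
            best_count, best_start = count, pts[left]
    return best_count, best_start


# Hypothetical start coordinates (bp) of 12 hits on one chromosome,
# invented for illustration only.
hit_starts = [
    101_000, 103_500, 104_200, 106_000, 108_800, 110_100,
    112_000, 113_700, 115_000, 117_300, 119_900, 640_000,
]
count, start = densest_window(hit_starts, 20_000)
print(f"{count} of {len(hit_starts)} hits lie within 20 Kbp starting at {start}")
```

Here eleven of the twelve invented positions span 18.9 Kbp, so the function reports a cluster of 11 beginning at position 101,000, while the isolated hit at 640,000 is excluded.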

The genome of *Plasmodium falciparum* is organized in 14 compartmentalized chromosomes in which the conserved regions form the central chromosomal domains and the polymorphic regions lie in the terminal domains. Housekeeping genes thus tend to be located in the central regions of the chromosomes, whereas the highly variable gene families responsible for the antigenic variation of the parasite are clustered towards the telomeres (Hernandez-Rivas et al., 2010). Our results suggest that a similar type of chromosomal organization may occur in basidiomycetes, although a larger number of genomes should be studied to fully support this hypothesis.

Fig. 12. Mosaic structure of *P. ostreatus* chromosome 4. The syntenic gene models of *P. ostreatus* (Post), *P. chrysosporium* (Pchr), *P. placenta* (Ppla) and *H. annosum* (Hann) are indicated along with their position on the *P. ostreatus* chromosome.

#### **4. Conclusion**

The bioinformatics analysis described in this paper allowed us to establish the type and number of telomere repeat units in the basidiomycetes analyzed, to suggest putative linkage groups in fungi for which linkage maps are not available, to uncover misassembled telomere regions, to reveal the preference of some gene models for subtelomeric locations, and to uncover synteny among the subtelomere regions of the basidiomycetes analyzed.

#### **5. Acknowledgements**

This work has been supported by funds of project AGL2008-05608-C02-01 of the Spanish National Plan of Scientific Research, by the Bioethanol Euroinnova project of the Government of Navarre (Spain), and by additional institutional support from the Public University of Navarre. Some of the sequence data were produced in genome sequencing projects developed at the JGI within the Community Sequence Program under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program, and by its associate National Laboratories Lawrence Livermore and Los Alamos.

LR led and coordinated the project. GP determined the telomeres and subtelomere regions in the species bioinformatically. RC, FS and AGP made the GO analysis of the data. The manuscript was prepared by LR and AGP.

#### **6. References**

Abad, J. P., De Pablos, B., Osoegawa, K., De Jong, P. J., Martin-Gallardo, A. & Villasante, A. (2004). TAHRE, a novel telomeric retrotransposon from Drosophila melanogaster, reveals the origin of Drosophila telomeres. *Mol Biol Evol*, Vol. 21, No. 9, pp. 1620-1624.

Al-Wahiby, S., Wong, H. P. & Slijepcevic, P. (2005). Shortened telomeres in murine scid cells expressing mutant hRAD54 coincide with reduction in recombination at telomeres. *Mutat Res*, Vol. 578, No. 1-2, pp. 134-142.

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. *Nucleic Acids Res*, Vol. 25, No. 17, pp. 3389-3402.

Anderson, J. A., Song, Y. S. & Langley, C. H. (2008). Molecular population genetics of Drosophila subtelomeric DNA. *Genetics*, Vol. 178, No. 1, pp. 477-487.

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M. & Sherlock, G. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. *Nat Genet*, Vol. 25, No. 1, pp. 25-29.

Asiegbu, F. O., Adomas, A. & Stenlid, J. (2005). Conifer root and butt rot caused by Heterobasidion annosum (Fr.) Bref. s.l. *Mol Plant Pathol*, Vol. 6, No. 4, pp. 395-409.

Aubert, G. & Lansdorp, P. M. (2008). Telomeres and aging. *Physiol Rev*, Vol. 88, No. 2, pp. 557-579.

Azzalin, C. M. & Lingner, J. (2007). Molecular biology: damage control. *Nature*, Vol. 448, No. 7157, pp. 1001-1002.


Azzalin, C. M., Nergadze, S. G. & Giulotto, E. (2001). Human intrachromosomal telomeric-like repeats: sequence organization and mechanisms of origin. *Chromosoma*, Vol. 110, No. 2, pp. 75-82.

Azzalin, C. M., Reichenbach, P., Khoriauli, L., Giulotto, E. & Lingner, J. (2007). Telomeric repeat containing RNA and RNA surveillance factors at mammalian chromosome ends. *Science*, Vol. 318, No. 5851, pp. 798-801.

Bailey, S. M., Cornforth, M. N., Ullrich, R. L. & Goodwin, E. H. (2004). Dysfunctional mammalian telomeres join with DNA double-strand breaks. *DNA Repair (Amst)*, Vol. 3, No. 4, pp. 349-357.

Bailey, S. M., Meyne, J., Chen, D. J., Kurimasa, A., Li, G. C., Lehnert, B. E. & Goodwin, E. H. (1999). DNA double-strand break repair proteins are required to cap the ends of mammalian chromosomes. *Proc Natl Acad Sci U S A*, Vol. 96, No. 26, pp. 14899-14904.

Barry, J. D., Ginger, M. L., Burton, P. & McCulloch, R. (2003). Why are parasite contingency genes often associated with telomeres? *Int J Parasitol*, Vol. 33, No. 1, pp. 29-45.

Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E. L., Studholme, D. J., Yeats, C. & Eddy, S. R. (2004). The Pfam protein families database. *Nucleic Acids Res*, Vol. 32, No. Database issue, pp. D138-141.

Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. *Nucleic Acids Res*, Vol. 27, No. 2, pp. 573-580.

Bhattacharyya, A. & Blackburn, E. H. (1997). Aspergillus nidulans maintains short telomeres throughout development. *Nucleic Acids Res*, Vol. 25, No. 7, pp. 1426-1431.

Blackburn, E. H. & Gall, J. G. (1978). A tandemly repeated sequence at the termini of the extrachromosomal ribosomal RNA genes in Tetrahymena. *J Mol Biol*, Vol. 120, No. 1, pp. 33-53.

Blasco, M. A. (2005). Telomeres and human disease: ageing, cancer and beyond. *Nat Rev Genet*, Vol. 6, No. 8, pp. 611-622.

Blasco, M. A. (2007). Telomere length, stem cells and aging. *Nat Chem Biol*, Vol. 3, No. 10, pp. 640-649.

Blasco, M. A., Lee, H. W., Hande, M. P., Samper, E., Lansdorp, P. M., DePinho, R. A. & Greider, C. W. (1997). Telomere shortening and tumor formation by mouse cells lacking telomerase RNA. *Cell*, Vol. 91, No. 1, pp. 25-34.

Bolzan, A. D. & Bianchi, M. S. (2006). Telomeres, interstitial telomeric repeat sequences, and chromosomal aberrations. *Mutat Res*, Vol. 612, No. 3, pp. 189-214.

Brown, W. R., Dobson, M. J. & MacKinnon, P. (1990a). Telomere cloning and mammalian chromosome analysis. *J Cell Sci*, Vol. 95, No. Pt 4, pp. 521-526.

Brown, W. R., MacKinnon, P. J., Villasante, A., Spurr, N., Buckle, V. J. & Dobson, M. J. (1990b). Structure and polymorphism of human telomere-associated DNA. *Cell*, Vol. 63, No. 1, pp. 119-132.

Callen, E. & Surralles, J. (2004). Telomere dysfunction in genome instability syndromes. *Mutat Res*, Vol. 567, No. 1, pp. 85-104.

Catchen, J. M., Conery, J. S. & Postlethwait, J. H. (2009). Automated identification of conserved synteny after whole-genome duplication. *Genome Res*, Vol. 19, No. 8, pp. 1497-1505.

Chan, C. S. & Tye, B. K. (1983a). A family of Saccharomyces cerevisiae repetitive autonomously replicating sequences that have very similar genomic environments. *J Mol Biol*, Vol. 168, No. 3, pp. 505-523.

Chan, C. S. & Tye, B. K. (1983b). Organization of DNA sequences and replication origins at yeast telomeres. *Cell*, Vol. 33, No. 2, pp. 563-573.

Chan, S. R. & Blackburn, E. H. (2004). Telomeres and telomerase. *Philos Trans R Soc Lond B Biol Sci*, Vol. 359, No. 1441, pp. 109-121.

Coleman, M. J., McHale, M. T., Arnau, J., Watson, A. & Oliver, R. P. (1993). Cloning and characterisation of telomeric DNA from Cladosporium fulvum. *Gene*, Vol. 132, No. 1, pp. 67-73.

De Cian, A., Lacroix, L., Douarre, C., Temime-Smaali, N., Trentesaux, C., Riou, J. F. & Mergny, J. L. (2008). Targeting telomeres and telomerase. *Biochimie*, Vol. 90, No. 1, pp. 131-155.

de Lange, T. (2005). Shelterin: the protein complex that shapes and safeguards human telomeres. *Genes Dev*, Vol. 19, No. 18, pp. 2100-2110.

de Lange, T. (2009). How telomeres solve the end-protection problem. *Science*, Vol. 326, No. 5955, pp. 948-952.

De Las Penas, A., Pan, S. J., Castano, I., Alder, J., Cregg, R. & Cormack, B. P. (2003). Virulence-related surface glycoproteins in the yeast pathogen Candida glabrata are encoded in subtelomeric clusters and subject to RAP1- and SIR-dependent transcriptional silencing. *Genes Dev*, Vol. 17, No. 18, pp. 2245-2258.

Donate, L. E. & Blasco, M. A. (2011). Telomeres in cancer and ageing. *Philos Trans R Soc Lond B Biol Sci*, Vol. 366, No. 1561, pp. 76-84.

Espejel, S., Franco, S., Rodriguez-Perales, S., Bouffler, S. D., Cigudosa, J. C. & Blasco, M. A. (2002). Mammalian Ku86 mediates chromosomal fusions and apoptosis caused by critically short telomeres. *Embo J*, Vol. 21, No. 9, pp. 2207-2219.

Faravelli, M., Azzalin, C. M., Bertoni, L., Chernova, O., Attolini, C., Mondello, C. & Giulotto, E. (2002). Molecular organization of internal telomeric sequences in Chinese hamster chromosomes. *Gene*, Vol. 283, No. 1-2, pp. 11-16.

Faravelli, M., Moralli, D., Bertoni, L., Attolini, C., Chernova, O., Raimondi, E. & Giulotto, E. (1998). Two extended arrays of a satellite DNA sequence at the centromere and at the short-arm telomere of Chinese hamster chromosome 5. *Cytogenet Cell Genet*, Vol. 83, No. 3-4, pp. 281-286.

Fisher, T. S. & Zakian, V. A. (2005). Ku: a multifunctional protein involved in telomere maintenance. *DNA Repair (Amst)*, Vol. 4, No. 11, pp. 1215-1226.

Flint, J., Bates, G. P., Clark, K., Dorman, A., Willingham, D., Roe, B. A., Micklem, G., Higgs, D. R. & Louis, E. J. (1997a). Sequence comparison of human and yeast telomeres identifies structurally distinct subtelomeric domains. *Hum Mol Genet*, Vol. 6, No. 8, pp. 1305-1313.

Flint, J., Thomas, K., Micklem, G., Raynham, H., Clark, K., Doggett, N. A., King, A. & Higgs, D. R. (1997b). The relationship between chromosome structure and function at a human telomeric region. *Nat Genet*, Vol. 15, No. 3, pp. 252-257.

Freitas-Junior, L. H., Bottius, E., Pirrit, L. A., Deitsch, K. W., Scheidig, C., Guinet, F., Nehrbass, U., Wellems, T. E. & Scherf, A. (2000). Frequent ectopic recombination of virulence factor genes in telomeric chromosome clusters of P. falciparum. *Nature*, Vol. 407, No. 6807, pp. 1018-1022.

Fry, M. (2007). Tetraplex DNA and its interacting proteins. *Front Biosci*, Vol. 12, pp. 4336-4351.

Galagan, J. E., Calvo, S. E., Borkovich, K. A., Selker, E. U., Read, N. D., Jaffe, D., FitzHugh, W., Ma, L. J., Smirnov, S., Purcell, S., Rehman, B., Elkins, T., Engels, R., Wang, S., Nielsen, C. B., Butler, J., Endrizzi, M., Qui, D., Ianakiev, P., Bell-Pedersen, D., Nelson, M. A., Werner-Washburne, M., Selitrennikoff, C. P., Kinsey, J. A., Braun, E. L., Zelter, A., Schulte, U., Kothe, G. O., Jedd, G., Mewes, W., Staben, C., Marcotte, E., Greenberg, D., Roy, A., Foley, K., Naylor, J., Stange-Thomann, N., Barrett, R., Gnerre, S., Kamal, M., Kamvysselis, M., Mauceli, E., Bielke, C., Rudd, S., Frishman, D., Krystofova, S., Rasmussen, C., Metzenberg, R. L., Perkins, D. D., Kroken, S., Cogoni, C., Macino, G., Catcheside, D., Li, W., Pratt, R. J., Osmani, S. A., DeSouza, C. P., Glass, L., Orbach, M. J., Berglund, J. A., Voelker, R., Yarden, O., Plamann, M., Seiler, S., Dunlap, J., Radford, A., Aramayo, R., Natvig, D. O., Alex, L. A., Mannhaupt, G., Ebbole, D. J., Freitag, M., Paulsen, I., Sachs, M. S., Lander, E. S., Nusbaum, C. & Birren, B. (2003). The genome sequence of the filamentous fungus Neurospora crassa. *Nature*, Vol. 422, No. 6934, pp. 859-868.

Gao, W., Khang, C. H., Park, S. Y., Lee, Y. H. & Kang, S. (2002). Evolution and organization of a highly dynamic, subtelomeric helicase gene family in the rice blast fungus Magnaporthe grisea. *Genetics*, Vol. 162, No. 1, pp. 103-112.

Gibson, A. W., Wojciechowicz, L. A., Danzi, S. E., Zhang, B., Kim, J. H., Hu, Z. & Michels, C. A. (1997). Constitutive mutations of the Saccharomyces cerevisiae MAL-activator genes MAL23, MAL43, MAL63, and mal64. *Genetics*, Vol. 146, No. 4, pp. 1287-1298.

Gilson, E. & Geli, V. (2007). How telomeres are replicated. *Nat Rev Mol Cell Biol*, Vol. 8, No. 10, pp. 825-838.

Goytisolo, F. A., Samper, E., Edmonson, S., Taccioli, G. E. & Blasco, M. A. (2001). The absence of the DNA-dependent protein kinase catalytic subunit in mice results in anaphase bridges and in increased telomeric fusions with normal telomere length and G-strand overhang. *Mol Cell Biol*, Vol. 21, No. 11, pp. 3642-3651.

Grabherr, M. G., Russell, P., Meyer, M., Mauceli, E., Alfoldi, J., Di Palma, F. & Lindblad-Toh, K. (2010). Genome-wide synteny through highly sensitive sequence alignment: Satsuma. *Bioinformatics*, Vol. 26, No. 9, pp. 1145-1151.

Greider, C. W. (1998). Telomeres and senescence: the history, the experiment, the future. *Curr Biol*, Vol. 8, No. 5, pp. R178-181.

Griffith, J. D., Comeau, L., Rosenfield, S., Stansel, R. M., Bianchi, A., Moss, H. & de Lange, T. (1999). Mammalian telomeres end in a large duplex loop. *Cell*, Vol. 97, No. 4, pp. 503-514.

Halme, A., Bumgarner, S., Styles, C. & Fink, G. R. (2004). Genetic and epigenetic regulation of the FLO gene family generates cell-surface variation in yeast. *Cell*, Vol. 116, No. 3, pp. 405-415.

Hande, M. P. (2004). DNA repair factors and telomere-chromosome integrity in mammalian cells. *Cytogenet Genome Res*, Vol. 104, No. 1-4, pp. 116-122.

Hande, M. P., Samper, E., Lansdorp, P. & Blasco, M. A. (1999). Telomere length dynamics and chromosomal instability in cells derived from telomerase null mice. *J Cell Biol*, Vol. 144, No. 4, pp. 589-601.

Hastie, N. D. & Allshire, R. C. (1989). Human telomeres: fusion and interstitial sites. *Trends Genet*, Vol. 5, No. 10, pp. 326-331.

Hernandez-Rivas, R., Perez-Toledo, K., Herrera Solorio, A. M., Delgadillo, D. M. & Vargas, M. (2010). Telomeric heterochromatin in Plasmodium falciparum. *Journal of Biomedicine & Biotechnology*, Vol. 2010, pp. 290501.

Hijri, M., Niculita, H. & Sanders, I. R. (2007). Molecular characterization of chromosome termini of the arbuscular mycorrhizal fungus Glomus intraradices (Glomeromycota). *Fungal Genet Biol*, Vol. 44, No. 12, pp. 1380-1386.

Holmquist, G. P. & Dancis, B. (1979). Telomere replication, kinetochore organizers, and satellite DNA evolution. *Proc Natl Acad Sci U S A*, Vol. 76, No. 9, pp. 4566-4570.

Housworth, E. A. & Postlethwait, J. (2002). Measures of synteny conservation between species pairs. *Genetics*, Vol. 162, No. 1, pp. 441-448.

Hsu, H. L., Gilley, D., Galande, S. A., Hande, M. P., Allen, B., Kim, S. H., Li, G. C., Campisi, J., Kohwi-Shigematsu, T. & Chen, D. J. (2000). Ku acts in a unique way at the mammalian telomere to prevent end joining. *Genes Dev*, Vol. 14, No. 22, pp. 2807-2812.

Inglis, P. W., Aragao, F. J., Frazao, H., Magalhaes, B. P. & Valadares-Inglis, M. C. (2000). Biolistic co-transformation of Metarhizium anisopliae var. acridum strain CG423 with green fluorescent protein and resistance to glufosinate ammonium. *FEMS Microbiol Lett*, Vol. 191, No. 2, pp. 249-254.

Inglis, P. W., Rigden, D. J., Mello, L. V., Louis, E. J. & Valadares-Inglis, M. C. (2005). Monomorphic subtelomeric DNA in the filamentous fungus, Metarhizium anisopliae, contains a RecQ helicase-like gene. *Mol Genet Genomics*, Vol. 274, No. 1, pp. 79-90.

Jackson, A. P., Gamble, J. A., Yeomans, T., Moran, G. P., Saunders, D., Harris, D., Aslett, M., Barrell, J. F., Butler, G., Citiulo, F., Coleman, D. C., de Groot, P. W., Goodwin, T. J., Quail, M. A., McQuillan, J., Munro, C. A., Pain, A., Poulter, R. T., Rajandream, M. A., Renauld, H., Spiering, M. J., Tivey, A., Gow, N. A., Barrell, B., Sullivan, D. J. & Berriman, M. (2009). Comparative genomics of the fungal pathogens Candida dubliniensis and Candida albicans. *Genome Res*, Vol. 19, No. 12, pp. 2231-2244.

Johnson, J. E., Smith, J. S., Kozak, M. L. & Johnson, F. B. (2008). In vivo veritas: using yeast to probe the biological functions of G-quadruplexes. *Biochimie*, Vol. 90, No. 8, pp. 1250-1263.

Karpen, G. H. & Spradling, A. C. (1992). Analysis of subtelomeric heterochromatin in the Drosophila minichromosome Dp1187 by single P element insertional mutagenesis. *Genetics*, Vol. 132, No. 3, pp. 737-753.

Keely, S. P., Renauld, H., Wakefield, A. E., Cushion, M. T., Smulian, A. G., Fosker, N., Fraser, A., Harris, D., Murphy, L., Price, C., Quail, M. A., Seeger, K., Sharp, S., Tindal, C. J., Warren, T., Zuiderwijk, E., Barrell, B. G., Stringer, J. R. & Hall, N. (2005). Gene arrays at Pneumocystis carinii telomeres. *Genetics*, Vol. 170, No. 4, pp. 1589-1600.

Keely, S. P., Wakefield, A. E., Cushion, M. T., Smulian, A. G., Hall, N., Barrell, B. G. & Stringer, J. R. (2001). Detailed structure of Pneumocystis carinii chromosome ends. *J Eukaryot Microbiol*, Vol. Suppl, pp. 118S-120S.

Kersten, P. & Cullen, D. (2007). Extracellular oxidative systems of the lignin-degrading Basidiomycete Phanerochaete chrysosporium. *Fungal Genet Biol*, Vol. 44, No. 2, pp. 77-87.


Lundell, T. K., Makela, M. R. & Hilden, K. (2010). Lignin-modifying enzymes in filamentous

Maizels, N. (2006). Dynamic roles for G4 DNA in the biology of eukaryotic cells. *Nat Struct* 

Mandell, J. G., Goodrich, K. J., Bahler, J. & Cech, T. R. (2005). Expression of a RecQ helicase

Martinez, A. T., Speranza, M., Ruiz-Duenas, F. J., Ferreira, P., Camarero, S., Guillen, F.,

Martinez, D., Challacombe, J., Morgenstern, I., Hibbett, D., Schmoll, M., Kubicek, C. P.,

Martinez, D., Larrondo, L. F., Putnam, N., Gelpke, M. D., Huang, K., Chapman, J.,

Maser, R. S. & DePinho, R. A. (2004). Telomeres and the DNA damage response: why the fox is guarding the henhouse. *DNA Repair (Amst)*, Vol. 3, No. 8-9, pp. 979-988. Masutomi, K., Yu, E. Y., Khurts, S., Ben-Porath, I., Currier, J. L., Metz, G. B., Brooks, M. W.,

Mefford, H. C. & Trask, B. J. (2002). The complex structure and dynamic evolution of human

Meyne, J., Baker, R. J., Hobart, H. H., Hsu, T. C., Ryder, O. A., Ward, O. G., Wiley, J. E.,

Mondoux, M. & Zakian, V. A. (2005). Telomere position effect: silencing near the end. In:

Morin, G. B. (1989). The human telomere terminal transferase enzyme is a ribonucleoprotein that synthesizes TTAGGG repeats. *Cell.*, Vol. 59, No. 3, pp. 521-529.

subtelomeres. *Nat Rev Genet.*, Vol. 3, No. 2, pp. 91-102.

Helfenbein, K. G., Ramaiya, P., Detter, J. C., Larimer, F., Coutinho, P. M., Henrissat, B., Berka, R., Cullen, D. & Rokhsar, D. (2004). Genome sequence of the lignocellulose degrading fungus Phanerochaete chrysosporium strain RP78. *Nat* 

Kaneko, S., Murakami, S., DeCaprio, J. A., Weinberg, R. A., Stewart, S. A. & Hahn, W. C. (2003). Telomerase maintains telomere structure in normal human cells. *Cell*,

Wurster-Hill, D. H., Yates, T. L. & Moyzis, R. K. (1990). Distribution of nontelomeric sites of the (TTAGGG)n telomeric sequence in vertebrate chromosomes.

*Telomeres* T. de Lange, V. Lundblad and E. H. Blackburn, pp. (261-316), CSHL Press,

Vol. 50, No. 1, pp. 5-20.

pp. 1954-1959.

*Mol Biol*, Vol. 13, No. 12, pp. 1055-1059.

*Chem*, Vol. 280, No. 7, pp. 5249-5257.

*Biotechnol.*, Vol. 22, No. 6, pp. 695-700.

*Chromosoma*, Vol. 99, No. 1, pp. 3-10.

Cold Spring Harbor, New York.

Vol. 114, No. 2, pp. 241-253.

lignin. *Int Microbiol.*, Vol. 8, No. 3, pp. 195-204.

basidiomycetes--ecological, functional and phylogenetic review. *J Basic Microbiol*,

homolog affects progression through crisis in fission yeast lacking telomerase. *J Biol* 

Martinez, M. J., Gutierrez, A. & del Rio, J. C. (2005). Biodegradation of lignocellulosics: microbial, chemical, and enzymatic aspects of the fungal attack of

Ferreira, P., Ruiz-Duenas, F. J., Martinez, A. T., Kersten, P., Hammel, K. E., Vanden Wymelenberg, A., Gaskell, J., Lindquist, E., Sabat, G., Bondurant, S. S., Larrondo, L. F., Canessa, P., Vicuna, R., Yadav, J., Doddapaneni, H., Subramanian, V., Pisabarro, A. G., Lavin, J. L., Oguiza, J. A., Master, E., Henrissat, B., Coutinho, P. M., Harris, P., Magnuson, J. K., Baker, S. E., Bruno, K., Kenealy, W., Hoegger, P. J., Kues, U., Ramaiya, P., Lucas, S., Salamov, A., Shapiro, H., Tu, H., Chee, C. L., Misra, M., Xie, G., Teter, S., Yaver, D., James, T., Mokrejs, M., Pospisek, M., Grigoriev, I. V., Brettin, T., Rokhsar, D., Berka, R. & Cullen, D. (2009). Genome, transcriptome, and secretome analysis of wood decay fungus Postia placenta supports unique mechanisms of lignocellulose conversion. *Proc Natl Acad Sci U S A.*, Vol. 106, No. 6,


Kusumoto, K. I., Suzuki, S. & Kashiwagi, Y. (2003). Telomeric repeat sequence of Aspergillus

Larraya, L. M., Perez, G., Ritter, E., Pisabarro, A. G. & Ramirez, L. (2000). Genetic linkage

Levis, C., Giraud, T., Dutertre, M., Fortini, D. & Brygoo, Y. (1997). Telomeric DNA of

Levis, R. W. (1993). Drosophila melanogaster does not share the telomeric repeat sequence

Lin, K. W. & Yan, J. (2008). Endings in the middle: current knowledge of interstitial

Linardopoulou, E. V., Williams, E. M., Fan, Y., Friedman, C., Young, J. M. & Trask, B. J.

Lind, M., Olson, A. & Stenlid, J. (2005). An AFLP-markers based genetic linkage map of

Linger, B. R. & Price, C. M. (2009). Conservation of telomere protein complexes: shuffling through evolution. *Crit Rev Biochem Mol Biol*, Vol. 44, No. 6, pp. 434-446. Lipman, D. J. & Pearson, W. R. (1985). Rapid and sensitive protein similarity searches.

Lo, A. W., Sprung, C. N., Fouladi, B., Pedram, M., Sabatier, L., Ricoul, M., Reynolds, G. E. &

Long, D. M., Smidansky, E. D., Archer, A. J. & Strobel, G. A. (1998). In vivo addition of

Louis, E. J. (1995). The chromosome ends of Saccharomyces cerevisiae. *Yeast.*, Vol. 11, No.

Louis, E. J. & Borts, R. H. (1995). A complete set of marked telomeres in Saccharomyces

Louis, E. J. & Haber, J. E. (1990). Mitotic recombination among subtelomeric Y' repeats in

Louis, E. J. & Haber, J. E. (1992). The structure and evolution of subtelomeric Y' repeats in

Luke, B. & Lingner, J. (2009). TERRA: telomeric repeat-containing RNA. *Embo J.*, Vol. 28,

Luke, B., Panza, A., Redon, S., Iglesias, N., Li, Z. & Lingner, J. (2008). The Rat1p 5' to 3'

Saccharomyces cerevisiae. *Genetics*, Vol. 124, No. 3, pp. 547-559.

Saccharomyces cerevisiae. *Genetics.*, Vol. 131, No. 3, pp. 559-574.

telomeric sequences. *Mutat Res*, Vol. 658, No. 1-2, pp. 95-110.

segmental duplication. *Nature.*, Vol. 437, No. 7055, pp. 94-100.

*Science.*, Vol. 227, No. 4693, pp. 1435-1441.

247-251.

440-442.

66, No. 12, pp. 5290-5300.

No. 2, pp. 267-272.

No. 6, pp. 519-527.

4836-4850.

335-344.

16, pp. 1553-1573.

No. 17, pp. 2503-2510.

oryzae consists of dodeca-nucleotides. *Appl Microbiol Biotechnol*, Vol. 61, No. 3, pp.

map of the edible basidiomycete Pleurotus ostreatus. *Appl Environ Microbiol.*, Vol.

Botrytis cinerea: a useful tool for strain identification. *FEMS Microbiol Lett*, Vol. 157,

of another invertebrate, Ascaris lumbricoides. *Mol Gen Genet.*, Vol. 236, No. 2-3, pp.

(2005). Human subtelomeres are hot spots of interchromosomal recombination and

Heterobasidion annosum locating intersterility genes. *Fungal Genet Biol.*, Vol. 42,

Murnane, J. P. (2002). Chromosome instability as a result of double-strand breaks near telomeres in mouse embryonic stem cells. *Mol Cell Biol*, Vol. 22, No. 13, pp.

telomeric repeats to foreign DNA generates extrachromosomal DNAs in the taxolproducing fungus Pestalotiopsis microspora. *Fungal Genet Biol*, Vol. 24, No. 3, pp.

cerevisiae for physical mapping and cloning. *Genetics.*, Vol. 139, No. 1, pp. 125-136.

exonuclease degrades telomeric repeat-containing RNA and promotes telomere elongation in Saccharomyces cerevisiae. *Mol Cell*, Vol. 32, No. 4, pp. 465-477. Lundblad, V. & Blackburn, E. H. (1993). An alternative pathway for yeast telomere maintenance rescues est1- senescence. *Cell*, Vol. 73, No. 2, pp. 347-360.


Riethman, H. C., Xiang, Z., Paul, S., Morse, E., Hu, X. L., Flint, J., Chi, H. C., Grady, D. L. &

Ruiz-Duenas, F. J. & Martinez, A. T. (2009). Microbial degradation of lignin: how a bulky

Samper, E., Goytisolo, F. A., Slijepcevic, P., van Buul, P. P. & Blasco, M. A. (2000).

Sanchez-Alonso, P. & Guzman, P. (1998). Organization of chromosome ends in Ustilago

Sanchez-Alonso, P. & Guzman, P. (2008). Predicted elements of telomere organization and function in *Ustilago maydis*. *Fungal Genet Biol.*, Vol. 45 Suppl 1, No., pp. S54-62. Schaffitzel, C., Berger, I., Postberg, J., Hanes, J., Lipps, H. J. & Pluckthun, A. (2001). In vitro

Schoeftner, S. & Blasco, M. A. (2008). Developmentally regulated transcription of

Schoeftner, S. & Blasco, M. A. (2009). Chromatin regulation and non-coding RNAs at mammalian telomeres. *Semin Cell Dev Biol*, Vol. 21, No. 2, pp. 186-193. Selker, E. U. (1990). Premeiotic instability of repeated sequences in Neurospora crassa. *Annu* 

Sherr, C. J. & McCormick, F. (2002). The RB and p53 pathways in cancer. *Cancer Cell*, Vol. 2,

Shore, D. & Bianchi, A. (2009). Telomere length regulation: coupling DNA end processing to feedback regulation of telomerase. *Embo J*, Vol. 28, No. 16, pp. 2309-2322. Slijepcevic, P., Xiao, Y., Dominguez, I. & Natarajan, A. T. (1996). Spontaneous and radiation-

Smogorzewska, A., Karlseder, J., Holtgreve-Grez, H., Jauch, A. & de Lange, T. (2002). DNA

Soler, D., Genesca, A., Arnedo, G., Egozcue, J. & Tusell, L. (2005). Telomere dysfunction

Starr, J. M., McGurn, B., Harris, S. E., Whalley, L. J., Deary, I. J. & Shiels, P. G. (2007).

Takai, H., Smogorzewska, A. & de Lange, T. (2003). DNA damage foci at dysfunctional

Tang, H., Lyons, E., Pedersen, B., Schnable, J. C., Paterson, A. H. & Freeling, M. (2011).

induced chromosomal breakage at interstitial telomeric sites. *Chromosoma*, Vol. 104,

ligase IV-dependent NHEJ of deprotected mammalian telomeres in G1 and G2.

drives chromosomal instability in human mammary epithelial cells. *Genes* 

Association between telomere length and heart disease in a narrow age cohort of

Screening synteny blocks in pairwise genome comparisons through integer

genome sequence. *Nature.*, Vol. 409, No. 6822, pp. 948-951.

110.1111/j.1751-7915.2008.00078.x.

pp. 1043-1054.

8572-8577.

10, No. 2, pp. 228-236.

No. 2, pp. 103-112.

No. 8, pp. 596-604.

*Rev Genet*, Vol. 24, No., pp. 579-613.

*Curr Biol*, Vol. 12, No. 19, pp. 1635-1644.

*Chromosomes Cancer*, Vol. 44, No. 4, pp. 339-350.

older people. *Exp Gerontol*, Vol. 42, No. 6, pp. 571-573.

telomeres. *Curr Biol*, Vol. 13, No. 17, pp. 1549-1556.

programming. *BMC Genomics*, Vol. 12, No., pp. 102.

Moyzis, R. K. (2001). Integration of telomere sequences with the draft human

recalcitrant polymer is efficiently recycled in nature and how we can take advantage of this. *Microb Biotechnol.*, Vol. 2, No. 2, pp. 164-177. doi:

Mammalian Ku86 protein prevents telomeric fusions independently of the length of TTAGGG repeats and the G-strand overhang. *EMBO Rep*, Vol. 1, No. 3, pp. 244-252.

maydis. RecQ-like helicase motifs at telomeric regions. *Genetics.*, Vol. 148, No. 3,

generated antibodies specific for telomeric guanine-quadruplex DNA react with Stylonychia lemnae macronuclei. *Proc Natl Acad Sci U S A*, Vol. 98, No. 15, pp.

mammalian telomeres by DNA-dependent RNA polymerase II. *Nat Cell Biol*, Vol.


Murnane, J. P. & Sabatier, L. (2004). Chromosome rearrangements resulting from telomere dysfunction and their role in cancer. *Bioessays*, Vol. 26, No. 11, pp. 1164-1174. Nagele, R. G., Velasco, A. Q., Anderson, W. J., McMahon, D. J., Thomson, Z., Fazekas, J.,

Ogami, M., Ikura, Y., Ohsawa, M., Matsuo, T., Kayo, S., Yoshimi, N., Hai, E., Shirai, N.,

Oganesian, L., Moon, I. K., Bryan, T. M. & Jarstfer, M. B. (2006). Extension of G-quadruplex

Padmavathi, J., UmaDevi, K., Rao, C. U. & Reddy, N. N. (2003). Telomere fingerprinting for

Paeschke, K., McDonald, K. R. & Zakian, V. A. (2010). Telomeres: structures in need of

Paeschke, K., Simonsson, T., Postberg, J., Rhodes, D. & Lipps, H. J. (2005). Telomere end-

Palm, W. & de Lange, T. (2008). How shelterin protects mammalian telomeres. *Annu Rev* 

Perez, G., Pangilinan, J., Pisabarro, A. G. & Ramirez, L. (2009). Telomere organization in the

Pryde, F. E., Gorham, H. C. & Louis, E. J. (1997). Chromosome ends: all the same under their

Pryde, F. E. & Louis, E. J. (1997). Saccharomyces cerevisiae telomeres. A review. *Biochemistry* 

Rachidi, N., Martinez, M. J., Barre, P. & Blondin, B. (2000). Saccharomyces cerevisiae PAU genes are induced by anaerobiosis. *Mol Microbiol.*, Vol. 35, No. 6, pp. 1421-1430. Rehmeyer, C., Li, W., Kusaba, M., Kim, Y. S., Brown, D., Staben, C., Dean, R. & Farman, M.

Rhodes, D., Fairall, L., Simonsson, T., Court, R. & Chapman, L. (2002). Telomere

Riethman, H., Ambrosini, A., Castaneda, C., Finklestein, J., Hu, X. L., Mudunuri, U., Paul, S.

Riethman, H., Ambrosini, A. & Paul, S. (2005). Human subtelomere structure and variation.

DNA by ciliate telomerase. *Embo J*, Vol. 25, No. 5, pp. 1148-1159.

unwinding. *FEBS Lett*, Vol. 584, No. 17, pp. 3760-3772.

caps. *Curr Opin Genet Dev.*, Vol. 7, No. 6, pp. 822-828.

oryzae. *Nucleic Acids Res*, Vol. 34, No. 17, pp. 4685-4701.

architecture. *EMBO Rep*, Vol. 3, No. 12, pp. 1139-1145.

assemblies. *Genome Res.*, Vol. 14, No. 1, pp. 18-28.

*Chromosome Res*, Vol. 13, No. 5, pp. 505-515.

*Nat Struct Mol Biol*, Vol. 12, No. 10, pp. 847-854.

*Genet*, Vol. 42, No., pp. 301-334.

*(Mosc).* Vol. 62, No. 11, pp. 1232-1241.

5, pp. 1427-1436.

pp. 377-388.

546-550.

604.

Wind, K. & Lee, H. (2001). Telomere associations in interphase nuclei: possible role in maintenance of interphase chromosome topology. *J Cell Sci*, Vol. 114, No. Pt 2,

Ehara, S., Komatsu, R., Naruko, T. & Ueda, M. (2004). Telomere shortening in human coronary artery diseases. *Arterioscler Thromb Vasc Biol*, Vol. 24, No. 3, pp.

assessing chromosome number, isolate typing and recombination in the entomopathogen Beauveria bassiana. *Mycol Res*, Vol. 107, No. Pt 5, pp. 572-580. Paeschke, K., Juranek, S., Simonsson, T., Hempel, A., Rhodes, D. & Lipps, H. J. (2008).

Telomerase recruitment by the telomere end binding protein-beta facilitates Gquadruplex DNA unfolding in ciliates. *Nat Struct Mol Biol*, Vol. 15, No. 6, pp. 598-

binding proteins control the formation of G-quadruplex DNA structures in vivo.

ligninolytic basidiomycete Pleurotus ostreatus. *Appl Environ Microbiol.*, Vol. 75, No.

(2006). Organization of chromosome ends in the rice blast fungus, Magnaporthe

& Wei, J. (2004). Mapping and initial analysis of human subtelomeric sequence


**20** 


### **SNPpattern: A Genetic Tool to Derive Haplotype Blocks and Measure Genomic Diversity in Populations Using SNP Genotypes**

Stephen J. Goodswen<sup>1,2</sup> and Haja N. Kadarmideen<sup>3</sup>

*<sup>1</sup>University of Technology Sydney, Broadway, Sydney, NSW*

*<sup>2</sup>CSIRO Livestock Industries, ATSIP, University Drive, James Cook University Campus, Townsville, QLD*

*<sup>3</sup>Department of Basic Animal and Veterinary Sciences, Faculty of Life Sciences, University of Copenhagen, Frederiksberg C*

*<sup>1,2</sup>Australia*

*<sup>3</sup>Denmark*

#### **1. Introduction**


The aftermath of the Human Genome Project has generated revolutionary new techniques and equipment, such as high-throughput measurement tools for collecting biological information. One notable tool is the microarray, which can be used to genotype hundreds of thousands of single nucleotide polymorphisms (SNPs) in one run. These high-throughput SNP genotypes, along with phenotypic measurements, can be used in fine quantitative trait loci (QTL) mapping or genome-wide association studies (GWAS). The result of fine QTL mapping or GWAS is a set of statistically significant QTL regions or genetic markers such as SNPs. See Box 1 for an explanation of SNP, QTL and GWAS. The significant QTLs or SNPs are subsequently used in QTL- or SNP-based selection of elite animals or plants for breeding in agriculture, or to predict disease risks in humans and animals (e.g. Burton et al. 2007, Mackay et al. 2009). GWAS relies on the natural phenomenon of linkage disequilibrium (LD) between genetic (SNP) markers and causal variants, or quantitative trait nucleotides (QTNs). For GWAS to be applied successfully, there is a need to understand the extent and distribution of LD across the entire genome in a population. In particular, we need to know how LD (and haplotype diversity) varies from one region or population to another. This need provided the motivation to develop SNPpattern, a generic bioinformatic tool for finding SNP allele patterns in populations.

#### **1.1 The principles of linkage disequilibrium (LD) and haplotypes**

We are currently in a bioinformatics era. The emergence of bioinformatics is the result of two converging forces. One is the exponential increase in computer processing power, digital storage capacity, and digital communication. The other is the exponential increase in biological data (Larranaga et al., 2006). Prior to the 1990s, biologists could be stereotyped as being isolated in their experimental laboratories, working on poorly funded projects and recording their findings on paper. The Human Genome Project completely changed all of this (Collins et al., 2003). Notwithstanding the staggering \$3 billion cost of the project, the scientific findings and the revolutionary new techniques and equipment have spurred on many other projects, generating an avalanche of advances in gene technologies, genomics, and molecular biology. Some of the notable developments are the high-throughput measurement tools for collecting biological information, such as microarrays, high-speed DNA sequencers, and mass spectrometers. The main outcome of all this new technology is an enormous amount of widely disseminated biological data in different digital formats. One of the main challenges in bioinformatics is to transform the exponentially growing biological data into useful information. What constitutes *useful* information is of course debatable; nevertheless, information is the critical starting component for solving biological problems. Living cells are extremely complicated systems; even so, the new high-throughput measurement tools have revolutionised the way we can collect biological data about these systems and begin to unravel their complexity. In the light of these advances in genomics, the aspiration of bioinformatics is to provide the relevant tools to make sense of multiple sources of omics datasets or, at the very least, to enable the researcher to make valuable inferences, connections and predictions from the information. Kadarmideen & Reverter (2007) provided a good review of integrative analytical frameworks that combine multiple omics data types, specifically for livestock populations, but they discuss issues that are generic to most species for which genome sequences are being made available.
For instance, Kadarmideen et al. (2006), Kadarmideen and Janss (2007) and Kadarmideen (2008) apply integrative systems genetics approaches to map genetic variants and unravel the underlying genetic networks of diabetes, stress, and reproduction, respectively, in recombinant inbred strains of mouse genotyped for over 2 million SNP genetic markers and expression profiled by microarray for over 20000 transcripts in various tissues. Without the relevant bioinformatics tools, it would not have been possible to integrate such large datasets and apply sophisticated statistical genetic algorithms and models.

Systematic studies of common genetic variants have shown that some combinations of polymorphisms at different loci occur more or less frequently in a population, such that the alleles of these polymorphisms are associated more often than if they were unlinked. That is, there is a statistically significant difference between observed and expected allelic frequencies (expected, in this instance, refers to the allelic frequencies that would result from independent segregation).

This non-random and non-Mendelian association between alleles at two or more loci is referred to as linkage disequilibrium (LD) and is a departure from the Hardy-Weinberg equilibrium. SNPs (Box 1) are the most common type of polymorphism and are extremely dense throughout the genome, which allows for an effective study of common haplotypes. For the remainder of this section, SNPs will be used when referring to variants/polymorphisms in the context of LD.

Prior to the year 2004 there was little published research on LD in humans, yet from 2004 onwards an exponential release of publications commenced<sup>1</sup> (for instance, see patterns of human LD in Ardlie et al. 2002). It is argued that this increase in interest is mainly because of the increased applications of LD as a tool. For example, LD is the essential tool of genetic association studies. In genome-wide association studies (Hirschhorn et al. 2002, Pearson and Manolio 2008, Kruglyak 2008), the premise is to test for associations between the variation in a complex trait and causal mutations; however, for the most part we instead test for association between the trait and a SNP in high LD with the causal mutation. Knowledge of LD patterns has been shown to increase the power of association studies and to decrease the amount of genotyping they require. For example, we can use information about LD and allele frequencies across the genome to make informed decisions as to which SNPs (known as tag SNPs) should be selected for the genotyping array. That is, the number of SNPs required in GWAS can be reduced without a reduction in power if LD is extensive (Carlson et al., 2001). Linkage disequilibrium is also used in studies of a species' genetic history and origins, in the detection of natural selection, and in the biology of recombination, by inferring the distribution of crossover events from patterns of LD (Pritchard, 2001). In particular for animal production, working out LD is important within breeds to determine the SNP

<sup>1</sup> Based on ISI Web of Knowledge<sup>SM</sup> searches

#### BOX 1

SNP: A single-nucleotide polymorphism is a DNA sequence variation occurring when a single nucleotide (A, T, C, or G) in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes in an individual. For example, two sequenced DNA fragments from different individuals, AAGCCTA and AAGCTTA, contain a difference in a single nucleotide. In this case we say that there are two alleles: C and T. Almost all common SNPs have only two alleles. (Source: http://en.wikipedia.org/wiki/Main\_Pag)

QTL mapping: Quantitative trait locus (QTL) mapping means identifying genes that affect a complex phenotype such as a disease, or that explain a significant proportion of the genetic variation of a quantitative trait observed in a mapping population. It uncovers the genetic basis of quantitative variation in a trait.

GWAS: A genome-wide association study (GWAS) is an approach that involves rapidly scanning genetic (SNP) markers across the genome in hundreds of individuals to find and quantify genetic variations in a particular disease or trait associated with each SNP screened. It uses highly dense SNP marker genotype data (nearing 1 million in some animal species) to detect association with phenotypes. These studies require larger sample sizes than QTL mapping and require validation in other independent populations. GWAS techniques result in a panel of predictive markers that can predict a future phenotype of an individual. How good a prediction by a set of markers will be depends on whether or not they are linked to and/or in linkage disequilibrium with causal loci.
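The SNP illustration in Box 1, which compares the fragments AAGCCTA and AAGCTTA, can be reproduced in a few lines of Python. This is only a minimal sketch: the function name `find_snps` is hypothetical, and it assumes the two fragments are already aligned and of equal length, which real sequencing data would not guarantee.

```python
def find_snps(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, allele in seq_a, allele in seq_b) per mismatch.

    Assumption: both fragments are already aligned and equally long.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("fragments must be aligned to the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1)
        if a != b
    ]


# The two fragments quoted in Box 1 differ at one site, with alleles C and T.
print(find_snps("AAGCCTA", "AAGCTTA"))  # [(5, 'C', 'T')]
```

Reporting positions as 1-based follows the usual convention for genomic coordinates; a 0-based variant would only need `start=0`.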


SNP: A single-nucleotide polymorphism is a DNA sequence variation occurring when a single nucleotide — A, T, C, or G — in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes in an individual. For example, two sequenced DNA fragments from different individuals, AAGCCTA to AAGCTTA, contain a difference in a single nucleotide. In this case we say that there are two alleles: C and T. Almost all common SNPs have only two alleles. (Source: http://en.wikipedia.org/wiki/Main\_Pag)

QTL mapping: Quantitative trait locus (QTL) mapping means identifying genes that affects a complex phenotype like disease or explains significant proportion of genetic variation of a quantitative trait observed in mapping population. It uncovers the genetic basis of quantitative variation in a trait.

GWAS: A genome-wide association study (GWAS) is an approach that involves rapidly scanning genetic (SNP) markers across genome in hundreds of individuals to find and quantify genetic variations in a particular disease or trait associated with each SNP screened. It uses highly dense SNP marker genotype data (nearing 1 million in some animal species) to detect association with phenotypes. These study require larger sample sizes than QTL mapping and requires validations in other independent populations. GWAS techniques result in a panel of predictive markers that can predict a future phenotype of an individual. How good will be a prediction by a set of markers depends on whether or not they are linked to and/or in linkage disequilibrium with causal loci.

association studies. In genome-wide association studies (Hirschhorn et al. 2002, Pearson and Manolio 2008, Kruglyak 2008), the premise is to test for associations between the variation in a complex trait and causal mutations, however, for the most part we instead test for association between the trait and a SNP in high LD with the causal mutation. Knowledge of LD patterns has been shown to increase the power and decrease the amount of genotyping required for association studies. For example, we can use information about LD and allele frequencies across the genome to make informed decisions as to which SNPs (known as tag SNPs) should be selected for the genotyping array. That is, the number of SNPs required in GWAS can be reduced without a reduction in power if LD is extensive (Carlson et al., 2001). Linkage disequilibrium is also used in the studies of a species genetic history and origins, the detection of natural selection, and the biology of recombination from inferring the distribution of crossover events from patterns of LD Pritchard (2001). In particular for animal production, working out LD is important within breeds to determine the SNP

SNPpattern: A Genetic Tool to Derive Haplotype Blocks

defining regions with high or low haplotype diversity.

**1.2 Phasing SNP genotypes for deriving paternal and maternal haplotypes**

We currently have the technology to observe genotypes but not haplotypes. That is, we do not observe individual alleles on the chromosome. This immediately presents a problem for haplotype analysis since the phase is not known when SNPs are heterozygous. For example, given the genotype of 2 SNPs with homozygous alleles at 2 different loci, "11" and "22" respectively; the haplotype on both the paternal and maternal chromosome is conclusively "12". However, given the genotype of 2 SNPs with heterozygous alleles, "12" and "12"; we do not know which allele is inherited from which parent. The possible haplotypes are

We cannot say for certain which alleles on a haplotype go together when using genotype data with heterozygous SNP alleles. Consequently we need to determine or infer the phase from other methods. There are 3 possible methods available to the researcher: 1) use pedigree information; 2) use molecular methods to single out individual chromosomes to do genotyping (currently only possible on small regions; and 3) statistical methods to infer the haplotype given genotype data. From literature, there are several algorithms and programs for inferring haplotypes. Two of the most popular programs are called PHASE, which uses algorithms based on Bayesian coalescent models Stephens et al. (2001) and fastPHASE, which uses an EM algorithm and cluster model Scheet et al. (2006). The default PHASE and fastPHASE output format has been adopted as the format required for the input data to

recombination events.

numbers of SNPs.

shown in Figure 1.

*SNPpattern.*

and Measure Genomic Diversity in Populations Using SNP Genotypes 429

is variation in recombination rates, and regions of recombination appear and disappear over evolutionary time. By studying the patterns of LD we can at least infer the distribution of

In the literature LD is intertwined with the term haplotype. There are many definitions of the term haplotype in the literature, herein haplotype is used as being half of a genotype, that is, a set of ordered SNP alleles on a *single* chromosome that are transmitted as one unit from a parent to an offspring (Ardlie et al., 2002). Theoretically a haplotype, one unit, could comprise any number of SNPs from only 2 SNPs to every single SNP on the chromosome. In reality, however, recombination events result in haplotype blocks comprised of varying

Early studies of pairwise LD (i.e. using 2-locus haplotypes) observed complex patterns of LD implying a random nature. It is now becoming clear that despite many generations of segregation from a common ancestral chromosome, certain combinations of neighbouring SNP alleles (haplotype units) have remained unchanged. In other words, there are stretches of DNA that are almost never divided during meiosis (Gibbs et al., 2003). Although we do not fully understand the biological processes that give rise to recombination in some regions of the chromosome and not in others, there still appears to be some non-random underpinning mechanism. More recently the International Hapmap Project (Gibbs et al., 2003) has shown that the underlying structure of LD in a genome could be divided into discrete haplotype blocks. Using evidence from their LD measures, a haplotype block represents a region with a few haplotypes (2-4 per block) in a population separated by a region with many haplotypes in the population. Their proposed haplotype block model of LD, from a recombination perspective, is a region of high LD separated by recombination hotspots. There are two popular methods for block definition: 1) using pair-wise disequilibrium to define regions of high LD separated by recombination hotspots, and 2)

density for GWAS, and across breeds to check whether LD based predictions are expected to persist between breeds.

To quantify the amount of LD, a variety of statistical measures have been proposed: D, D´ and r<sup>2</sup>. D is the basic measure of LD and is given by D = P<sub>AB</sub> - P<sub>A</sub> \* P<sub>B</sub> (where P<sub>A</sub> and P<sub>B</sub> are the marginal allele frequencies at two loci on a chromosome, and P<sub>AB</sub> is the observed frequency of the haplotype carrying both alleles). D equals 0 if and only if the two loci are independent. A disadvantage of D is that the range of possible values depends on the marginal allele frequencies and therefore, as there is no standardisation, it is difficult to compare D values. D´ is the standardisation of D and its formula is shown in Equation 1.

$$D' = \begin{cases} \dfrac{D}{D_{\text{max}}} & \text{when } D \ge 0 \\[1.5ex] \dfrac{D}{D_{\text{min}}} & \text{when } D < 0 \end{cases}$$

where D<sub>max</sub> = the smaller of P<sub>A</sub> \* P<sub>b</sub> and P<sub>a</sub> \* P<sub>B</sub>, and D<sub>min</sub> = the larger of -P<sub>A</sub> \* P<sub>B</sub> and -P<sub>a</sub> \* P<sub>b</sub>.

Equation 1. Measuring LD using D´ for 2 loci A and B with 2 alleles each (A/a and B/b), where P denotes allele frequencies.

The most widely used measure of LD is the squared correlation between pairs of biallelic SNPs, denoted by r<sup>2</sup> (refer to Equation 2). Some properties of r<sup>2</sup>: a value of 0 implies independence between the SNP alleles (perfect equilibrium); a value of 1 implies perfect LD. Most pairs of SNP alleles have an r<sup>2</sup> between 0 and 1, indicating the strength of the association between their alleles. An r<sup>2</sup> of 0.7 or 0.8 is considered strong LD between SNPs. For the most part, the strength of the correlation between SNPs decreases as the genetic distance between the SNPs increases. The r<sup>2</sup> measure also has another useful property; it is claimed to be related to the power of association mapping and can consequently be used to estimate how large the sample size needs to be to capture an association (n<sub>2</sub> = n<sub>1</sub> / r<sup>2</sup>, where n<sub>1</sub> is the sample size needed when the causal variant itself is genotyped and n<sub>2</sub> is the sample size needed for the same power when a marker SNP in LD with it is genotyped instead). Currently for human genotyping arrays, tag SNPs are selected based on an r<sup>2</sup> concept of LD structure for their pairwise ability to predict the genotype of untyped SNPs. For species with limited knowledge of LD, evenly spaced SNPs are selected.

$$r^2 = \frac{D^2}{P_A \, P_B \, P_a \, P_b}$$

Equation 2. Given haplotypes for 2 loci A and B with 2 alleles, where P = allele frequencies and D is the basic measure of LD, i.e. D = P<sub>AB</sub> - P<sub>A</sub> \* P<sub>B</sub>.
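As a rough illustration of the three measures, the sketch below computes D, D´ and r<sup>2</sup> from a list of phased two-SNP haplotypes. The function names, the string encoding of haplotypes (for example "12") and the choice of "1" as the reference allele are assumptions made for this example only; D<sub>max</sub> and D<sub>min</sub> are taken as products of allele frequencies.

```python
def ld_stats(haplotypes, allele_a="1", allele_b="1"):
    """Return (D, D_prime, r2) for two biallelic loci.

    haplotypes: list of 2-character strings, one per phased chromosome,
    e.g. "12" means allele 1 at the first SNP and allele 2 at the second.
    (Hypothetical encoding chosen for this sketch.)
    """
    haps = list(haplotypes)
    n = len(haps)
    p_a = sum(h[0] == allele_a for h in haps) / n   # frequency of A
    p_b = sum(h[1] == allele_b for h in haps) / n   # frequency of B
    p_ab = sum(h[0] == allele_a and h[1] == allele_b for h in haps) / n
    q_a, q_b = 1 - p_a, 1 - p_b                     # frequencies of a and b

    denom = p_a * p_b * q_a * q_b
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")

    d = p_ab - p_a * p_b
    if d >= 0:
        d_prime = d / min(p_a * q_b, q_a * p_b)     # D / Dmax
    else:
        d_prime = d / max(-p_a * p_b, -q_a * q_b)   # D / Dmin
    r2 = d * d / denom
    return d, d_prime, r2


def required_sample_size(n_direct, r2):
    """Sample size needed at a marker SNP, given the size needed at the
    causal variant (n_direct) and the r2 between the two."""
    return n_direct / r2


if __name__ == "__main__":
    sample = ["11"] * 6 + ["12"] * 2 + ["22"] * 2
    print(ld_stats(sample))
    print(required_sample_size(1000, 0.5))
```

In this toy sample D´ comes out at (approximately) 1 while r<sup>2</sup> is well below 1, which reflects the general point that the two measures answer different questions.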

Population genetic factors that affect LD among specific groups of SNPs are numerous, complex, and not clearly understood. Some of the acknowledged factors are mutation, historical recombination, natural selection, founder effects, migration, population growth, random drift, gene conversion, and population admixture. Only recombination is discussed further in this chapter. It has been argued that recombination is one of the main factors affecting LD (Ardlie et al., 2002). The rate of LD decay depends on the rate of recombination and for the most part, decay in LD is affected by how close the alleles are together. Little is known about the actual molecular mechanism of recombination and why some regions of the chromosome experience more recombination than others. What we do know is that there is variation in recombination rates, and regions of recombination appear and disappear over evolutionary time. By studying the patterns of LD we can at least infer the distribution of recombination events.

In the literature LD is intertwined with the term haplotype. There are many definitions of the term haplotype in the literature; herein a haplotype is taken to be half of a genotype, that is, a set of ordered SNP alleles on a *single* chromosome that are transmitted as one unit from a parent to an offspring (Ardlie et al., 2002). Theoretically a haplotype, as one unit, could comprise any number of SNPs, from only 2 SNPs to every single SNP on the chromosome. In reality, however, recombination events result in haplotype blocks comprising varying numbers of SNPs.

Early studies of pairwise LD (i.e. using 2-locus haplotypes) observed complex patterns of LD, implying a random nature. It is now becoming clear that despite many generations of segregation from a common ancestral chromosome, certain combinations of neighbouring SNP alleles (haplotype units) have remained unchanged. In other words, there are stretches of DNA that are almost never divided during meiosis (Gibbs et al., 2003). Although we do not fully understand the biological processes that give rise to recombination in some regions of the chromosome and not in others, there still appears to be some non-random underpinning mechanism. More recently the International HapMap Project (Gibbs et al., 2003) has shown that the underlying structure of LD in a genome could be divided into discrete haplotype blocks. Using evidence from their LD measures, a haplotype block represents a region with a few haplotypes (2-4 per block) in a population, separated by a region with many haplotypes in the population. Their proposed haplotype block model of LD, from a recombination perspective, is a region of high LD separated by recombination hotspots. There are two popular methods for block definition: 1) using pairwise disequilibrium to define regions of high LD separated by recombination hotspots, and 2) defining regions with high or low haplotype diversity.

#### **1.2 Phasing SNP genotypes for deriving paternal and maternal haplotypes**

We currently have the technology to observe genotypes but not haplotypes. That is, we do not observe which alleles sit together on the same chromosome. This immediately presents a problem for haplotype analysis since the phase is not known when SNPs are heterozygous. For example, given the genotypes of 2 SNPs that are homozygous at 2 different loci, "11" and "22" respectively, the haplotype on both the paternal and maternal chromosome is conclusively "12". However, given the genotypes of 2 SNPs that are both heterozygous, "12" and "12", we do not know which allele is inherited from which parent. The possible haplotypes are shown in Figure 1.
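The phase ambiguity can be made concrete by enumerating every pair of haplotypes that is compatible with a set of unphased genotypes. The following is a minimal sketch; the function name and the two-character genotype strings ("11", "12", "22") are assumptions that simply mirror the notation used in the text.

```python
from itertools import product


def compatible_haplotype_pairs(genotypes):
    """List the unordered haplotype pairs consistent with unphased genotypes.

    genotypes: one 2-character string per SNP, e.g. ["12", "12"].
    Returns a sorted list of (haplotype, haplotype) tuples.
    """
    options = []
    for g in genotypes:
        a, b = g[0], g[1]
        # A homozygous SNP has one arrangement; a heterozygous SNP has two.
        options.append([(a, b)] if a == b else [(a, b), (b, a)])

    pairs = set()
    for combo in product(*options):
        first = "".join(x for x, _ in combo)
        second = "".join(y for _, y in combo)
        # Sort so that (h1, h2) and (h2, h1) count as the same pair.
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)


if __name__ == "__main__":
    print(compatible_haplotype_pairs(["11", "22"]))  # one possibility only
    print(compatible_haplotype_pairs(["12", "12"]))  # two possibilities
```

With k heterozygous SNPs the number of compatible pairs grows as 2<sup>k-1</sup>, which is why phase has to be inferred rather than enumerated for realistic block sizes.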

We cannot say for certain which alleles on a haplotype go together when using genotype data with heterozygous SNP alleles. Consequently we need to determine or infer the phase by other methods. There are 3 possible methods available to the researcher: 1) use pedigree information; 2) use molecular methods to single out individual chromosomes for genotyping (currently only possible on small regions); and 3) use statistical methods to infer the haplotypes given genotype data. In the literature there are several algorithms and programs for inferring haplotypes. Two of the most popular programs are PHASE, which uses algorithms based on Bayesian coalescent models (Stephens et al., 2001), and fastPHASE, which uses an EM algorithm and cluster model (Scheet et al., 2006). The default PHASE and fastPHASE output format has been adopted as the format required for the input data to *SNPpattern.*


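Fig. 1. Possible haplotypes when 2 SNPs have heterozygous alleles. P = chromosome inherited from father; M = chromosome inherited from mother (genotype for SNPs = 12).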

PHASE is a statistical method inspired by coalescent theory. Coalescent theory is, in essence, the tracing of alleles shared in a sample of individuals from a population back to the most recent common ancestor (Fu et al., 1999). This theory can predict the expected patterns of haplotypes in natural populations. The PHASE method is Bayesian and uses the a priori expectation of haplotypes to inform haplotype reconstruction (see Equation 3). The phase reconstruction procedure evaluates the conditional distribution of the unknown haplotypes corresponding to the genotypes of the individuals in a population sample. PHASE uses Gibbs sampling (Kim, 2001) to obtain an approximate sample from the posterior distribution of unknown haplotype pairs given genotype data (i.e. Pr(H | G) is the posterior probability that the reconstruction of the haplotype pairs is correct, given the genotypes *and* knowledge of previous haplotype reconstruction states). In the most simplistic terms, the algorithm begins by estimating the haplotypes for a randomly chosen individual on the assumption that all other haplotypes are reconstructed correctly. The algorithm repeats the process enough times to result in an approximate haplotype reconstruction from the posterior probability. Stephens[38] claims that PHASE, "is sufficiently accurate that reconstructing haplotypes experimentally, or by genotyping additional family members, may be an inefficient use of resources".

$$\Pr\left(H \mid G\right) = \frac{\Pr(G \mid H)\Pr(H)}{\Pr(G)}$$

Equation 3. Bayes theorem.

where,

Pr(H|G) is the conditional probability that the reconstruction of the haplotype pairs is correct given the genotypes.

Pr(H) is the prior (unconditional) probability the reconstruction of the haplotype pairs is correct irrespective of genotype data.

Pr(G) is the total probability of the observed genotypes across all possible haplotypes (acts as a normalising constant).


Pr(G|H) is the conditional probability of obtaining the genotypes given the haplotypes.

The fastPHASE software package implements a statistical model that captures patterns of LD. The variation in the patterns can be applied to estimate missing genotypes and to infer haplotype phase from unphased genotype data in samples of unrelated individuals from natural populations. The fastPHASE statistical model uses an "approximate coalescent with recombination" prior, motivated by the observation that over short genomic regions haplotypes in a population cluster into groups of similar haplotypes because of recombination (Stephens et al., 2005). The model also considers each cluster of observed haplotypes to represent a common haplotype, and each haplotype is assumed to have evolved from a single cluster. The membership of each cluster is allowed to change along the chromosome in accordance with a hidden Markov model (Scheet et al., 2006). An expectation-maximization (EM) algorithm (Dempster et al., 1997) is used to estimate the model parameters.
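To show how Equation 3 and an EM iteration fit together, the sketch below uses the classic two-SNP EM estimator of haplotype frequencies. This is a deliberately simplified stand-in and is not the algorithm implemented in PHASE or fastPHASE: it assumes random mating, two biallelic SNPs coded "1"/"2", and hypothetical function and variable names. In the E-step the posterior phase of each double heterozygote is obtained from Bayes' theorem, with the prior Pr(H) taken as the product of the current haplotype frequencies; in the M-step the frequencies are re-estimated from the expected counts.

```python
from collections import Counter

HAPLOTYPES = ("11", "12", "21", "22")


def em_haplotype_freqs(genotypes, n_iter=50):
    """Estimate two-SNP haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2) tuples, each g a string "11", "12" or "22".
    Returns (freqs, post_cis), where post_cis is the posterior probability
    that a double heterozygote carries haplotypes 11/22 rather than 12/21.
    """
    fixed = Counter()      # haplotype counts whose phase is unambiguous
    n_double_het = 0
    for g1, g2 in genotypes:
        g1, g2 = "".join(sorted(g1)), "".join(sorted(g2))
        if g1 == "12" and g2 == "12":
            n_double_het += 1          # phase unknown
        else:
            fixed[g1[0] + g2[0]] += 1  # at most one heterozygous SNP
            fixed[g1[1] + g2[1]] += 1

    total = 2 * len(genotypes)
    freqs = {h: 0.25 for h in HAPLOTYPES}
    post_cis = 0.5
    for _ in range(n_iter):
        # E-step: Pr(H | G) is proportional to Pr(G | H) * Pr(H);
        # Pr(G | H) = 1 for both compatible haplotype pairs.
        w_cis = freqs["11"] * freqs["22"]
        w_trans = freqs["12"] * freqs["21"]
        if w_cis + w_trans > 0:
            post_cis = w_cis / (w_cis + w_trans)
        # M-step: update frequencies from expected haplotype counts.
        expected = Counter(fixed)
        expected["11"] += n_double_het * post_cis
        expected["22"] += n_double_het * post_cis
        expected["12"] += n_double_het * (1 - post_cis)
        expected["21"] += n_double_het * (1 - post_cis)
        freqs = {h: expected[h] / total for h in HAPLOTYPES}
    return freqs, post_cis


if __name__ == "__main__":
    data = [("11", "11")] * 4 + [("22", "22")] * 4 + [("12", "12")] * 2
    print(em_haplotype_freqs(data))
```

In the toy data the unambiguous individuals carry only "11" and "22" haplotypes, so the prior quickly pushes the double heterozygotes towards the 11/22 phase. Note that if every individual is a double heterozygote, the uniform starting point is a fixed point of this iteration, so a different initialisation would be needed.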

This chapter presents the development of *SNPpattern* as a simple bioinformatic tool to rapidly screen the genome for haplotype structure, perform some basic descriptive genome statistics and link interesting haplotypes to functional information. We have tested *SNPpattern* on Ovine 60k SNPchip data (Goodswen et al., 2010). One impetus for the development of *SNPpattern* was to understand the degree of diversity in LD architecture between different livestock breeds (McKay et al. 2007). It was thought that with this increased understanding we could potentially predict the effect of genomic selection across breeds, which is based on SNPs being in LD with causal variants for the trait of interest. In addition, we expect *SNPpattern* to be used in the comparison of LD structure for detecting and localizing genomic regions where selective sweeps<sup>2</sup> have occurred (Smith et al., 1974).

#### **2. Development of** *SNPpattern*

A commonly used software package for computing LD statistics and haplotype patterns for populations from genotype data is Haploview (Barrett et al., 2005). One of the interesting features of Haploview is its ability to generate haplotype blocks. Haploview has a number of methods to partition the genome into blocks: 1) block definitions based on D´ confidence bounds, e.g. SNP pairs are defined to be either in "strong LD" (i.e. no evidence of historical recombination) or showing "strong recombination"; the algorithm is taken from Gabriel et al. (2002); 2) a block definition based on the four gamete test of Hudson & Kaplan (1985), as proposed by Wang et al. (2002). In brief, for each SNP pair, the population frequencies of the 4 possible two-SNP haplotypes are computed (e.g. for SNP 1 = A/a and SNP 2 = B/b, the 4 haplotypes are AB, Ab, aB, and ab). If all 4 haplotypes are observed with a frequency >= 0.01 (a user-definable threshold), a recombination is assumed to have occurred. If 3 or fewer haplotypes are observed, no recombination is assumed. A block is formed when there has been no recombination for successive SNP pairs.
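The four gamete test described above can be sketched in a few lines. This is an illustrative simplification and not Haploview's implementation: it assumes phased, biallelic haplotypes given as equal-length strings, tests only adjacent SNP pairs, and uses hypothetical function names.

```python
from collections import Counter


def shows_recombination(haplotypes, i, j, min_freq=0.01):
    """Four gamete test for SNPs i and j.

    Returns True when all 4 two-SNP haplotypes are observed with a
    frequency of at least min_freq (the user-definable threshold).
    """
    n = len(haplotypes)
    counts = Counter((h[i], h[j]) for h in haplotypes)
    observed = [pair for pair, c in counts.items() if c / n >= min_freq]
    return len(observed) == 4


def four_gamete_blocks(haplotypes, min_freq=0.01):
    """Split SNP indices into blocks of successive SNP pairs that show
    no evidence of recombination (adjacent pairs only, an assumption)."""
    n_snps = len(haplotypes[0])
    blocks, current = [], [0]
    for snp in range(1, n_snps):
        if shows_recombination(haplotypes, snp - 1, snp, min_freq):
            blocks.append(current)   # close the block at the breakpoint
            current = [snp]
        else:
            current.append(snp)
    blocks.append(current)
    return blocks


if __name__ == "__main__":
    sample = ["1111", "1122", "2211", "2222"] * 5
    print(four_gamete_blocks(sample))
```

For the toy sample, SNPs 0-1 and SNPs 2-3 each show only two gametes, while the pair spanning SNPs 1-2 shows all four, so the sketch reports two blocks.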

HaploBlock is a software package, which has as one of its capabilities the inference of haplotype block models from phased or unphased data. It primarily uses a Markov chain and can account for recombination hotspots, bottlenecks, genetic drift and mutations (Greenspan & Geiger, 2004). HapBlock (a different program to the similarly named

<sup>2</sup> A selective sweep can be caused when there is a strong directional selection for a favourable new allele that increases its frequency. Alleles in close proximity to the new allele are "swept" to fixation.

one LD estimate in one region relates to an LD estimate in another region because SNP pairs are not necessarily independent (i.e. one region may functionally affect another region), and consequently this diminishes the certainty of which SNPs belong to which haplotype block. For example, there are cases where 2 SNPs exhibit strong pairwise LD but show different r<sup>2</sup> values to a SNP in between, and low pairwise LD is not necessarily indicative of high ancestral recombination. In other words, SNPs in close proximity are not always in pairwise LD and, by contrast, SNPs far apart can be in pairwise LD (Phillips et al., 2003). We can also expect the haplotype block boundaries to differ depending on the sample size and SNP density used. Another limitation of r<sup>2</sup>, particularly for marker-assisted selection in livestock, is that the r<sup>2</sup> between a SNP marker and a potential causal variant can be the same in different populations, and yet the phase may be different (Roos et al., 2008). Deriving clear information about the joint inheritance of alleles in a chromosome segment is also not expected to be easy from r<sup>2</sup> measures. It is argued instead that we can infer the joint inheritance of alleles by inferring which haplotype blocks were inherited, if we know which haplotype blocks exist in a particular population, i.e. we can make inferences about identity by descent (IBD) of alleles in particular regions. In light of these shortcomings, a multiple SNP allele block approach in preference to r<sup>2</sup> was implemented through *SNPpattern*.

The required input data for *SNPpattern* is phased genotype data from either a single group or multiple groups of individuals (e.g. from different animal breeds or subpopulations). The premise of the multiple SNP allele block approach is to count the frequency of SNP allele blocks, of different sizes, found in the genomes of the group members. For example, a block of 5 SNPs spanning a few thousand base pairs could potentially comprise 32 different SNP allele patterns if the SNPs were totally independent and the population was of infinite size (the number of possible SNP allele patterns is 2<sup>n</sup>, where n is the number of SNPs in the block). The general process of the program is that it counts the frequency of the various SNP allele patterns found in the same chromosomal location (the same SNP allele block region) across each individual in the group sample, then repeats the process for the next SNP allele block region along the genome, and so on. From the counts we can infer the haplotype blocks after taking into account the population structure and allelic frequencies (a user of *SNPpattern* also needs to be aware that there are numerous other population genetic factors that affect LD and determine haplotypes). The inferred haplotype block represents a region with a few distinct SNP allele patterns (indicating a small amount of haplotype variation) in a population, separated by a region with many SNP allele patterns (indicating an excessive amount of haplotype variation) in the population. In a typical short chromosome segment, we can expect only a few distinct SNP allele patterns. Hence the larger the SNP allele block size, the less likely it is that the distinct SNP allele patterns appear by chance, because of the increased probability of recombination over larger distances. It is argued that the comparison of SNP allele pattern counts can be used as a measure of genetic distance, and this comparison forms the basis for a haplotype diversity analysis within and between groups.

In addition to implementing the core components for the multiple SNP allele block approach, *SNPpattern* also implements similarity scoring between individuals*.* We can expect that the more the SNP allele patterns between two individuals are similar the more likely they will have a similar haplotype structure. Taking this one step further, if two individuals share the same extended SNP allele patterns over the same genomic region, the chance that they carry the same causal variant allele relationship by descent is much higher.

HaploBlock program) provides both a parametric dynamic programming algorithm for block partitioning with a fixed genome coverage using the minimum number of tag SNPs, and a discrete dynamic programming algorithm for block partitioning with a fixed number of tag SNPs that can cover maximum length of genome (Zhang, 2005). Finally, GERBIL is another software package that implements an algorithm for simultaneously phasing genotypes into haplotypes and block partitioning. It considers the phasing and the block partitioning as a maximum likelihood problem and uses the EM algorithm to solve it (Kimmel & Shamir, 2005). Table 1 shows a brief summary of the publicly available programs that provide functionality to define haplotype blocks from genotype data.


| Program | Primary LD metric | Visualisation of LD | PHASE/fastPHASE import\$\$ | Implementation | OS |
|---|---|---|---|---|---|
| Haploview | D´ and r<sup>2</sup> | Yes | No | Java | Linux |
| HapBlock | D´ | No | No | C++ | Linux |
| HaploBlock | \*\* | No | No | ANSI C | Linux |
| GERBIL | ++ | Yes | No | Java/C++ | Linux |
| SNPpattern | Pattern frequency in block | No | Yes | Perl | Linux |

Table 1. Freely available programs providing "haplotype block definition from genotype data" functionality.

LD = Linkage disequilibrium; OS = Operating System platform; \*\* a Bayesian Network statistical model with a Markov chain at its core; ++ uses an expectation-maximization (EM) algorithm; \$\$ imports genotype data in PHASE/fastPHASE format without modification. The OS column additionally lists Windows for three of the programs.

In studies on human populations it has been shown that the human genome can be divided into haplotype blocks (Gabriel et al., 2005). A haplotype block is an ancestrally conserved region of varying size containing only a few common haplotypes in the population. The haplotype blocks have discrete boundaries defined by recombination hotspots (Wall et al., 2003; Phillips et al., 2003). *SNPpattern* implements a haplotype-block model as an empirical approach to best describe the linkage disequilibrium (LD) patterns. From a *SNPpattern* programming perspective, a haplotype block within a population is inferred from a region on the chromosome where there is a low SNP allele pattern count for a particular block size, separated by a region with a large SNP allele pattern count. It is proposed that a block with a large count relative to other counts along the chromosome is a region where more historical recombination events have occurred.

Whilst the importance of pairwise measures of LD is acknowledged, they may not always be the most appropriate measure of how strong LD is across an entire region that contains many SNPs. In particular, identifying precise haplotype-block boundaries may be difficult when using r<sup>2</sup>. The r<sup>2</sup> measure produces, for each pair of SNPs, an LD strength estimate fundamentally based on probability. There is no practical evidence to explain a difference in the values of r<sup>2</sup> between other paired SNPs in adjacent and more distant regions. Pairwise measures of LD differ from SNP to SNP, and defining haplotype blocks is especially open to interpretation when r<sup>2</sup> values range between 0 and 1. There is uncertainty as to how one LD estimate in one region relates to an LD estimate in another region, because SNP pairs are not necessarily independent (i.e. one region may functionally affect another region), and this diminishes the certainty of which SNPs belong to which haplotype block. For example, there are cases where 2 SNPs exhibit strong pairwise LD but show different r<sup>2</sup> to a SNP in between, and weak pairwise LD is not necessarily indicative of high ancestral recombination. In other words, SNPs in close proximity are not always in pairwise LD and, by contrast, SNPs far apart can be in pairwise LD (Phillips et al., 2003). We can also expect the haplotype block boundaries to differ depending on the sample size and SNP density used.

Another limitation of r<sup>2</sup>, particularly for marker-assisted selection in livestock, is that the r<sup>2</sup> between a SNP marker and a potential causal variant can be the same in different populations and yet the phase may be different (Roos et al., 2008). Clear information about the joint inheritance of alleles in a chromosome segment is also not expected to be easy to derive from r<sup>2</sup> measures. It is argued instead that we can infer the joint inheritance of alleles by inferring which haplotype blocks were inherited, provided we know which haplotype blocks exist in a particular population, i.e. we can make inferences about identity by descent (IBD) of alleles in particular regions. In light of these shortcomings, a multiple SNP allele block approach was implemented through *SNPpattern* in preference to r<sup>2</sup>.
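For reference, the pairwise r<sup>2</sup> measure discussed above can be computed from phased haplotypes as in the following minimal Python sketch. It uses the standard definition r<sup>2</sup> = D<sup>2</sup> / (p<sub>A</sub>(1 − p<sub>A</sub>) p<sub>B</sub>(1 − p<sub>B</sub>)) with D = p<sub>AB</sub> − p<sub>A</sub>p<sub>B</sub>. The function name `pairwise_r2` and the example data are hypothetical and are not part of *SNPpattern*.

```python
def pairwise_r2(haplotypes, i, j):
    """Return r^2 between SNPs i and j from phased haplotype rows (alleles coded 1 or 2)."""
    n = len(haplotypes)
    p_a = sum(h[i] == 1 for h in haplotypes) / n   # frequency of allele 1 at SNP i
    p_b = sum(h[j] == 1 for h in haplotypes) / n   # frequency of allele 1 at SNP j
    p_ab = sum(h[i] == 1 and h[j] == 1 for h in haplotypes) / n  # joint frequency
    d = p_ab - p_a * p_b                            # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                         # undefined when a SNP is monomorphic
    return d * d / denom


# Four hypothetical phased chromosomes typed at three SNPs.
example_haplotypes = [(1, 1, 1), (1, 1, 2), (2, 2, 1), (2, 2, 2)]
r2_linked = pairwise_r2(example_haplotypes, 0, 1)    # SNPs 0 and 1 always co-occur
r2_unlinked = pairwise_r2(example_haplotypes, 0, 2)  # SNPs 0 and 2 are independent
```

Each call gives one number per SNP pair, which is why a region with many SNPs yields a matrix of values that still has to be interpreted before block boundaries can be drawn.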

The required input data for *SNPpattern* are phased genotype data from either a single group or multiple groups of individuals (e.g. from different animal breeds or subpopulations). The premise of the multiple SNP allele block approach is to count the frequency of SNP allele blocks, of different sizes, found in the genomes of the group members. For example, a block of 5 SNPs spanning a few thousand base pairs could potentially comprise 32 different SNP allele patterns if the SNPs were totally independent and the population was of infinite size (the number of possible SNP allele patterns is 2<sup>*n*</sup>, where *n* is the number of SNPs in the block). The program counts the frequency of the various SNP allele patterns found at the same chromosomal location (the same SNP allele block region) across all individuals in the group sample, then repeats the process for the next SNP allele block region along the genome, and so on. From the counts we can infer the haplotype blocks after taking into account the population structure and allelic frequencies (a user of *SNPpattern* also needs to be aware that numerous other population genetic factors affect LD and determine haplotypes). An inferred haplotype block is a region with few distinct SNP allele patterns (indicating a small amount of haplotype variation) in a population, separated from the next block by a region with many SNP allele patterns (indicating a large amount of haplotype variation). In a typical short chromosome segment, we can expect only a few distinct SNP allele patterns. Hence, the larger the SNP allele block size, the less likely it is that a recurring SNP allele pattern has arisen by chance, because the probability of recombination increases over larger distances. It is argued that the comparison of SNP allele pattern counts can be used as a measure of genetic distance, and this comparison forms the basis for a haplotype diversity analysis within and between groups.
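The counting step described above can be illustrated with a short Python sketch. It is not the Perl implementation of *SNPpattern*. The function name `block_pattern_counts`, the use of consecutive non-overlapping blocks, and the choice to ignore trailing SNPs that do not fill a whole block are assumptions made for illustration.

```python
from collections import Counter


def block_pattern_counts(haplotypes, block_size):
    """Count SNP allele patterns in consecutive, non-overlapping blocks.

    haplotypes: equal-length strings of '1'/'2', one per phased chromosome row.
    Returns one Counter per SNP allele block region, in genome order.
    """
    n_snps = len(haplotypes[0])
    counts = []
    # Trailing SNPs that do not fill a whole block are ignored (an assumption).
    for start in range(0, n_snps - block_size + 1, block_size):
        counts.append(Counter(h[start:start + block_size] for h in haplotypes))
    return counts


# Four hypothetical chromosome rows, six SNPs each.
rows = ["112212", "112211", "112212", "221212"]
counts = block_pattern_counts(rows, 3)
max_patterns = 2 ** 3  # 2^n possible patterns for an n-SNP block

for region, pattern_count in enumerate(counts, start=1):
    print(region, dict(pattern_count),
          f"{len(pattern_count)} of {max_patterns} possible patterns observed")
```

A region where only a few of the 2<sup>*n*</sup> possible patterns are observed is a candidate haplotype block, whereas a region where many patterns are observed suggests more historical recombination.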

In addition to implementing the core components of the multiple SNP allele block approach, *SNPpattern* also implements similarity scoring between individuals. We can expect that the more similar the SNP allele patterns of two individuals are, the more likely the individuals are to have a similar haplotype structure. Taking this one step further, if two individuals share the same extended SNP allele patterns over the same genomic region, the chance that they carry the same causal variant allele by descent is much higher.
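The chapter does not state how the similarity score is computed, so the following Python sketch shows only one plausible way to score two phased chromosome rows. It reports the fraction of blocks with identical SNP allele patterns and the longest run of consecutive shared blocks, the latter as a proxy for an "extended" shared pattern. Both measures and the function name `shared_block_similarity` are assumptions for illustration.

```python
def shared_block_similarity(row_a, row_b, block_size):
    """Return (fraction of blocks with identical patterns, longest run of shared blocks)."""
    n_blocks = min(len(row_a), len(row_b)) // block_size
    shared = longest = run = 0
    for b in range(n_blocks):
        start = b * block_size
        if row_a[start:start + block_size] == row_b[start:start + block_size]:
            shared += 1
            run += 1                  # extend the current run of shared blocks
            longest = max(longest, run)
        else:
            run = 0                   # a differing block ends the run
    fraction = shared / n_blocks if n_blocks else 0.0
    return fraction, longest


# Two hypothetical rows that differ only in the third 3-SNP block.
score = shared_block_similarity("112212211111", "112212221111", 3)
```

A long run of shared blocks over the same region is the situation in which the text argues that identity by descent is more likely than identity by chance.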


Linkage disequilibrium mapping to identify the chromosomal region (the haplotype block) containing a QTL has proven to be a powerful tool (Barrett et al., 2005; Hayes et al., 2006). However, once the haplotype block has been identified, LD provides no further information to help localize the actual variants within the block (Rioux, 2001). It has been proposed that advantageous mutations through directional selection are more likely to occur in a region of low recombination (Wall et al., 2003). Conversely, there is evidence that some alleles in recombination hotspots are more likely to initiate the double-strand break associated with recombination (Jeffreys & Neumann, 2002). One of the outputs from *SNPpattern* is a list of chromosomal start and end locations of SNP allele blocks identified as having low and/or high haplotype diversity. The program testing section of this chapter demonstrates how this output list can be used to link the identified regions to genomic annotation, using the *FunctSNP* R package that we developed earlier (Goodswen et al., 2010). To recover the biological role of genomic regions with low haplotype diversity, a systems genetics or systems biology approach would be needed, as demonstrated in Kadarmideen et al. (2006) and Kadarmideen (2008).

#### **3. Implementation of** *SNPpattern*

*SNPpattern* was written in the Perl programming language. The following sections describe the methods and rationale that have shaped its development. We have tested *SNPpattern* on Ovine 60k SNPchip data, and these results are based on our earlier work (Goodswen et al., 2010).

#### **3.1 Input data**

The default PHASE and fastPHASE output format has been adopted as the format required for the initial input data to *SNPpattern*. Figure 2 shows the format and is described here as it governs how the data are processed and is an aid to understanding the methods to be described later.

The genotype data for each individual is represented by 3 rows. On the first row is a unique identification of the individual. The second and third rows are the genotypes of the individual. For each consecutive locus, one allele is entered on the second row, and one on the third. *SNPpattern* expects that genotypes are phased such that the entire second row is inherited from one parent and the third row from the other parent. It is also expected that the alleles appear in the sequential order that they occur on the chromosome.
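As a minimal sketch of how the three-row layout described above might be read, the following Python function parses a `BEGIN GENOTYPES` … `END GENOTYPES` section into one pair of allele rows per individual. This is an illustrative reader, not the Perl code of *SNPpattern*. The function name `parse_phase_genotypes`, the treatment of everything after `#` as the identifier, and the sample data are assumptions.

```python
def parse_phase_genotypes(text):
    """Parse a BEGIN GENOTYPES ... END GENOTYPES section into {id: (row1, row2)}."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    start = lines.index("BEGIN GENOTYPES") + 1
    end = lines.index("END GENOTYPES")
    body = lines[start:end]
    if len(body) % 3 != 0:
        raise ValueError("expected exactly 3 rows per individual")

    individuals = {}
    for k in range(0, len(body), 3):
        ident = body[k].lstrip("#").strip()   # row 1: identifier line, e.g. "# id 1"
        row1 = body[k + 1].split()            # row 2: alleles inherited from one parent
        row2 = body[k + 2].split()            # row 3: alleles inherited from the other parent
        if len(row1) != len(row2):
            raise ValueError(f"unequal row lengths for individual {ident!r}")
        individuals[ident] = (row1, row2)
    return individuals


sample = """BEGIN GENOTYPES
# id 1
1 2 1 2
1 1 2 2
# id 2
2 2 1 1
1 2 1 2
END GENOTYPES"""
parsed = parse_phase_genotypes(sample)
```

The check that both allele rows have the same length reflects the requirement that one allele is entered on each row for every consecutive locus.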


    BEGIN GENOTYPES
    # id 1
    1 2 1 2 2 2 1 2 1 2 1 2 1 2 2 2 1 1 2 1
    1 1 2 2 1 2 1 2 2 2 1 2 1 1 2 2 1 2 1 2
    # id 2
    2 2 1 1 2 1 2 2 2 1 2 1 2 1 2 1 2 1 2 2
    1 2 1 2 2 2 1 2 1 2 1 2 1 2 2 2 1 1 2 1
    # id 3
    2 2 1 1 1 2 2 1 2 1 2 2 2 1 2 1 2 1 2 2
    1 2 2 1 1 2 2 1 2 1 2 2 2 1 1 1 2 2 2 1
    END GENOTYPES

Fig. 2. Data input format for SNPpattern.

#### **3.2 Grouping data**

In its simplest form, *SNPpattern* will accept an input such as the one shown in Figure 2 and treat all individuals as members of the same group. The output will consequently be results for haplotype diversity within a group, for the entire genome and without any reference to the chromosomal location of the haplotypes. It is nevertheless expected (although not mandatory) that an additional file containing phenotypic information about each individual be provided as input. Table 2 shows, as an example, the first 9 lines of a fictitious phenotype file, in this instance one specific to livestock species.

Grouping data is an essential part of the evaluation of haplotype diversity *between* groups. It is also critical for handling the count biases that may be introduced by population structure. For example, if in a particular sire breed group the number of progeny from each sire is disproportionate, then the SNP allele pattern count will be biased in favour of the progeny with the largest number of siblings. Grouping an equal number of progeny from each sire should prevent the bias. *SNPpattern* can group the genotype data of individuals according to user-defined criteria applied to the columns of a phenotypic file. In principle the program can create a group based on any combination of columns joined with "AND" Boolean logic, for example all individuals sharing a sire breed AND a year of birth. A separate output file containing the genotype data of the group members is generated for each group criterion. The output format is the same as that shown in Figure 2. The program also allows comparison operators (=, >=, <=, >, <) on any combination of column criteria. For example, to group all the female progeny in area 03 born after 1975 and having a particular parent ID, the equivalent pseudo code is sex = F AND Area = 03 AND Year of Birth > 1975 AND parent ID = 433. The grouping of the data is at the discretion of the researcher, whose aim is to create genetically meaningful groups. Summary information about the groups can also be generated. *SNPpattern* provides the flexibility of the grouping through a configuration file in an INI file format.
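The AND-combined column criteria described above can be sketched in Python as follows. The records are modelled on a few rows of the fictitious phenotype file (Table 2). The dictionary keys, the `select_group` function and the representation of criteria as `(column, operator, value)` tuples are assumptions for illustration, and the actual INI configuration syntax of *SNPpattern* is not reproduced here.

```python
import operator

# The comparison operators the text lists, mapped to Python functions.
OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le,
       ">": operator.gt, "<": operator.lt}


def select_group(records, criteria):
    """Return the records that satisfy every (column, op, value) criterion (Boolean AND)."""
    return [r for r in records
            if all(OPS[op](r[col], value) for col, op, value in criteria)]


# A few rows modelled on the fictitious phenotype file.
records = [
    {"id": 1, "parent_id": 330, "region": "01", "sex": "F", "year_of_birth": 1978},
    {"id": 2, "parent_id": 330, "region": "01", "sex": "M", "year_of_birth": 1971},
    {"id": 5, "parent_id": 433, "region": "03", "sex": "M", "year_of_birth": 1972},
    {"id": 7, "parent_id": 433, "region": "04", "sex": "F", "year_of_birth": 1979},
    {"id": 8, "parent_id": 405, "region": "05", "sex": "F", "year_of_birth": 1974},
]

# sex = F AND year_of_birth > 1975
group = select_group(records, [("sex", "=", "F"), ("year_of_birth", ">", 1975)])
group_ids = [r["id"] for r in group]
```

The identifiers in `group_ids` could then be used to pull the matching three-row entries out of the genotype file and write them to a separate group file.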

Another optional file that can be provided as input is a SNP mapping file. Such a file allows the contents of a group file to be divided further into separate chromosome files. Dividing the genotypes for the entire genome into their respective chromosomes allows the haplotype diversity of a particular chromosome in one group to be compared with that of the same chromosome in another group. The fact that selective sweeps act differently in different chromosomes is one example of why a study of haplotype diversity may be needed on a chromosome basis (Montpetit & Chagnon, 2006). The SNP mapping file must contain the location of each SNP and the number of the chromosome on which it resides. *SNPpattern* expects the SNPs in the file to be in the order in which they are located on the chromosome. The SNP mapping file will most likely have been obtained from another source and will contain information that *SNPpattern* does not need. Therefore another configuration file, specific to dividing the genome into chromosomes, allows extraction of only the required SNP location and chromosome number without the need for the researcher to modify the SNP mapping file. It may be questioned why a separate file is created for each group and/or each chromosome subgroup. From a programming perspective, separate files are created for 3 reasons: 1) the output file format used is the same as that of PHASE and fastPHASE, and it is envisaged that the *SNPpattern* group files could be imported into other programs that use this format; 2) the separate files are a permanent record of the grouping that can be reused, as opposed to a temporary grouping only at runtime; and 3) the data files can be extremely large and would be slower to parse if all groups were recorded separately but in a single file.
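The division of a genome-wide allele row into chromosomes can be sketched in Python as below. The sketch assumes the SNP map has already been reduced to one `(chromosome, position)` pair per SNP, listed in the same order as the alleles. The function name `split_by_chromosome` and the example map are hypothetical.

```python
def split_by_chromosome(alleles, snp_map):
    """Split one allele row into per-chromosome lists.

    alleles: sequence of alleles in genome order.
    snp_map: list of (chromosome, position) tuples in the same order as alleles.
    """
    if len(alleles) != len(snp_map):
        raise ValueError("the SNP map must describe every allele in the row")
    segments = {}
    for allele, (chromosome, _position) in zip(alleles, snp_map):
        # The order within each chromosome is preserved because the map is ordered.
        segments.setdefault(chromosome, []).append(allele)
    return segments


# Seven hypothetical SNPs spread over three chromosomes.
snp_map = [("1", 100), ("1", 250), ("1", 900),
           ("2", 50), ("2", 75),
           ("3", 10), ("3", 20)]
segments = split_by_chromosome(list("1221121"), snp_map)
```

Applying the same split to both allele rows of every individual in a group would give the content of one output file per chromosome.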

| Human ID | Parent ID | Region | Sex | Parent ethnicity | Year of birth | Body weight (kg) |
|---|---|---|---|---|---|---|
| 1 | 330 | 01 | F | American | 1978 | 74.6 |
| 2 | 330 | 01 | M | American | 1971 | 99.0 |
| 3 | 405 | 02 | F | African | 1970 | 77.4 |
| 4 | 405 | 02 | M | African | 1975 | 63.6 |
| 5 | 433 | 03 | M | Asian | 1972 | 79.0 |
| 6 | 433 | 03 | M | Asian | 1971 | 67.0 |
| 7 | 433 | 04 | F | Asian | 1979 | 73.0 |
| 8 | 405 | 05 | F | European | 1974 | 97.4 |
| 9 | 405 | 05 | M | European | 1976 | 94.0 |

Table 2. Example contents of a phenotypic file.

#### **3.3 Multiple SNP allele block approach**

This section describes the multiple SNP allele block approach implemented through *SNPpattern*. We have tested *SNPpattern* on Ovine 60k SNPchip data, and Tables 3-6 are based on our previously published work (Goodswen et al., 2010). With reference to Figure 2, the 2 rows of biallelic SNPs contained within the phased genotype file are extracted (in this instance, a single 1 or 2 constitutes a SNP allele). One row represents the SNP alleles inherited from one parent, and the second row represents the SNP alleles inherited from the other parent. In effect, we have a representation of paternal and maternal chromosomes composed of a long serial SNP allele pattern of 1s and 2s. Without prior knowledge, the user will not know which row represents which parental chromosome. However, as the SNP allele pattern analysis progresses, the identity of the rows may become apparent, as will be demonstrated in the program testing section.

The underlying unit of the multiple SNP allele block approach is the SNP allele block. The serial SNP allele pattern from *one* row (e.g. representing the chromosome inherited from the paternal side) is divided into blocks of any specified number of SNP alleles at the discretion of the researcher, e.g. 3, 5, 10 or 100 (or more) SNP alleles per block. Then, if required, the SNP allele pattern from the other row is divided into blocks of the same specified size. Figure 3 shows the first 40 numbers of a SNP allele pattern of 1s and 2s that represents either a paternal or maternal chromosome for one individual. In this example, the entire SNP allele pattern is divided into blocks of 3, e.g. the first 3 blocks are "112", "212", and "211". For an *n* SNP allele block there are 2<sup>*n*</sup> possible SNP allele pattern combinations of 1 and 2. Therefore, a 3 SNP allele block has 2<sup>3</sup> = 8 possible patterns (111, 112, 121, 122, 211, 212, 221, and 222).

    112 212 211 111 111 211 211 211 222 121 212 211 122 2 … continued
     1   2   3   4   5   SNP allele blocks

Fig. 3. Consecutive 3-SNP allele blocks along 1 row representing either a paternal or maternal chromosome.

For each SNP allele block along the row that represents either the paternal or maternal chromosome, we count how many individuals in the group have the same SNP allele pattern. For example, at block location 1 (Figure 3) we count, for each of the 8 possible SNP allele combinations, how many individuals have the SNP allele pattern "111", then "112", etc. Table 3 shows an example of the SNP allele pattern count at the first 3-SNP allele block along a paternal chromosome. We could expect an equal chance of observing any one of the 8 possible SNP allele combinations (assuming the SNP allele frequencies were equal) if there was no underlying association between the 3 SNPs in the block. In reality, however, we have a SNP allele pattern count profile that is the result of many generations of random and nonrandom SNP inheritance from a common ancestor. A challenge is to determine which of the 8 possible SNP allele combinations exist because the 3 SNPs were inherited by descent from a common ancestor and which exist by chance alone. For long SNP allele patterns (e.g. 10 or more SNP alleles per block)<sup>3</sup> that can be inferred to be a haplotype block, identity is more likely to be by descent. For short SNP allele patterns (e.g. 3 SNP alleles per block) inferred to be a haplotype block, identity is more likely to be by chance. Nevertheless, we can statistically test whether the observed count distribution has arisen from independently segregating SNPs. It *is* debatable, however, whether the test will achieve the desired results. If SNPs are very close, then we would expect them not to segregate independently, and the observed counts to arise more from genetic drift (i.e. some SNP allele patterns are more frequent because of limited population size and the large contribution of only some ancestors to the current population). Despite this concern, expected and observed counts are still tested for statistical significance in an attempt to meet the challenge. To determine the expected SNP allele pattern count, *SNPpattern* computes the SNP allele frequencies (Table 3). For example, the expected proportion for SNP allele pattern "111", based on the allelic frequencies of each of the 3 SNPs and assuming independence, is 0.072 (Pr(SNP 1 allele 1) × Pr(SNP 2 allele 1) × Pr(SNP 3 allele 1)). The expected count for SNP allele pattern "111" is therefore 72 (proportion expected × number of individuals).

Based on the allele frequencies, the null hypothesis is that the observed and expected counts are the same and, as a consequence, that the SNPs are independent (i.e. the SNPs are segregating independently). A Fisher's Exact Test<sup>4</sup> for count data is applied as a statistical significance test for each SNP allele pattern. Table 5 shows an example of how the data for SNP allele pattern "111" from Table 4 are used in a 2 × 2 contingency table to compute the exact probability of observing a table with this result (Equation 4). The p-values are obtained directly using the hypergeometric distribution. The p-values (examples shown in Table 3) are used as the conditional criteria to determine which SNP allele patterns were most likely to have occurred by chance. In this example, the low p-value for pattern "111" indicates that the null hypothesis is unlikely to be true and therefore that the SNPs within the pattern are not independent. Whether the challenge of distinguishing SNPs in a haplotype block from SNPs in a random pattern was met is reviewed in the discussion section.

From a *SNPpattern* implementation perspective, some difficulty was encountered in programming Fisher's Exact Test. A Perl module (Text::NSP::Measures::2D::Fisher2) downloaded from CPAN<sup>5</sup> is currently being investigated for its suitability. As an interim

<sup>3</sup> Dependent on the chromosomal distance between SNPs

<sup>4</sup> Used in preference to Chi-Square test since expected counts may be less than 5

<sup>5</sup> The Comprehensive Perl Archive Network

For each SNP allele block along the row that represents either the paternal or maternal chromosome, we count how many individuals in the group have the same SNP allele pattern. For example, at block location 1 (Figure 3) we count, for each of the 8 possible SNP allele combinations, how many individuals have the SNP allele pattern "111", then "112" etc. Table 3 shows an example of the SNP allele pattern count at the first 3-SNP allele block along a paternal chromosome. We could expect an equal chance of observing any one of the 8 possible SNP allele combinations (assuming the SNP allele frequencies were equal) if there was no underlying association between the 3 SNPs in the block. In reality however, we have a SNP allele pattern count profile which is a result of many generations of random and nonrandom SNP inheritance from a common ancestor. A challenge is to determine which of the 8 possible SNP allele combinations exist because the 3 SNPs were inherited by descent from a common ancestor and which SNP allele combinations exist by chance alone. For long SNP

Table 2. Example contents of a phenotypic file.

apparent as will be demonstrated in the program testing section.

**3.3 Multiple SNP allele block approach** 

121, 122, 211, 212, 221, and 222).

Fig. 3. Consecutive 3-SNP allele blocks along 1 row representing either a paternal or maternal chromosome.

allele patterns (e.g. 10 or more SNP alleles per block)3 that can be inferred to be a haplotype block, identity is more likely by descent. For short SNP allele patterns (e.g. 3 SNP alleles per block) inferred to be a haplotype block, it is more likely identical by chance. Nevertheless, we can statistically test whether the observed count distribution has arisen from independently segregating SNPs. On the other hand, it *is* debatable whether the test will achieve the desired results. If SNPs are very close then we would expect SNPs not to segregate independently, and the observed counts arise more from genetic drift (i.e. some SNP allele patterns are more frequent due to limited population size and the large effect of the contribution of only some ancestors to the current population). Despite the latter concern, in an attempt to meet this challenge, expected and observed counts are still tested for statistical significance. To determine the expected SNP allele pattern count, *SNPpattern* computes the SNP allele frequencies (Table 3). For example, the expected proportion for SNP allele pattern "111" based on the allelic frequencies of each of the 3 SNPs and assuming independence is 0.072 ( Pr(SNP 1 allele 1) \* Pr(SNP 2 allele 1) \* Pr(SNP 3 allele 1)). The expected count for SNP allele pattern "111" is therefore 72 (proportion expected \* number of individuals). Based on the allele frequencies, the null hypothesis is that we expect the observed and

expected count to be the same, and as a consequence the SNPs to be independent (i.e. SNPs are segregating independently). A Fisher's Exact Test4 for count data is applied as a statistical significance test for each SNP allele pattern. Table 5 shows an example of how the data for SNP allele pattern "111" from Table 4 is used in a 2 \* 2 contingency table to compute the exact probability of observing a table with this result (Equation 4). The p-values are obtained directly using the hypergeometric distribution. The p-values (examples shown in Table 3) are used as the conditional criteria to determine which SNP allele patterns were most likely to have occurred by chance. In this example, the low p-value for pattern "111" indicates that the hypothesis is unlikely to be true and therefore the SNPs within the pattern are not independent. The success as to whether the challenge was met of distinguishing SNPs in a haplotype block from SNPs in a random pattern is reviewed in the discussion section.
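The expected count and the significance test described above can be illustrated with a short Python sketch. This is not the *SNPpattern* code (which is written in Perl and, as an interim measure, delegates the test to R); the function names `expected_count` and `fisher_exact_two_sided` are hypothetical, and the two-sided rule used here (summing every table that is no more probable than the observed one) is an assumption about the convention, not something stated in the text. The inputs are the rounded allele frequencies from Table 4 and the counts from Table 5.

```python
from math import comb, prod


def expected_count(allele_freqs, n_individuals):
    """Expected pattern count if the SNPs segregate independently."""
    return prod(allele_freqs) * n_individuals


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with all
        # margins fixed; algebraically the same quantity as Equation 4.
        return comb(row1, x) * comb(n - row1, col1 - x) / total

    p_observed = prob(a)
    lowest = max(0, row1 + col1 - n)
    highest = min(row1, col1)
    # Assumed two-sided rule: sum every table that is no more probable
    # than the observed one (small tolerance for floating-point ties).
    p_value = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Rounded allele 1 frequencies for SNPs 1-3 (Table 4) and the group size.
print(round(expected_count([0.99, 0.10, 0.73], 1003)))  # 72
# Observed and expected counts for pattern "111" (Table 5).
print(fisher_exact_two_sided(14, 72, 989, 931))
```

The second printed value can be compared with the p-value reported for pattern "111" in Table 3. Exact integer binomial coefficients are used so that the very small tail probabilities are not lost to rounding.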

From a *SNPpattern* implementation perspective, some difficulty was encountered in programming Fisher's Exact Test. A Perl module (Text::NSP::Measures::2D::Fisher2) downloaded from CPAN<sup>5</sup> is currently being investigated for its suitability. As an interim measure, the statistical programming language R (http://www.r-project.org/) was used. *SNPpattern* can output a file containing a list of the observed SNP allele pattern counts per block in the first column and the expected SNP allele pattern counts per block (computed from the individual SNP allele frequencies) in the second column. The output file can be read directly into R and used as input to the function fisher.test() to conduct Fisher's Exact Test for count data.

<sup>3</sup> Dependent on the chromosomal distance between SNPs

<sup>4</sup> Used in preference to Chi-Square test since expected counts may be less than 5

<sup>5</sup> The Comprehensive Perl Archive Network

| SNP allele pattern | Observed SNP allele pattern count | Expected SNP allele pattern count | Proportion observed | Proportion expected | p-value\$\$ |
|---|---|---|---|---|---|
| 111 | 14 | 72 | 0.014 | 0.072 | 6.07e-11 |
| 112 | 32 | 27 | 0.032 | 0.027 | 0.598 |
| 121 | 699 | 652 | 0.697 | 0.650 | 0.097 |
| 122 | 241 | 242 | 0.240 | 0.241 | 1.000 |
| 211 | 0 | 1 | 0 | 0.001 | - |
| 212 | 0 | 0 | 0 | 0.000 | - |
| 221 | 17 | 7 | 0.017 | 0.007 | 0.062 |
| 222 | 0 | 2 | 0 | 0.002 | - |
| Total | 1003++ | 1003++ | 1.000 | 1.000 | |

++ Number of individuals in group

\$\$ p-values are obtained directly using the hypergeometric distribution following a Fisher's Exact Test

Table 3. Example of SNP allele pattern counts at the first 3-SNP allele block along a paternal chromosome (based on Goodswen et al., 2010).

| Allele | SNP 1: Row 2++ | SNP 1: Row 3 | SNP 1: Freq.\$\$ | SNP 2: Row 2 | SNP 2: Row 3 | SNP 2: Freq. | SNP 3: Row 2 | SNP 3: Row 3 | SNP 3: Freq. |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 986 | 993 | 0.99 | 46 | 161 | 0.10 | 730 | 733 | 0.73 |
| 2 | 17 | 10 | 0.01 | 957 | 842 | 0.90 | 273 | 270 | 0.27 |
| Total | 1003 | 1003 | 1.00 | 1003 | 1003 | 1.00 | 1003 | 1003 | 1.00 |

++ Row 2 and Row 3 are the rows that represent the genotype data for each individual (refer Figure 4-1). For the genotype at SNP #1, 986 out of a total of 1003 individuals have a '1' on row 2, and 17 out of 1003 have a '2' on row 2.

\$\$ Freq. = Allelic Frequency. For example, the population frequency of '1' at the SNP 1 location is (986 + 993) / 2006 = 0.99. Likewise, the population frequency of '2' at the SNP 1 location is (17 + 10) / 2006 = 0.01.

Table 4. Example of allele frequencies for 3 sequential SNPs.

| | Observed | Expected | Row totals |
|---|---|---|---|
| Pattern found | 14 (a) | 72 (b) | 86 (a+b) |
| Pattern not found | 989 (c) | 931 (d) | 1920 (c+d) |
| Column totals | 1003 (a+c) | 1003 (b+d) | 2006 (n) |

Table 5. A 2 \* 2 contingency table for SNP allele pattern "111".

$$\Pr\left(a,b,c,d\right) = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{n!a!b!c!d!}$$

Equation 4. Fisher's formula for exact probability of observing the data in a contingency table.

Table 6 shows the SNP allele pattern count for the first 6 consecutive blocks of a paternal chromosome, using a 10-SNP allele block size. From the counts we can infer haplotype blocks. As per the *SNPpattern* premise described in the introduction, a haplotype block is a region with a low SNP allele pattern count, separated from neighbouring blocks by a region with a large SNP allele pattern count. In other words, if a block has a large SNP allele pattern count relative to the counts within other blocks along the chromosome, it is likely to be a recombination hotspot. For each paternal or maternal chromosome, *SNPpattern* computes descriptive statistics such as the average and standard deviation of the number of patterns found per block. A user-definable count threshold can be applied to filter out blocks with large SNP allele pattern counts in order to infer the haplotype blocks. By default, *SNPpattern* flags blocks whose SNP allele pattern count is more than 1 standard deviation above the average. The appropriate count threshold, and the interpretation of the inferred haplotype blocks, require thorough knowledge of the population structure of the group. It is therefore critically important that genotypes are grouped judiciously before the SNP allele patterns are counted (see the previous section, Grouping data). Note also that the chromosomal distance between SNPs is not equal, and therefore the physical size of each block of SNPs is not equal. Although *SNPpattern* computes and reports the physical block sizes, it does not adjust the SNP allele pattern counts to compensate for unequal sizes.

| SNP allele block | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| **Pattern count** | 70 | 69 | 115 | 76 | 57 | 20 |
| **H/L flag** ++ | L | L | H | L | L | L |
| **Physical block size** | 619948 | 520686 | 437805 | 394152 | 398511 | 538789 |

++ H indicates block with SNP allele pattern count greater than user defined threshold; L indicates block with SNP allele pattern count less than threshold

Table 6. SNP allele pattern counts per 10-SNP allele block along paternal chromosome.
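The block counting and default flagging rule described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the *SNPpattern* implementation: the function names are hypothetical, incomplete trailing blocks are simply dropped, the sample standard deviation is assumed (the text does not say which form is used), and the statistics are computed over only the six blocks of Table 6 rather than over a whole chromosome.

```python
from statistics import mean, stdev


def split_into_blocks(alleles, block_size):
    """Cut one row of 1/2 alleles into consecutive blocks of block_size.

    Assumption: a trailing block shorter than block_size is dropped.
    """
    n_full = len(alleles) // block_size
    return [alleles[i * block_size:(i + 1) * block_size] for i in range(n_full)]


def distinct_pattern_counts(rows, block_size):
    """Number of distinct SNP allele patterns seen at each block position.

    rows holds one allele string per individual (the same parental row).
    """
    per_row = [split_into_blocks(row, block_size) for row in rows]
    n_blocks = min(len(blocks) for blocks in per_row)
    return [len({blocks[j] for blocks in per_row}) for j in range(n_blocks)]


def flag_blocks(counts, n_sd=1.0):
    """Flag a block 'H' if its count exceeds mean + n_sd * SD, else 'L'."""
    threshold = mean(counts) + n_sd * stdev(counts)
    return ["H" if count > threshold else "L" for count in counts]


# First three 3-SNP blocks of the pattern in Figure 3.
print(split_into_blocks("112212211", 3))

# Toy group of three individuals: one pattern at block 1, two at block 2.
print(distinct_pattern_counts(["112212", "112211", "112212"], 3))

# Pattern counts of the six 10-SNP blocks in Table 6.
print(flag_blocks([70, 69, 115, 76, 57, 20]))
```

With the six counts from Table 6, only the third block (115 patterns) exceeds the mean plus 1 standard deviation, which agrees with the H/L flags shown in the table.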

In summary, the multiple SNP allele block approach uses SNP allele pattern frequency counts as a measure for making comparisons between individuals and between groups of individuals. These comparisons then allow us to make informed decisions about the general haplotype diversity. It is also expected that, by processing the same genotype data several times using different block sizes, we can fine-tune the distribution of the haplotype blocks.

**Finding similarity between individuals**

The method presented in this section was inspired by publications [80-83] on genetic distance and similarity matrices. Two genetically identical individuals (i.e. with identical DNA sequences throughout the genome) will have identical haplotype structures. It could therefore be argued that the more genetically similar two individuals are to each other, the more likely they are to have the same haplotype structure. In other words, the more closely two individuals are related, the more DNA sequence they are expected to have in common. The genotyped SNPs are of course not as accurate a unit of comparison as genome-wide nucleotide sequences. However, it is not unreasonable to assume that comparing the SNP allele patterns between two individuals will provide a guideline as to the similarity of their haplotype structure. So, although this method does not show the actual haplotype structure, the overall similarity in SNP allele patterns between individuals or groups of individuals will give an indication of similarity in haplotype structure. As a simple example we take 3

**3.4 Linking SNP allele block regions to genomic annotation** 

One of the output files from a *SNPpattern* Perl script is a file that contains all SNP allele blocks where the number of distinct SNP allele patterns is low or high. The script allows for a user-definable upper or lower threshold on the number of distinct patterns. For example, if a user enters a threshold of "<3", only SNP allele blocks with fewer than 3 distinct SNP allele patterns will be output. Likewise, if the user enters ">99", only SNP allele blocks with more than 99 distinct SNP allele patterns will be output. Figure 5 shows an example of the output file. The output consists of a list with 4 columns: the number of the chromosome containing the SNP allele block (the genomic region of interest); the start and end genomic location of the SNP allele block; the number of distinct SNP allele patterns found within the SNP allele block for a group of individuals (only genomic regions where the number of patterns is below or above the user-defined threshold are listed); and the average number of patterns per block. The intended use of the output file is to act as a starting point for a researcher to find biological meaning in regions identified as having low or high haplotype diversity. Biological meaning may help in understanding why the same alleles are conserved from generation to generation in some regions and not in others. In other words, why are there only 1 or 2 distinct SNP allele patterns in the same genomic region for all individuals in a group? Conversely, some regions have a large number of different SNP allele patterns, implying a hotspot region for recombination. Finding the underlying biology within the hotspot region may provide clues to the mechanism of recombination. The expectation is that the output list can be used for further downstream analysis, such as searching for annotation of the chromosome region within which the SNP allele block is located.

Fig. 5. Example output file showing genomic regions with low SNP allele pattern counts.

genes located between a user-specified start and end location.

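| Chromosome | Start location | End location | Number of distinct SNP allele patterns | Average number of patterns per block |
|---|---|---|---|---|
| 4 | 3000848 | 3170609 | 2 | 6.58 |
| 13 | 67400889 | 67687804 | 2 | 4.09 |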
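The user-definable threshold described in this section (e.g. "<3" or ">99") might be applied as in the following Python sketch. The real script is written in Perl and its internals are not shown in the text, so everything here is an assumption made for illustration: the `BlockRecord` type, the function names, and the rule that a threshold is a single `<` or `>` followed by a whole number. The third record is invented so that the ">99" case has something to select.

```python
import re
from typing import NamedTuple


class BlockRecord(NamedTuple):
    """One SNP allele block region (hypothetical record layout)."""
    chromosome: int
    start: int
    end: int
    n_patterns: int  # number of distinct SNP allele patterns in the block


def parse_threshold(spec):
    """Turn a threshold such as '<3' or '>99' into a predicate on a count."""
    match = re.fullmatch(r"\s*([<>])\s*(\d+)\s*", spec)
    if match is None:
        raise ValueError(f"unsupported threshold: {spec!r}")
    operator, value = match.group(1), int(match.group(2))
    if operator == "<":
        return lambda n: n < value
    return lambda n: n > value


def filter_blocks(records, spec):
    """Keep only the blocks whose distinct-pattern count meets the threshold."""
    keep = parse_threshold(spec)
    return [record for record in records if keep(record.n_patterns)]


blocks = [
    BlockRecord(4, 3000848, 3170609, 2),      # row from the example output
    BlockRecord(13, 67400889, 67687804, 2),   # row from the example output
    BlockRecord(1, 1000000, 1500000, 120),    # invented high-diversity block
]

for record in filter_blocks(blocks, "<3"):
    print(record.chromosome, record.start, record.end, record.n_patterns)
```

Returning a predicate from `parse_threshold` keeps the parsing separate from the filtering, so the same list of blocks can be screened with a low and a high threshold in turn.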
