**Meet the editor**

Ibrokhim Y. Abdurakhmonov received his B.S. degree (1997) in *Biotechnology* from the National University of Uzbekistan, M.S. degree (2001) in *Plant Breeding* from Texas A&M University, USA, Ph.D. degree (2002) in *Molecular Genetics*, Doctor of Science degree (2009) in *Genetics*, and full professorship (2011) in *Molecular Genetics and Molecular Biotechnology* from the Institute of Genetics and Plant Experimental Biology, Academy of Sciences of Uzbekistan. He founded (2012) and currently leads the Center of Genomics and Bioinformatics of Uzbekistan. He serves as an associate editor/editorial board member of several international and national journals on plant sciences. He received the 2010 Government chest badge "Sign of Uzbekistan", the 2010 TWAS prize, and the "ICAC Cotton Researcher of the Year 2013" award for his outstanding contribution to cotton genomics and biotechnology. He was elected a Fellow of The World Academy of Sciences (TWAS) in *Agricultural Science* (2014) and co-chair/chair of the "Comparative Genomics and Bioinformatics" workgroup (2015) of the International Cotton Genome Initiative (ICGI).

### Contents

**Preface XI**

**Section 3 Medical Bioinformatics 151**

Chapter 7 **Application of Bioinformatics Methodologies in the Fields of Skin Biology and Dermatology 153**
Sidra Younis, Valeriia Shnayder and Miroslav Blumenberg

Chapter 8 **The Study of Hepatitis B Virus Using Bioinformatics 177**
Trevor Graham Bell and Anna Kramvis

**Section 4 Plant Bioinformatics 201**

Chapter 9 **Bioinformatics: A Way Forward to Explore "Plant Omics" 203**
Mehboob-ur-Rahman, Tayyaba Shaheen, Mahmood-ur-Rahman, Muhammad Atif Iqbal and Yusuf Zafar

Chapter 10 **Bioinformatics Tools and Genomic Resources Available in Understanding the Structure and Function of Gossypium 231**
Venkateswara R. Sripathi, Ramesh Buyyarapu, Siva P. Kumpatla, Abreeotta J. Williams, Seloame T. Nyaku, Yonathan Tilahun, Venu Kalavacharla and Govind C. Sharma

### Preface
Bioinformatics is an interdisciplinary science, and its emergence and development have provided life scientists with many important practical analysis tools and methods to explore large-scale biological data. Bioinformatics has helped to interpret, understand, and utilize biological systems through basic and applied research in modern biological sciences, health care, and agriculture. It is expected that, with current scientific developments and technological advances, ever more complex data will be collected, a situation often referred to as the "Big Data" era. This is well exemplified by the exponential growth in the amount, scale, and diversity of genomic sequence data in recent years, with a current listing of 79,650 genome sequencing projects for 73,000 organisms, of which only 10% have been completed, 42% have draft information, and 45% are in an incomplete state. Near-future completion of the above-mentioned genome sequencing projects will create many further challenges in storing, handling, organizing, and analyzing ever-enlarging data volumes. Therefore, bioinformatics research has become essential in the genomics and post-genomics era.

Bioinformatics research and application include the analysis of molecular sequence and genomics data; genome annotation; gene/protein prediction and expression profiling; molecular folding, modeling, and design; building biological networks; development of databases and data management systems; development of software and analysis tools; bioinformatics services and workflows; mining of biomedical literature and text; and bioinformatics education and training. There is a critical need to prepare well-qualified scientists and specialists who can use modern, sophisticated bioinformatics resources, and there is a need for improvements in genome assembly, gene ontology and annotation, networking and databases, data visualization, and graphical tools.

Therefore, the objective of this *Bioinformatics* book, written by an international team of life scientists, is to provide some updates on bioinformatics methods, resources, approaches, and genome analysis tools useful for the development of the life sciences. Here, we have compiled 10 chapters that review and discuss the basic understanding and development of bioinformatics, its tools, approaches, and applications in medicine, such as "skinomics", allergology, antibody production, and hepatitis disease, as well as genome analysis, membrane studies, plant sciences, and agriculture. I trust the chapters of this book will provide advanced knowledge for university students, life science researchers, and interested readers on some of the latest developments in the bioinformatics field.

"This book addresses an unusually wide range of features and applications of the rapidly expanding new scientific field of bioinformatics. The perspective from a developing country and the focus on agricultural genomics are particularly noteworthy." – Prof. Gilbert S. Omenn, Director, Center for Computational Medicine & Bioinformatics, Professor of Computational Medicine & Bioinformatics, Internal Medicine, Human Genetics and Public Health, University of Michigan, USA.

I thank the InTech book department for giving me the opportunity to work on this book project, and Mrs. Ivona Lovrić, publication manager, for her help with my editorial duties and the publishing of this book. Many thanks to all authors of the book chapters for their valuable contributions and cooperation with my editorial requests.

**Ibrokhim Y. Abdurakhmonov**

Center of Genomics and Bioinformatics, Academy of Sciences of Uzbekistan, Tashkent, Uzbekistan

### **Introduction to Bioinformatics**


### **Bioinformatics: Basics, Development, and Future**

Ibrokhim Y. Abdurakhmonov

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/63817

### **Abstract**

Bioinformatics is an interdisciplinary scientific field of the life sciences. Bioinformatics research and application include the analysis of molecular sequence and genomics data; genome annotation, gene/protein prediction, and expression profiling; molecular folding, modeling, and design; building biological networks; development of databases and data management systems; development of software and analysis tools; bioinformatics services and workflows; mining of biomedical literature and text; and bioinformatics education and training. The astronomical accumulation of genomics, proteomics, and metabolomics data, as well as the need for their storage, analysis, annotation, organization, systematization, and integration into biological networks and database systems, were the main driving forces for the emergence and development of bioinformatics. Among the current critical needs for bioinformatics highlighted in this chapter, however, are to understand the basics and specifics of bioinformatics and to prepare a new generation of scientists and specialists with integrated, interdisciplinary, and multilingual knowledge who can use modern bioinformatics resources powered by sophisticated operating systems, software, and database/networking technologies. In this introductory chapter, I aim to give readers an overall picture of the basics and developments of the bioinformatics field, with some future perspectives, highlighting the chapters published in this book.

**Keywords:** bioinformatics, databases, molecular sequence analysis, software and analysis tools, bioinformatics training

### **1. Introduction**

Biological data can be described as molecular sequence information and the experimentally derived ("wet-bench") content of genome and gene product analyses [1]. Being an interdisciplinary branch of the life sciences, bioinformatics aims to develop methodology and analysis tools to explore large volumes of biological data, helping to store, organize, systematize, annotate, visualize, query, mine, understand, and interpret complex data volumes. It uses conventional and modern computer science and cloud computing, statistics, and mathematics, as well as pattern recognition, reconstruction, machine learning, simulation and iterative approaches, and molecular modeling/folding algorithms [1, 2]. The emergence and advances of the bioinformatics field, however, are tightly associated with the computerized programming and software developments needed for the handling and structural and functional analysis of large volumes of molecular sequences of DNA, RNA, proteins, and metabolites.

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Presently, although still core to the genomics and genetics fields, bioinformatics has become an umbrella for a wider range of biological studies, analyzing various types of biological data and structuring, systematizing, annotating, querying, mining, and visualizing available biological information and a variety of biomedical text records [1–3]. Although drawing a fine line between bioinformatics and some other related fields is difficult because of the increased application of computers, statistics, and mathematics to scientific problem solving and experiments in the life sciences, there should be no misperception about the description and objectives of bioinformatics. Bioinformatics should not be confused with, for example, biometry and biostatistics, the development of DNA computers, or the computerized generation and filing of imaging data.

Bioinformatics should also be differentiated from related scientific fields such as biological computation and computational biology [1, 2]. Biological computation aims to develop biological computers using advances in bioengineering, cybernetics, robotics, and molecular cell biology. In contrast, bioinformatics develops and utilizes computational algorithms to understand and interpret biological processes based on genome-derived molecular sequences and their interactions [2]. Therefore, in many aspects, bioinformatics seems similar in its objectives to computational biology. Computational biology concentrates on building and/or developing theoretical models for biological analyses [1, 2], whereas bioinformatics focuses on providing practical tools to organize and analyze basic genomic, proteomic, and other "omics" data, including sequence analysis and its visualization [1, 2]. Admittedly, computational biology and bioinformatics both target genome data, for example, through multiple sequence alignment and/or genome assembly tools, which makes the boundaries of these two fields less distinguishable if their theoretical versus practical scopes are forgotten [2]. Thus, as mentioned above, the common core aims of bioinformatics are to handle, analyze, and interpret genome-derived molecular sequence data and its organizational principles across broad scales/spectra of comparative, simulative, and evolutionary/phylogenetic perspectives. These tools are applicable and widely used in studies related to genetics, genomics, biochemistry, physiology, biophysics, all agricultural, medical, and environmental sciences, as well as evolution, systems biology, and artificial intelligence [1–10].

For instance, bioinformatics tools such as the comparative analysis of genomic and genetic data and/or signal processing help to interpret and understand molecular and evolutionary processes [9] and interactions from the large volumes of raw data produced in wet-bench experimental molecular biology [1, 2]. In the "omics" fields, bioinformatics helps to sequence and annotate genomes and to identify distinct patterns, mutation profiles, genetic epistasis, gene/protein expression and regulation, and gene ontologies [1, 2, 4, 8–11], and it is instrumental in mining and querying biological data and biomedical literature text [3, 4, 7]. When applied to systems biology [2, 6], bioinformatics is a key instrument for analyzing and cataloguing biochemical/genetic pathways and networks, which helps to integrate pieces of analyzed information to depict and model a full picture of life processes. Application of reconstruction, pattern recognition, folding, simulation, and molecular modeling with bioinformatics tools can identify structural peculiarities and interactions of molecular sequences important for structural biology and medicinal drug design [12, 13]. All of these large-scale, genome-derived molecular sequence analyses of raw "Big Data" are impossible to perform manually [1, 2]. This prompted the biological research community to apply interdisciplinary methods and tools for "Big Data" analysis in combination with modern computing knowledge, which resulted in the emergence of the novel interdisciplinary science of bioinformatics. Let us first take a look at the historic developments in the bioinformatics field.

### **1.1. History of emergence and development**


The term "bioinformatics" was coined by Paulien Hogeweg and Ben Hesper in 1970 [2, 14]. Its meaning was very different from the current description and referred to the study of information processes in biotic systems, a usage analogous to biochemistry and biophysics [14–16]. However, the emergence of bioinformatics traces back to the 1960s. It appeared in concordance with the development of protein sequencing methods for a variety of organisms and with the availability of protein sequences after Frederick Sanger determined the sequence of insulin in the early 1950s [17, 18]. New computer methods to analyze and compare large numbers of protein sequences from different organisms were needed because handling many amino acid sequences manually was impractical. This led to the compilation of the first "Protein Information Resource" (PIR) [1, 19, 20] by Margaret Oakley Dayhoff and her collaborators at the National Biomedical Research Foundation [1]. Dayhoff's team successfully organized the protein sequences into distinct groups and subgroups based on sequence similarity and percent accepted mutation (PAM) matrices [1]. This was published as an atlas of protein sequences [21, 22] that has been widely used in performing protein sequence alignments and database similarity searches [1, 2, 23], and it pioneered the methods of protein sequence alignment and molecular evolution [22]. In the 1970s, Elvin A. Kabat further contributed to bioinformatics development through his extended analysis of comprehensive volumes of antibody sequences, released in collaboration with Tai Te Wu between 1980 and 1991 [2, 24].

With the objective of providing a theoretical background to immunology experiments, in 1974 George Bell and colleagues initiated the collection of DNA sequences into GenBank [1]. During 1982–1992, the first version of GenBank was prepared by Walter Goad's group [1], and these efforts resulted in the development of the presently known and widely used DNA sequence databases GenBank [25], the European Molecular Biology Laboratory (EMBL) database [26], and the DNA DataBank of Japan (DDBJ) [27] in 1979, 1980, and 1984, respectively [1]. The most important development in DNA sequence databases, however, was the incorporation of web-based searching algorithms allowing researchers to find and compare target DNA sequences. The first such developments, the computer software "GENEINFO" and its derivative "Entrez", were created by David Benson, David Lipman, and colleagues [1].

**Figure 1.** Dynamics of bioinformatics-related publications over the past four decades. (A) Unquoted and (B) quoted keyword retrieved scientific publications from PubMed [74].

This software allowed researchers to rapidly search database-indexed sequences and match them against a queried sequence, and it became readily available through the web-based interface of the National Center for Biotechnology Information (NCBI) database [28]. Molecular sequence analysis, comparison, and visualization methods have since been improved, and many different methodologies have contributed to bioinformatics advancements in this direction. Such advancements are exemplified by the development of dot matrix and diagram methods [29], alignment of sequences by dynamic programming [30], finding of local alignments between sequences [31], multiple sequence alignment tools [32–35], prediction of the secondary structures of RNAs [36, 37], determination of evolutionary relationships of sequences [38, 39], and assignment of gene function based on sequence similarity to genes of known function from model organisms [40]. Development of FASTA [41, 42], BLAST [43, 44], and their various modifications [45–47] has further powered the bioinformatics field and greatly improved biological data analysis. Development of tools for predicting the putative protein sequences, structure, and function of proteins/genes based on DNA sequences [48–58], completion of full genome sequences, and building of web-based genome databases for many prokaryotic and eukaryotic organisms [58] have provided great advances in the bioinformatics field. In addition, rapid genome-wide gene expression profiling and analysis opportunities [59–62], biological pathway assignment and identification, and data storing, mining, and querying for large volumes of biological datasets [63–73] have further given bioinformatics unprecedented popularity on the world science scene, as briefly reviewed below.
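The dynamic-programming sequence alignment mentioned above can be sketched in a few lines. The following is an illustrative global-alignment (Needleman-Wunsch) toy with hypothetical match/mismatch/gap scores; real tools such as BLAST and FASTA use substitution matrices (e.g., PAM or BLOSUM) and fast heuristics instead of this exhaustive recurrence.

```python
# Toy global alignment score (Needleman-Wunsch) with illustrative scores:
# match +1, mismatch -1, gap -2. Returns the optimal alignment score only.
def nw_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap          # align prefix of a against gaps
    for j in range(1, cols):
        dp[0][j] = j * gap          # align prefix of b against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            dp[i][j] = max(diag, dp[i-1][j] + gap, dp[i][j-1] + gap)
    return dp[-1][-1]

print(nw_score("GATTACA", "GATTACA"))  # 7: seven matching positions
print(nw_score("ACGT", "ACGA"))
```

A traceback over the same matrix would recover the alignment itself; this sketch keeps only the score to stay minimal.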

Since its emergence as an interdisciplinary scientific field in 1970, bioinformatics research has continuously increased over the past four decades. An unquoted search for the keyword bioinformatics in the PubMed database [74] found nearly 181,000 scientific publications covering the period from 1958 to March 2016. Repeating the search with the quoted keyword found 62,402 scientific publications over the same period, showing that publication efforts began to increase at the end of the 1990s, with a first rise in 2000/2001 followed by significant peaks in 2003/2004 and after 2013 (**Figure 1**). In this introductory chapter, I aim to give a brief highlight of these four decades of development while introducing the chapters presented in this book.
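The quoted versus unquoted PubMed counts described above can be reproduced programmatically via NCBI's public E-utilities `esearch` endpoint. The sketch below only builds the query URLs (no network call is made); the response is XML whose `<Count>` element holds the hit count, and current counts will of course differ from the 2016 snapshot reported in the text.

```python
from urllib.parse import urlencode

# NCBI E-utilities esearch endpoint (public, no key required for light use)
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(term: str, retmax: int = 0) -> str:
    # retmax=0 asks for no record IDs, so the XML reply carries only the
    # total hit count in its <Count> element.
    return EUTILS + "?" + urlencode({"db": "pubmed", "term": term, "retmax": retmax})

unquoted = pubmed_search_url("bioinformatics")    # keyword search
quoted = pubmed_search_url('"bioinformatics"')    # exact-phrase search
print(unquoted)
print(quoted)
```

Fetching either URL (e.g., with `urllib.request.urlopen`) and parsing `<Count>` reproduces the publication-dynamics comparison plotted in **Figure 1**.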

### **2. Bioinformatics in the handling and analysis of genomics data, genome annotation, and expression profiling**

Rapid and reliable determination of DNA sequences, following the introduction of the sequencing techniques of Sanger and Coulson [75] and Maxam and Gilbert [76], provided large-scale DNA sequence data that needed to be analyzed by computerized programming. This prompted the development of efficient bioinformatics methodologies. For example, the seminal efforts to sequence the phage Φ-X174 [2, 77] and *Haemophilus influenzae* [2, 78] genomes using shotgun sequencing techniques generated many thousands of small DNA fragments, ranging from 35 to 900 nucleotides [2], and required the assembly of a complete bacterial genome. The ends of sequenced shotgun clones overlap and can be assembled into the complete genome using computerized similarity search algorithms, although the assembly tasks are challenging because of the requirement for powerful computers with sufficient memory and the issue of multiple gaps in the assembled genome. Genome assembly algorithms are a critical area of bioinformatics research, as fragmented genome sequencing methods have been the core approach for virtually all genomes sequenced to date [1, 2].
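The overlap-and-merge idea behind shotgun assembly can be illustrated with a toy greedy assembler: repeatedly find the pair of reads with the longest suffix-prefix overlap and merge them. This is only a sketch of the principle; production assemblers use overlap or de Bruijn graphs and must cope with sequencing errors and repeats, which this toy ignores.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that matches a prefix of b (>= min_len)."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)  # candidate anchor for an overlap
        if start == -1:
            return 0
        if b.startswith(a[start:]):          # suffix of a really matches prefix of b
            return len(a) - start
        start += 1

def greedy_assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap length, i, j)
        for i in range(len(reads)):
            for j in range(len(reads)):
                if i != j:
                    olen = overlap(reads[i], reads[j])
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            break  # no overlaps left: the remaining reads are separate contigs
        merged = reads[i] + reads[j][olen:]  # glue j onto i, dropping the overlap
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads[0] if len(reads) == 1 else "".join(reads)

print(greedy_assemble(["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG"]))
# reconstructs ATTAGACCTGCCGGAA from three overlapping "shotgun" fragments
```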


Therefore, without bioinformatics tools, it is not possible to contemplate genome sequencing, as present bioinformatics programs such as BLAST/sequence alignments not only provide rapid practical tools to handle, analyze, compare, relate, and visualize DNA sequences but also help with the sequencing process itself. The development of cost-effective, next-generation sequencing (NGS) platforms [79, 80] has helped to decode nearly the entire genome of many different organisms, including human and many other model and specialty organisms, and crop genomes with complex polyploidy levels, within a short period. For example, according to the listings in the Genomes OnLine Database (GOLD) as of March 8, 2016, there were 79,650 genome sequencing projects, of which 8018 were completed, 33,489 were permanent drafts, 35,609 were incomplete, and 1553 were targeted projects [81]. There are 73,000 organisms listed for sequencing, including archaea (1201), bacteria (55,303), eukaryotes (11,990), and viruses (4473). These numbers increase further if the sequencing of 100,000 whole human genomes [82] is added.

Bioinformatics tools are needed for the annotation and prediction of genes from sequenced genomes, which requires computerized approaches because, as mentioned above, genomes are too large to be annotated manually. Bioinformatics-based gene finding and annotation, including a search for protein-coding genes, RNA transcripts, and other functional sequences within a genome, is possible because there are recognizable patterns for start and stop regions, introns, exons, motifs, repeats, and other regulatory, sensory, and signaling regions, with some variation between genes and among organisms. With the availability of, and the need to analyze, the *H. influenzae* genome, the first genome annotation computer program system was designed in 1995 by Owen White [2, 78], providing tools to find genes and identify putative functions of annotated sequences. White's effort was foundational for all currently available gene annotation and prediction software, which keeps being improved periodically [2].
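The pattern-based gene-finding idea described above (scanning for start and stop signals in each reading frame) can be illustrated with a minimal open-reading-frame scanner. This toy handles only the forward strand and the standard start/stop codons; real annotation pipelines also model the reverse strand, splice sites, codon usage, and homology evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 2):
    """Return (start, end) spans of ATG..stop ORFs on the forward strand, all 3 frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i+3] == "ATG":                  # start codon found
                j = i + 3
                while j + 3 <= len(seq) and seq[j:j+3] not in STOP_CODONS:
                    j += 3                           # walk codon by codon
                if j + 3 <= len(seq) and (j - i) // 3 >= min_codons:
                    orfs.append((i, j + 3))          # span includes the stop codon
                i = j                                # resume scanning after this ORF
            i += 3
    return orfs

print(find_orfs("CCATGAAATGATAG"))  # one ORF: ATG AAA TGA in frame 2
```

Real gene finders replace this rigid pattern match with probabilistic models (e.g., hidden Markov models) trained on known genes.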

Bioinformatics tools are very important for analyzing gene and protein expression profiles. Large-scale sequencing of cDNA libraries has generated large volumes of serial analysis of gene expression (SAGE), expressed sequence tag (EST), massively parallel signature sequencing (MPSS), transcriptome profiling or RNA-seq, and multiplexed in-situ hybridization (microarray) profile data [83–95]. All of these gene expression techniques are extremely noise-prone and/or subject to bias in the biological measurement, which requires the application of statistical tools to separate signal from noise in high-throughput gene expression studies. In this context, the chapter by Zhao et al. in this book reviews and discusses the main tools and algorithms currently available for RNA-seq data analyses, covering rapidly evolving RNA-seq technologies such as stranded RNA-seq, targeted RNA-seq, and single-cell RNA-seq. Moreover, Sripathi et al. comprehensively discuss transcriptome profiling, RNA-seq, and micro-RNA expression studies in cotton (*Gossypium* species), whereas Younis et al. present a chapter on skin microbiome, transcriptome, and microarray data analyses. Readers will also find an interesting chapter on bioinformatics challenges and tools for hepatitis B genome analysis, written by Bell and Kramvis, which highlights features of this small-genome virus for bioinformatics analysis.

Similarly, protein microarrays and high-throughput mass spectrometry require bioinformatics analysis to identify proteins through complex sequence similarity searches against protein sequence databases [96–103]. Bioinformatics is a great help in the analysis of gene regulation through searching and comparing sequence motifs related to promoters and other regulatory elements. Using bioinformatics tools and sequence motifs/regulatory elements, genes can be clustered by function and their co-expression characteristics determined. Examples of such bioinformatics tools include k-means clustering, hierarchical clustering, consensus clustering methods such as Bi-CoPaM, and self-organizing maps (SOMs), which can identify functionally active sequences from very complex microarray datasets [104–107].
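As an illustration of the clustering approaches named above, here is a minimal k-means over toy expression vectors, using only the standard library. The data, the naive first-k initialization, and the fixed iteration count are illustrative assumptions; real expression analyses use dedicated packages, better initialization (e.g., k-means++), and normalized data.

```python
def kmeans(points, k, iters=50):
    # Naive initialization: first k profiles become the initial centroids.
    centers = [list(p) for p in points[:k]]
    clusters = []
    for _ in range(iters):
        # Assign each expression profile to its nearest centroid (squared distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[idx].append(p)
        # Recompute each centroid as the per-dimension mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centers[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return clusters

# Toy expression profiles: four "genes" measured under three conditions,
# two low-expression and two high-expression genes.
genes = [(0.1, 0.2, 0.1), (0.0, 0.1, 0.2), (5.0, 5.1, 4.9), (5.2, 4.8, 5.0)]
clusters = kmeans(genes, 2)
print(sorted(len(c) for c in clusters))  # two clusters of two genes each
```

The same grouping-by-profile-similarity principle underlies the hierarchical, consensus, and SOM methods cited in the text, which differ mainly in how cluster membership is determined.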

specialty organisms, or crop genomes with complex polyploidy levels within a short period. For example, according to the listings in the Genomes OnLine Database (GOLD) as of March 8, 2016, there were 79,650 genome sequencing projects of which 8018 were completed projects, 33,489 were permanent drafts, 35,609 were incomplete projects, and 1553 were targeted projects [81]. There are 73,000 organism, including archaea (1201), bacteria (55,303), eukaryotes (11,990), and viruses (4473), listed for sequencing. These numbers should be increased if the

Bioinformatics tools are needed in annotation and prediction of genes from sequenced genomes that requires computerized approaches because genomes are large to be manually annotated as mentioned above. Bioinformatics-based gene finding and annotation including a search for protein-coding genes, RNA transcripts, and other functional sequences within a genome is possible because there are patterns to recognize the start, stop regions, introns, exons, motifs, repeats, and other regulatory and sensory as well as signaling regions with some variations between genes and among organisms. With the availability and need for analysis of *H. influenza* genome, the first genome annotation computer program system was designed in 1995 by Owen White [2, 78], which provided tools to find the genes and identify putative functions of annotated sequences. White's effort was basic for all currently available gene

Bioinformatics tools are very important to analyze gene and protein expression profiles. Largescale sequencing of cDNA libraries has generated large volumes of serial analysis of gene expression (SAGE), expressed sequences tags (ESTs), massively parallel signature sequencing (MPSS), transcriptome profiling, or RNA-Seq, and various applications of multiplexed in-situ hybridization (microarray) profile data [83–95]. All of these gene expression techniques are extremely noise-prone and/or subject to bias in the biological measurement, which requires application of statistical tools to separate signal from noise in high-throughput gene expression studies. In this context, chapter by Zhao et al. in this book reviews and discusses the main tools and algorithms currently available for RNAseq data analyses, discussing rapidly evolving RNAseq technologies such as stranded RNAseq, targeted RNAseq, and single cell RNA-seq. Moreover, Sripathy et al. have comprehensively discussed transcriptome profiling, RNAseq, and micro-RNA expression studies in cotton (*Gossypium* species), whereas Younis et al. present a chapter on skin microbiome, transcriptome, and microarray data analyses. In this book, readers can find an interesting chapter on bioinformatics challenges and tools for Hepatitis B genome analysis written by Bell and Kramvis, which highlight features of this small genome

Similarly, protein microarrays and high-throughput mass spectrometry require bioinformatics analysis to identify proteins through complex sequence similarity searches against protein sequence databases [96–103]. Bioinformatics is also of great help in analyzing gene regulation, through searching for and comparing sequence motifs related to promoters and other regulatory elements. Using bioinformatics tools together with such sequence motifs and regulatory elements, genes can be clustered by function and their co-expression characteristics determined. Examples of such bioinformatics tools include clustering algorithms such as k-means and hierarchical clustering.
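As an illustration of how such clustering tools group genes by co-expression, the toy k-means sketch below (k = 2) partitions expression profiles into co-expressed groups. The gene names and expression vectors are invented for demonstration.

```python
# Toy k-means clustering (k = 2) of gene-expression profiles, the kind of
# grouping used to cluster genes by co-expression. Data are invented.

def dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, centroids, rounds=5):
    """Repeatedly assign genes to the nearest centroid, then recompute
    each centroid as the mean of its members. (This toy data set never
    leaves a cluster empty, so no empty-cluster handling is needed.)"""
    clusters = []
    for _ in range(rounds):
        clusters = [[] for _ in centroids]
        for gene, vec in profiles.items():
            nearest = min(range(len(centroids)),
                          key=lambda k: dist(vec, centroids[k]))
            clusters[nearest].append(gene)
        centroids = [[sum(profiles[g][i] for g in members) / len(members)
                      for i in range(len(centroids[0]))]
                     for members in clusters]
    return clusters

profiles = {"geneA": [9, 8, 9], "geneB": [8, 9, 8],
            "geneC": [1, 2, 1], "geneD": [2, 1, 2]}
clusters = kmeans(profiles, centroids=[[9, 8, 9], [1, 2, 1]])
print(clusters)  # → [['geneA', 'geneB'], ['geneC', 'geneD']]
```

Hierarchical clustering follows the same distance-based logic but merges the closest pairs bottom-up instead of assigning to fixed centroids.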


8 Bioinformatics - Updated Features and Applications



Beyond these, bioinformatics plays a major role in cataloguing the functional elements of sequenced genomes identified with next-generation DNA-sequencing technologies and genomic tiling arrays. This is best exemplified by the "Encyclopedia of DNA Elements (ENCODE)" project [108], developed by the National Human Genome Research Institute, which describes the functional elements of the human genome. Thanks to bioinformatics and the application of its tools, the genomes, genes, and protein sequences of different organisms can be rapidly compared, searched, and interpreted. In addition, mutations can be identified that help to diagnose many complex human and plant diseases and crop traits, and to interpret complex evolutionary processes such as genome duplication, polyploidization, adaptation, and speciation.
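In its simplest form, the mutation-identification idea reduces to comparing a sample sequence against a reference, position by position. A toy sketch with invented sequences:

```python
# Toy point-mutation (SNP) caller: compare a re-sequenced fragment against
# a reference and report each base change. Sequences are invented.

def point_mutations(ref, sample):
    """List (position, ref_base, sample_base) differences, 1-based."""
    return [(i + 1, r, s)
            for i, (r, s) in enumerate(zip(ref, sample)) if r != s]

ref    = "ATGGCGTACCTT"
sample = "ATGGCGGACCTA"
print(point_mutations(ref, sample))  # → [(7, 'T', 'G'), (12, 'T', 'A')]
```

Real variant callers must first align reads to the reference and model sequencing error; this sketch assumes the sequences are already aligned and error-free.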

### **3. Structural bioinformatics: molecular folding, modeling, and design**

One of the most widely used applications of bioinformatics is the identification of three-dimensional protein structures, together with molecular modeling and folding, to predict the possible function of proteins or other molecular structures, model the behavior of molecules, fold a molecule into its native, biologically functional three-dimensional structure, and design biomedical drugs for many complex human diseases. It supports *de novo* protein design, enzyme design, protein-ligand/drug docking, protein-peptide interaction analysis, and structure prediction of biological macromolecules and macromolecular complexes [1, 2, 109].

From the coding DNA sequence, the primary structure of a protein can be easily determined, which is vital to understanding its function. Further, based on homology patterns in the primary structure of proteins and using homology modeling, important structural formations and interaction sites with other proteins can be determined. This makes it possible to reliably predict the structure of a protein from the known structure of homologous protein(s). Moreover, identifying the secondary, tertiary, and quaternary structures of proteins is very important for understanding their function. The exact three-dimensional structure is essential for correct function, and a failure to fold into the native structure generally produces inactive or misfolded proteins that can be toxic [108]. The bioinformatics of protein folding covers (1) the energy landscape of protein folding and (2) modeling of protein folding approaches [12, 13, 109].
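The first step mentioned here, deriving a protein's primary structure from its coding DNA sequence, can be sketched with a codon table. To stay compact, only the codons used by the toy sequence are included.

```python
# Deriving a protein's primary structure (amino-acid sequence) from its
# coding DNA. The table holds only the codons this toy sequence uses.

CODONS = {"ATG": "M", "GGC": "G", "TGC": "C", "AAA": "K", "TAA": "*"}

def translate(cds):
    """Translate a coding sequence codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODONS[cds[i:i + 3]]
        if aa == "*":              # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGGCTGCAAATAA"))  # → MGCK
```

This one-to-one codon lookup is why the primary structure is "easily determined"; predicting the folded three-dimensional structure from that string is the hard problem the surrounding text describes.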

One freely available, leading web server/stand-alone software tool for automated protein structure prediction and structure-based functional annotation is the "Iterative Threading ASSEmbly Refinement" (I-TASSER), which "first generates full-length atomic structural models from multiple threading alignments and iterative structural assembly simulations followed by atomic-level structure refinement" [110]. Using I-TASSER, all the above-mentioned functional and structural characteristics of proteins, including ligand-binding sites, enzyme commission numbers, and gene ontology terms, can be explored on a comparative scale [110, 111].

Molecular modeling through molecular mechanics and/or quantum chemistry approaches is a key bioinformatics approach to studying the behavior of molecules. These methods are routinely used to investigate the structure, dynamics, surface properties, and thermodynamics of inorganic, biological, and polymeric systems. They help to explore conformational changes associated with biomolecular function, and the molecular recognition of proteins and membrane complexes. Protein folding, the identification of enzyme catalytic sites, and protein stability can all be studied using molecular modeling, and a vast range of bioinformatics tools for modeling and designing biomolecules is available [110–112]. In this book, the chapter by Leong et al. presents bioinformatics modeling and tools for biological membranes using molecular dynamics simulations and all-atom, united-atom, and coarse-grained membrane models of lipids and proteins. In addition, Filntisi et al. present a computational method for the generation of antibody-drug conjugates through site-specific cysteine conjugation, using structural prediction methods based on PDB files of a drug, linker, and antibody. Moreover, Bórquez and González-Billault contribute an interesting chapter on computational algorithms for predicting kinase-substrate relationships in protein kinases; the chapter compares prediction tools and methods and discusses improving substrate prediction with contextual information.

### **4. Biological networks and systems biology**

Watts and Strogatz in 1998 [113, 114] and Barabási and Albert in 1999 [115–117] fueled the view that complex systems can be represented as networks in which components are nodes linked through their interactions (i.e., edges). The properties of nodes and edges form the network topology. This approach has been widely applied across many scientific fields, including bioinformatics, resulting in the construction of large-scale biological networks denoted as "omes," such as the biome, interactome, and microbiome [2, 6].

The molecular sequence analysis, prediction and annotation, and molecular modeling approaches highlighted above are also the core for building, organizing, and systematizing biological networks of molecules (e.g., metabolic and protein-protein interaction networks) and the genetic and biochemical pathways of complex cellular processes, including reception, signal transduction, gene regulation, and gene co-expression. Such molecular networks integrate many different data types, including DNA sequences, regulatory RNAs, proteins, secondary metabolites, gene expression data, and other small molecules, which may all be connected physically and functionally. Constructing and organizing such physically and functionally connected molecular networks of cellular processes can be achieved only by combining simulative, iterative, and model-oriented bioinformatics approaches. These biological networks are useful for analyzing and visualizing the complex connections of cellular processes, and they help in understanding other biological networks such as neuronal networks, food webs, and between/within-species interaction networks, which are a central component of modern systems biology [2, 6]. Examples of "omes"-related networks include the Kyoto Encyclopedia of Genes and Genomes (KEGG), the BioCyc database collection, the BRaunschweig ENzyme DAtabase (BRENDA), Reactome, the Comparative Toxicogenomics Database, and many other biological networks [118]. Some biological network databases and their utilization in plant genomics/epigenomics are discussed in the chapters by Sripathi et al. and Rahman et al. in this book.
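The node-and-edge representation described above can be sketched in a few lines. The edge list below names some interaction partners of p53 purely for illustration; it is not drawn from any of the cited databases.

```python
# A molecular interaction network as nodes and edges, with node degree as
# a simple topology measure. The edge list is illustrative only.

from collections import defaultdict

edges = [("P53", "MDM2"), ("P53", "BAX"), ("P53", "CDKN1A"), ("MDM2", "MDM4")]

adj = defaultdict(set)
for a, b in edges:                 # undirected edges: store both directions
    adj[a].add(b)
    adj[b].add(a)

degree = {node: len(nbrs) for node, nbrs in adj.items()}
hub = max(degree, key=degree.get)  # the highest-degree node, a network "hub"
print(hub, degree[hub])            # → P53 3
```

Topological measures such as degree are the starting point for identifying hubs and modules in the large-scale "ome" networks the text describes.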

### **5. Databases**


An organized collection of data is referred to as a database, which aims to collect schemas, tables, queries, reports, images, and other objects. Access to the information in databases is provided by an integrated set of computer software referred to as a "database management system" (DBMS) [119]. The DBMS allows users to access all of the data contained in the databases. It provides general functions for the definition, entry, storage, update, administration, and retrieval of large quantities of information in an organized way, which requires modeling (hierarchical and network models), clustering, query languages and query optimization, and visualization algorithms [1, 2, 119].
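A miniature DBMS session using Python's built-in sqlite3 module illustrates the definition, entry, and retrieval functions described above. The table layout and records are invented for illustration.

```python
# Definition, entry, and query of records through a DBMS, using Python's
# built-in sqlite3. Table layout and records are invented examples.

import sqlite3

db = sqlite3.connect(":memory:")   # an in-memory database for the demo
db.execute("CREATE TABLE genes (name TEXT, organism TEXT, length INTEGER)")
db.executemany("INSERT INTO genes VALUES (?, ?, ?)",
               [("rbcL", "A. thaliana", 1428),
                ("GhMYB25", "G. hirsutum", 1077),
                ("HBx", "Hepatitis B virus", 465)])

# A declarative query: genes longer than 1 kb, longest first
rows = db.execute("SELECT name FROM genes WHERE length > 1000 "
                  "ORDER BY length DESC").fetchall()
long_genes = [name for (name,) in rows]
print(long_genes)  # → ['rbcL', 'GhMYB25']
```

The point of the DBMS layer is exactly this separation: users state *what* they want in a query language, and the system handles storage, indexing, and retrieval.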

The development of databases, therefore, depends significantly on bioinformatics tools, advances, research, and applications. A large number of different types of databases are available, covering all aspects of biological data storage and organization. Some aforementioned databases, such as GenBank, EMBL, and DDBJ, are primary nucleotide sequence databases. There are meta-databases that incorporate data compiled from multiple other databases, such as Entrez, mGen, and Metascape. Others are specialized databases, for example, those specific to an organism, such as TAIR, the p53 Knowledgebase, the Plant Alternative Splicing Database (PASD), and the Plant Secretome and Subcellular Proteome Knowledgebase (PlantSecKB) [119]. Databases vary in their data definition, usage, format, and access types. In this book, the chapter by Kadam et al. specifically describes databases and bioinformatics algorithms related to allergen informatics, discussing the concepts of allergen bioinformatics and the key areas for potential development in allergology, whereas Bell and Kramvis highlight public sequence databases for the Hepatitis B virus. Readers can also find a comprehensive discussion of bioinformatics resources, including databases for plant "omics," written by Rahman et al.

### **6. Software, analysis tools, services, and workflow**

As mentioned above, the astronomical accumulation of genomic, proteomic, and metabolomic data, together with their expression profiles, annotation, storage, organization, systematization, and integration into biological networks and database systems, and their wide utilization by the research community, *a priori* required computer programming algorithms, analysis tools, services, and workflow systems. Therefore, software and analysis tools, and bioinformatics services and workflows, have been main fields and core targets of bioinformatics since its emergence. Through the contributions of various bioinformatics companies and public institutions, bioinformatics software and tools started out as simple command-line tools but later evolved into more complex graphical programs, standalone packages, and web services. Since the development of the first bioinformatics software and analysis tools for molecular sequence evaluation in the early 1980s, many free and open-source software tools have been developed, and they continue to grow and improve with the advances made in the genomic sciences [2, 120].

The main driving forces for the current and future development of bioinformatics software and tools have been the past decade's advances in genome decoding technologies, the accumulation of large volumes of biological data and the consequent need for their analysis, and advances in computer technologies, graphics, visualization, and molecular modeling and networking techniques. Moreover, the availability of various open-source codes, shared object models, and community-supported plug-ins has facilitated gathering innovative ideas from the community and performing innovative *in silico* experiments on existing "Big Data." All of this has created golden opportunities for research groups and bioinformatics companies to work on, experiment with, and develop a new generation of bioinformatics software and tools that are user friendly and capable of extended, integrated analyses with better visualization and graphical outputs. The range of open-source software packages includes titles such as UGENE, EMBOSS, GenGIS, GENtle, MOTHUR, BioPerl, PathVisio, BioJava, GenoCAD, Biopython, GeWorkbench, GenomeSpace, Bioclipse, .NET Bio, Apache Taverna, BioJS, Bioconductor, and BioRuby [121, 122].

The development of sharing models and web access tools is also an important bioinformatics objective, allowing users to access bioinformatics tools over the internet, from their own computer systems to the main computing resources on servers in other parts of the world. The Simple Object Access Protocol (SOAP) [123] and Representational State Transfer (REST) [124–126] are two approaches used to provide such web services. SOAP is a standards-based web service access protocol, originally developed by Microsoft. REST, which provides very simple web service access, was developed to fix problems with SOAP [127]. Both operate over the HTTP protocol, but each has its own issues and challenges, and they differ in messaging patterns, rules, architectural style, and flexibility. The main advantage for end users is that they do not have to deal with software and database maintenance overheads [127].
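The stylistic difference between the two approaches can be seen without contacting any server: a REST call is essentially a URL, while SOAP wraps the same operation in an XML envelope. The endpoint, operation, and parameter names below are hypothetical, not any real service's API.

```python
# Contrasting REST and SOAP request shapes for the same (hypothetical)
# "fetch a sequence" operation. No network call is made.

from urllib.parse import urlencode

params = {"accession": "NC_003977", "format": "fasta"}

# REST: the resource is addressed by a URL; an HTTP GET would retrieve it
rest_url = "https://api.example.org/sequence?" + urlencode(params)

# SOAP: the same operation expressed as an XML envelope sent by HTTP POST
soap_body = (
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    "<soap:Body><getSequence>"
    "<accession>NC_003977</accession><format>fasta</format>"
    "</getSequence></soap:Body></soap:Envelope>"
)

print(rest_url)
# → https://api.example.org/sequence?accession=NC_003977&format=fasta
```

The contrast in message shape is what the text means by differing "messaging patterns" and "architectural style": REST addresses resources, SOAP invokes operations.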

There are several basic bioinformatics services, for example, "Sequence Search Services" (SSS), "Multiple Sequence Alignment" (MSA), and "Biological Sequence Analysis" (BSA) [2, 128]. These web service-based bioinformatics analysis resources comprise stand-alone and web-interface data analysis tools as well as integrative, distributed, and extensible bioinformatics workflow management systems (BWMS). BWMSs are designed specifically to compose and execute a series of interactive computational or data manipulation steps (i.e., a workflow) in a bioinformatics analysis. Such systems provide interactive analysis of biological data, build specific workflows for the analysis, enable real-time visualization of analysis outputs, and simplify the sharing and reuse of workflows between scientists. Platforms providing such services include Galaxy, UGENE, and Taverna [2, 121]. Several chapters of this book cover bioinformatics software, web-based analysis tools, and bioinformatics services for membrane analysis (see Leong et al.), plant science and crop genomics (see the chapters by Rahman et al. and Sripathi et al.), and medicine, viral genome analysis, and drug design (see the chapters by Younis et al., Bell and Kramvis, and Filntisi et al.).
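The workflow idea, composing a series of steps in which each step's output feeds the next, can be sketched as follows. The three steps are toy stand-ins for real analysis tools and do not reflect how Galaxy or Taverna are implemented.

```python
# A minimal workflow in the spirit of a BWMS: an ordered series of steps,
# each consuming the previous step's output. Steps are toy stand-ins.

def trim_reads(reads):
    """Step 1: drop reads shorter than 6 bases."""
    return [r for r in reads if len(r) >= 6]

def align(reads):
    """Step 2: toy 'alignment' indexing each read by its first 3 bases."""
    return {r[:3]: r for r in reads}

def report(hits):
    """Step 3: summarise the hit keys in sorted order."""
    return sorted(hits)

workflow = [trim_reads, align, report]

data = ["ATGGCA", "ATG", "GGCATT"]
for step in workflow:              # execute the pipeline step by step
    data = step(data)
print(data)  # → ['ATG', 'GGC']
```

Real workflow systems add what this sketch omits: a graphical composer, provenance tracking, distributed execution, and the ability to share the workflow itself.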

### **7. Text mining**


Part of the objectives of bioinformatics research and application is the use of computational algorithms and bioinformatics tools to collect, organize, and structure the growing body of biomedical literature, allowing scientists to query, mine, read, and synthesize the publications specific to their research interests [2–4, 7, 129, 130]. Biomedical literature and text mining are therefore very important for scientific development and innovation, and for integrating and applying discoveries in society, through information extraction (IE) and assessment of the relationships between publications [3, 4]. Analysis of the world literature demonstrates that more than 80% of text data remains unstructured, which makes it challenging to read every paper and results in disjointed sub-fields of research [3]. Biomedical literature text mining uses a variety of text- and data-mining tools, applying techniques such as data clustering, visualization and navigation, information retrieval and extraction, and text categorization and summarization [3]. The use of IE and "Natural Language Generation and Understanding" (NLG and NLU), which have tokenizing, morphological or lexical, and syntactic analysis components, helps to build structured text and to extract, collect, and organize structured information [129, 130]. Pattern recognition and matching, such as the recognition of biological abbreviations, terms, and interactions, are important methods in text mining [2–4].
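Recognition of biological abbreviations, one of the pattern-matching methods just mentioned, can be sketched with a regular expression for the common "long form (ABBR)" convention. This is a deliberately simplified illustration, not a production-grade extraction method.

```python
# Pattern recognition of biological abbreviations of the form
# "long form (ABBR)" in free text: a basic text-mining step.

import re

text = ("Serial analysis of gene expression (SAGE) and massively parallel "
        "signature sequencing (MPSS) both profile the transcriptome, and "
        "expressed sequence tags (ESTs) remain widely used.")

# An abbreviation here: a parenthesised run of 2-10 capitals/digits,
# with an optional plural 's'
pattern = re.compile(r"\(([A-Z][A-Z0-9]{1,9}s?)\)")
abbrs = pattern.findall(text)
print(abbrs)  # → ['SAGE', 'MPSS', 'ESTs']
```

Production systems additionally match each abbreviation back to its long form and disambiguate collisions, but the regular-expression core is the same kind of pattern matching the text describes.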

### **8. Education**

The advances of the life sciences and of high-throughput biology, in particular the "omics" disciplines, the scale and complexity of "Big Data," and the growing demand for specialists with multilingual and cross-field expertise to understand and solve multidisciplinary scientific problems underlie a great need for training and education in bioinformatics. Bioinformatics training and education aim to create, collect, deliver, and share educational and training materials and techniques, as well as to develop university degree-program curricula in bioinformatics. The goal is to prepare scientists and specialists who can use modern bioinformatics tools, with their sophisticated operating systems, software and algorithms, and database/networking technologies, to handle, analyze, interpret, and publish high-throughput, complex biological data. This is a great bottleneck and a critical need of the current life sciences and of the bioinformatics field, especially in developing countries, as analyzed, for example, in recent reports for African [82] and Central American [131] countries.

To address this, the bioinformatics research community has made specific efforts to develop local and global platforms for bioinformatics training and education. Examples include the "Bioinformatics Training Network" (BTN) [132] and "The Global Organization for Bioinformatics Learning, Education, and Training" (GOBLET) [133], which provide community educational and training resources for bioinformatics trainers and trainees. As an outcome of a European 7th Framework grant, BTN aimed to develop and share educational materials, short courses, and training delivery methods, and to discuss the challenges, issues, and requirements of bioinformatics training [132]. GOBLET continues similar efforts beyond Europe, aiming to coordinate them at the global scale with a concerted strategy and within the frame of a single, dedicated foundation, although this requires much time, focused strategic effort, and modern innovative approaches [133].

The Swiss Institute of Bioinformatics training portal [134] also provides online courses for software platforms designed to teach bioinformatics concepts and methods, including Rosalind [135], and there are open-access videos and slides from the Canadian Bioinformatics Workshops [136]. Similarly, many large bioinformatics conferences and seminars contribute to training and education in bioinformatics, such as Intelligent Systems for Molecular Biology (ISMB), the European Conference on Computational Biology (ECCB), Research in Computational Molecular Biology (RECOMB), and the annual Bioinformatics Open Source Conference (BOSC) of the non-profit Open Bioinformatics Foundation [2, 128]. Like the public bioinformatics databases, the MediaWiki engine with the WikiOpener extension, extensively referenced in this chapter, also contributes to bioinformatics training and education by gathering research materials and descriptions of tools that can be accessed and updated by all experts in the field [128].

With the specific objectives of developing bioinformatics research and application, integrating it into genomics research and into training and education, and preparing well-qualified, new-generation scientists for the life sciences, we established a dedicated organization, the Center of Genomics and Bioinformatics, in the developing country of Uzbekistan [137]. As in other developing countries, there are many challenges and limitations in funding and in accessing sophisticated bioinformatics tools and computer operating systems, as well as a lack of sufficient experience to carry out bioinformatics research and resource development. Our first-step goal, however, is to integrate genomics and bioinformatics curricula into the higher education system of Uzbekistan, develop training and educational materials, provide basic training and research practice to university students and specialists in the biological fields, and establish international collaborations in this direction. The long-term objective is to apply genomics and bioinformatics approaches efficiently and broadly to all areas of the life sciences at the national and regional levels, which would contribute to the development of the biological sciences in Central Asia. Efforts are ongoing to establish international collaborations [138] and to provide training and education at both the national and regional levels.

### **9. Conclusions and future perspectives**

Bioinformatics has become an essential interdisciplinary scientific field of the life sciences, supporting the "omics" fields and technologies and, above all, handling and analyzing "omes" data. The accumulation of high-throughput biological data due to technological advances in the "omics" fields has required and prioritized the use of bioinformatics resources, research, and applications for the analysis of complex and ever-growing "Big Data" volumes, which would be impractical and useless without bioinformatics. Therefore, as highlighted herein, there is a critical need to prepare well-qualified, new-generation scientists with integrated knowledge, multilingual ability, and cross-field experience, who are capable of using sophisticated operating systems, software and algorithms, and database/networking technologies to handle, analyze, and interpret the increasing volumes of high-throughput, complex biological data.

Community resources and globally coordinated bioinformatics training and education platforms, as well as research conferences, workshops, short online trainings, and web-based educational courses and materials, are available to work toward this goal. However, there is an urgent need to develop bioinformatics education and training, particularly in developing countries, which requires innovative platforms, training techniques, better funding, web and network access, and high-performance computing systems.

On the research side, bioinformatics tools need to be improved for the analysis of the growing body of high-throughput pangenomics, metagenomics, proteomics, and metabolomics data. Effective tools are needed to perform better genome assembly and annotation with high accuracy; this, however, requires improving the quality of sequenced genomes to remove gaps, and sequencing more genome representatives, sub-genomes, polyploid species, single-cell genomes, and specific tissues, which would generate the information needed to refine and correct bioinformatics algorithms and programming approaches.

The use of third-generation sequencing approaches and platforms, as well as efforts to sequence, for example, 1,000 or 100,000 human genome representatives [82] or the transcriptomes/exomes of 1,000 distinct plant species (e.g., 1KP) [139], will ultimately improve and advance bioinformatics analysis tools. These efforts will also help to improve orthologous gene identification tools, which currently need attention [120]. There is a great need for sampling and handling diverse strains in pangenomic analysis, for integrating prokaryotic genome-organization frameworks (GOFs), and for integrating non-coding RNAs, pseudogenes, and epigenetic elements into bioinformatics annotation and ontology tools and software [120]. Sequenced genome data should be made more functional and integrated through the construction of more organized, user-friendly, cell-wide biological networks and metabolic pathways [140] with better visualization effects and graphical outputs [120], and through knowledge base (KB) construction [141]. This, however, requires the development of real-time imaging systems and high-throughput phenotyping ("phenomics") tools that would help to efficiently determine biologically meaningful associations between genomic and phenotypic data, advancing the translational sciences, personal genomics, and personalized medicine [7] and/or agriculture [142].

### **Acknowledgements**


time, focused strategic efforts, and modern innovative approaches [133].

updated by all experts in the field [128].

14 Bioinformatics - Updated Features and Applications

**9. Conclusions and future perspectives**

I thank Academy of Sciences of Uzbekistan and Committee for Coordination Science and Technology Development of Uzbekistan, the Office of International Research Programs (OIRP) of the United States Department of Agriculture (USDA)—Agricultural Research Service (ARS) and U.S. Civilian Research & Development Foundation (CRDF) for research Grants FA-F5- T030, FA-A6-T081, FA-A6-T085, I-2015-6-15/2, I5-FQ-0-89-870, P120, P120A, P121, P121B, UZB- TA-31016, UZB-TA-31017, and UZB-TA-2992, which have been the key factors for development of plant genomics and bioinformatics in Uzbekistan. I greatly acknowledge the Uzbekistan government support and investments/guide from Academy of Sciences of Uzbekistan, Ministry of Agriculture and Water Resources of Uzbekistan, Cotton Industry Joint Stock Company of Uzbekistan, Ministry Foreign Economic Relations, Investments and Trade of Uzbekistan, USDA-ARS, and Texas A&M University for establishment of Center of Genomics and Bioinformatics in Uzbekistan. I also thank Prof. Gilbert S. Omenn, Center for Computational Medicine & Bioinformatics, University of Michigan, USA for critical reading of this introductory chapter, and Mr. Mirzakamol Ayubov and Mr. Muhammad Mirzahme‐ dov, Center of Genomics and Bioinformatics, Uzbekistan, for their technical assistance while preparing this chapter material.

### **Author details**

Ibrokhim Y. Abdurakhmonov

Address all correspondence to: ibrokhim.abdurakhmonov@genomics.uz and genomics@uzsci.net

Center of Genomics and Bioinformatics, Academy of Science of the Republic of Uzbekistan, Tashkent, Uzbekistan

### **References**


[1] Mount DW. Bioinformatics: sequence and genome analysis. 2nd ed. New York: Cold Spring Harbor Laboratory Press; 2004. 692 p. doi:10.1086/431054

[2] Bioinformatics [Internet]. 2016. https://en.wikipedia.org/wiki/Bioinformatics. Accessed: 2016-03-10

[3] Vijaya S, Radha R. Text mining in biosciences: a review. International Journal of Scientific & Engineering Research. 2015;6:769–776.

[4] Raza K. Application of data mining in bioinformatics. Indian Journal of Computer Science and Engineering. 2010;1:114–118.

[5] Shah VA, Rathod DN, Basuri T, Modi VS, Parmar IJ. Applications of bioinformatics in pharmaceutical product designing: a review. World Journal of Pharmacy and Pharmaceutical Sciences. 2015;4:477–493.

[6] Ma'ayan A. Introduction to network analysis in systems biology. Science Signaling. 2011;4:tr5. doi:10.1126/scisignal.2001965

[7] Soualmia LF, Lecroq T. Bioinformatics methods and tools to advance clinical care. Yearbook of Medical Informatics. 2015;10:170–173. doi:10.15265/IY-2015-026


[22] Eck RV, Dayhoff MO. Evolution of the structure of ferredoxin based on living relics of primitive amino acid sequences. Science. 1966;152:363–366. doi:10.1126/science.152.3720.363

[23] Moody G. Digital code of life: how bioinformatics is revolutionizing science, medicine, and business. Chichester: Wiley; 2004. 400 p.

[24] Johnson G, Wu TT. Kabat Database and its applications: 30 years after the first variability plot. Nucleic Acids Research. 2000;28:214–218. doi:10.1093/nar/28.1.214

[25] GenBank [Internet]. 2016. http://www.ncbi.nlm.nih.gov/genbank. Accessed: 2016-03-10

[26] The European Molecular Biology Laboratory [Internet]. 2016. http://www.embl.org. Accessed: 2016-03-10

[27] DNA DataBank of Japan [Internet]. 2016. http://www.ddbj.nig.ac.jp. Accessed: 2016-03-10

[28] The National Center of Biotechnology Information (NCBI) [Internet]. 2016. http://www.ncbi.nlm.nih.gov. Accessed: 2016-03-10

[29] Gibbs AJ, McIntyre GA. The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. European Journal of Biochemistry. 1970;16:1–11. doi:10.1111/j.1432-1033.1970.tb01046.x

[30] Pearson WR, Miller W. Dynamic programming algorithms for biological sequence comparison. Methods in Enzymology. 1992;210:575–601. doi:10.1016/0076-6879(92)10029-D

[31] Smith TF, Waterman MS. Identification of common molecular subsequences. Journal of Molecular Biology. 1981;147:195–197. doi:10.1016/0022-2836(81)90087-5

[32] Johnson MS, Doolittle RF. A method for the simultaneous alignment of three or more amino acid sequences. Journal of Molecular Evolution. 1986;23:267–278. doi:10.1007/BF02115583

[33] Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research. 1994;22:4673–4680. doi:10.1093/nar/22.22.4673

[34] Notredame C, Higgins DG, Heringa J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology. 2000;302:205–217. doi:10.1006/jmbi.2000.4042

[35] Rius J, Cores F, Solsona F, van Hemert JI, Koetsier J, Notredame C. A user-friendly web portal for T-Coffee on supercomputers. BMC Bioinformatics. 2011;12:150. doi:10.1186/1471-2105-12-150

[36] Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research. 1981;9:133–148. doi:10.1093/nar/9.1.133


[49] Loh SK, Low ST, Mohamad MS, Deris S, Kasim S, Wen CY, Wardani AK. A review of software for predicting gene function. International Journal of Bio-Science and Bio-Technology. 2015;7:57–70. doi:10.14257/ijbsbt.2015.7.2.06

[50] Höglund A, Kohlbacher O. From sequence to structure and back again: approaches for predicting protein-DNA binding. Proteome Science. 2004;2:3. doi:10.1186/1477-5956-2-3

[51] Liu Y, Wei L, Batzoglou S, Brutlag DL, Liu JS, Liu XS. A suite of web-based programs to search for transcriptional regulatory motifs. Nucleic Acids Research. 2004;32:204–207. doi:10.1093/nar/gkh461

[52] Nagarajan V, Elasri MO. Structure and function predictions of the Msa protein in Staphylococcus aureus. BMC Bioinformatics. 2007;8:S5. doi:10.1186/1471-2105-8-S7-S5

[53] Pavlopoulou A, Michalopoulos I. State-of-the-art bioinformatics protein structure prediction tools. International Journal of Molecular Medicine. 2011;28:295–310. doi:10.3892/ijmm.2011.705

[54] Ma X, Guo J, Liu HD, Xie JM, Sun X. Sequence-based prediction of DNA-binding residues in proteins with conservation and correlation information. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2012;9:1766–1775. doi:10.1109/TCBB.2012.106

[55] Lv H, Han J, Liu J, Zheng J, Zhong D, Liu R. ISDTool: a computational model for predicting immunosuppressive domain of HERVs. Computational Biology and Chemistry. 2014;49:45–50. doi:10.1016/j.compbiolchem.2014.02.001

[56] Tuvshinjargal N, Lee W, Park B, Han K. Predicting protein-binding RNA nucleotides with consideration of binding partners. Computer Methods and Programs in Biomedicine. 2015;120:3–15. doi:10.1016/j.cmpb.2015.03.010

[57] Fujimoto MS, Suvorov A, Jensen NO, Clement MJ, Bybee SM. Detecting false positive sequence homology: a machine learning approach. BMC Bioinformatics. 2016;17:101. doi:10.1186/s12859-016-0955-3

[58] Reddy TBK, Thomas A, Stamatis D, Bertsch J, Isbandi M, Jansson J, Mallajosyula J, Pagani I, Lobos E, Kyrpides N. The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Research. 2014;43:D1099–D1106. doi:10.1093/nar/gku950

[59] Lin E, Tsai SJ. Genome-wide microarray analysis of gene expression profiling in major depression and antidepressant therapy. Progress in Neuro-Psychopharmacology and Biological Psychiatry. 2016;64:334–340. doi:10.1016/j.pnpbp.2015.02.008

[60] Lim do H, Kim WS, Kim SJ, Yoo HY, Ko YH. Microarray gene-expression profiling analysis comparing PCNSL and non-CNS diffuse large B-cell lymphoma. Anticancer Research. 2015;35:3333–3340.

[61] Kim HS, Lee NK. Gene expression profiling in osteoclast precursors by insulin using microarray analysis. Molecules and Cells. 2014;37:827–832. doi:10.14348/molcells.2014.0223


[73] Wu WS, Wang CC, Jhou MJ, Wang YC. YAGM: a web tool for mining associated genes in yeast based on diverse biological associations. BMC Systems Biology. 2015;9:S1. doi:10.1186/1752-0509-9-S6-S1

[74] PubMed database [Internet]. 2015. http://www.ncbi.nlm.nih.gov/pubmed. Accessed: 2016-03-10

[75] Sanger F, Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of Molecular Biology. 1975;94:441–448. doi:10.1016/0022-2836(75)90213-2

[76] Maxam AM, Gilbert W. A new method for sequencing DNA. Proceedings of the National Academy of Sciences of the USA. 1977;74:560–564. doi:10.1073/pnas.74.2.560

[77] Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes CA, Hutchison CA, Slocombe PM, Smith M. Nucleotide sequence of bacteriophage phi X174 DNA. Nature. 1977;265:687–695. doi:10.1038/265687a0

[78] Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM. Whole-genome random sequencing and assembly of *Haemophilus influenzae* Rd. Science. 1995;269:496–512. doi:10.1126/science.7542800

[79] Oliver GR, Hart SN, Klee EW. Bioinformatics for clinical next generation sequencing. Clinical Chemistry. 2015;61:124–135. doi:10.1373/clinchem.2014.224360

[80] Krampis K, Wultsch C. A review of cloud computing bioinformatics solutions for next-gen sequencing data analysis and research. Methods in Next Generation Sequencing. 2015;2. doi:10.1515/mngs-2015-0003

[81] The Genomes OnLine Database (GOLD) [Internet]. 2016. https://gold.jgi.doe.gov/index. Accessed: 2016-03-08

[82] Karikari TK. Bioinformatics in Africa: the rise of Ghana? PLoS Computational Biology. 2015;11:e1004308. doi:10.1371/journal.pcbi.1004308

[83] Zhang D, Choi DW, Wanamaker S, Fenton RD, Chin A, Malatrasi M, Turuspekov Y, Walia H, Akhunov ED, Kianian P, Otto C, Simons K, Deal KR, Echenique V, Stamova B, Ross K, Butler GE, Strader L, Verhey SD, Johnson R, Altenbach S, Kothari K, Tanaka C, Shah MM, Laudencia-Chingcuanco D, Han P, Miller RE, Crossman CC, Chao S, Lazo GR, Klueva N, Gustafson JP, Kianian SF, Dubcovsky J, Walker-Simmons MK, Gill KS, Dvorák J, Anderson OD, Sorrells ME, McGuire PE, Qualset CO, Nguyen HT, Close TJ. Construction and evaluation of cDNA libraries for large-scale expressed sequence tag sequencing in wheat (*Triticum aestivum* L.). Genetics. 2004;168:595–608. doi:10.1534/genetics.104.034785

[84] Henry RJ, Edwards M, Waters DL, Gopala Krishnan S, Bundock P, Sexton TR, Masouleh AK, Nock CJ, Pattemore J. Application of large-scale sequencing to marker discovery in plants. Journal of Biosciences. 2012;37:829–841. doi:10.1007/s12038-012-9253-z

[85] Uenishi H, Morozumi T, Toki D, Eguchi-Ogawa T, Rund LA, Schook LB. Large-scale sequencing based on full-length-enriched cDNA libraries in pigs: contribution to annotation of the pig genome draft sequence. BMC Genomics. 2012;13:581. doi:10.1186/1471-2164-13-581


[96] Loo JA, Brown J, Critchley G, Mitchell C, Andrews PC, Ogorzalek Loo RR. High sensitivity mass spectrometric methods for obtaining intact molecular weights from gel-separated proteins. Electrophoresis. 1999;20:743–748.

[97] Figeys D, Pinto D. Proteomics on a chip: promising developments. Electrophoresis. 2001;22:208–216.

[98] Wark AW, Lee HJ, Corn RM. Multiplexed detection methods for profiling microRNA expression in biological samples. Angewandte Chemie International Edition in English. 2008;47:644–652. doi:10.1002/anie.200702450

[99] Tom I, Lewin-Koh N, Ramani SR, Gonzalez LC. Protein microarrays for identification of novel extracellular protein-protein interactions. Current Protocols in Protein Science. 2013;Chapter 27:Unit 27.3. doi:10.1002/0471140864.ps2703s72

[100] McKee CJ, Hines HB, Ulrich RG. Analysis of protein tyrosine phosphatase interactions with microarrayed phosphopeptide substrates using imaging mass spectrometry. Analytical Biochemistry. 2013;442:62–67. doi:10.1016/j.ab.2013.07.031

[101] Choi HM, Beck VA, Pierce NA. Next-generation in situ hybridization chain reaction: higher gain, lower cost, greater durability. ACS Nano. 2014;8:4284–4294. doi:10.1021/nn405717p

[102] Omenn GS, Lane L, Lundberg EK, Beavis RC, Nesvizhskii AI, Deutsch EW. Metrics for the Human Proteome Project 2015: progress on the human proteome and guidelines for high-confidence protein identification. Journal of Proteome Research. 2015;14:3452–3460. doi:10.1021/acs.jproteome.5b00499

[103] Strack R. Highly multiplexed transcriptome imaging. Nature Methods. 2015;12:486–487.

[104] Abu-Jamous B, Fa R, Roberts DJ, Nandi AK. Paradigm of tunable clustering using binarization of consensus partition matrices (Bi-CoPaM) for gene discovery. PLoS One. 2013;8:e56432. doi:10.1371/journal.pone.0056432

[105] Lord E, Diallo AB, Makarenkov V. Classification of bioinformatics workflows using weighted versions of partitioning and hierarchical clustering algorithms. BMC Bioinformatics. 2015;16:68. doi:10.1186/s12859-015-0508-1

[106] Chen GK, Chi EC, Ranola JM, Lange K. Convex clustering: an attractive alternative to hierarchical clustering. PLoS Computational Biology. 2015;11:e1004228. doi:10.1371/journal.pcbi.1004228

[107] Bouvier G, Desdouits N, Ferber M, Blondel A, Nilges M. An automatic tool to analyze and cluster macromolecular conformations based on self-organizing maps. Bioinformatics. 2015;31:1490–1492. doi:10.1093/bioinformatics/btu849

[108] Sloan CA, Chan ET, Davidson JM, Malladi VS, Strattan JS, Hitz BC, Gabdank I, Narayanan AK, Ho M, Lee BT, Rowe LD, Dreszer TR, Roe G, Podduturi NR, Tanaka F, Hong EL, Cherry JM. ENCODE data at the ENCODE portal. Nucleic Acids Research. 2016;44:D726–D732. doi:10.1093/nar/gkv1160

[109] Selkoe DJ. Folding proteins in fatal ways. Nature. 2003;426:900–904. doi:10.1038/nature02264


[124] Yates A, Beal K, Keenan S, McLaren W, Pignatelli M, Ritchie GR, Ruffier M, Taylor K, Vullo A, Flicek P. The Ensembl REST API: Ensembl data for any language. Bioinformatics. 2015;31:143–145. doi:10.1093/bioinformatics/btu613

[125] Genereaux BW, Dennison DK. REST enabling the report template library. Journal of Digital Imaging. 2014;27:331–336. doi:10.1007/s10278-013-9668-6

[126] Sundvall E, Nyström M, Karlsson D, Eneling M, Chen R, Örman H. Applying representational state transfer (REST) architecture to archetype-based electronic health record systems. BMC Medical Informatics and Decision Making. 2013;13:57. doi:10.1186/1472-6947-13-57

[127] Understanding SOAP and REST basics and differences [Internet]. 2016. http://blog.smartbear.com/apis/understanding-soap-and-rest-basics. Accessed: 2016-03-11

[128] Open Bioinformatics Foundation [Internet]. 2016. https://en.wikipedia.org/wiki/Open_Bioinformatics_Foundation. Accessed: 2016-03-11

[129] Coulet A, Garten Y, Dumontier M, Altman RB, Musen MA, Shah NH. Integration and publication of heterogeneous text-mined relationships on the Semantic Web. Journal of Biomedical Semantics. 2011;2 Suppl 2:S10. doi:10.1186/2041-1480-2-S2-S10

[130] Ahmed A, Xing EP, Cohen WW, Murphy RF. Structured correspondence topic models for mining captioned figures in biological literature. KDD. 2009;2009:39–48. doi:10.1145/1557019.1557031

[131] Orozco A, Morera J, Jiménez S, Boza R. A review of bioinformatics training applied to research in molecular medicine, agriculture and biodiversity in Costa Rica and Central America. Briefings in Bioinformatics. 2013;14:661–670. doi:10.1093/bib/bbt033

[132] Schneider MV, Walter P, Blatter MC, Watson J, Brazas MD, Rother K, Budd A, Via A, van Gelder CW, Jacob J, Fernandes P, Nyrönen TH, De Las Rivas J, Blicher T, Jimenez RC, Loveland J, McDowall J, Jones P, Vaughan BW, Lopez R, Attwood TK, Brooksbank C. Bioinformatics Training Network (BTN): a community resource for bioinformatics trainers. Briefings in Bioinformatics. 2012;13:383–389. doi:10.1093/bib/bbr064

[133] Attwood TK, Bongcam-Rudloff E, Brazas ME, Corpas M, Gaudet P, Lewitter F, Mulder N, Palagi PM, Schneider MV, van Gelder CW; GOBLET Consortium. GOBLET: the Global Organisation for Bioinformatics Learning, Education and Training. PLoS Computational Biology. 2015;11:e1004143. doi:10.1371/journal.pcbi.1004143

[134] The Swiss Institute of Bioinformatics training portal [Internet]. 2016. http://www.isb-sib.ch/training. Accessed: 2016-03-11

[135] Nunes R, Barbosa de Almeida Júnior E, Pessoa Pinto de Menezes I, Malafaia G. Learning nucleic acids solving by bioinformatics problems. Biochemistry and Molecular Biology Education. 2015;43:377–383. doi:10.1002/bmb.20886

[136] Canadian Bioinformatics Workshops [Internet]. 2016. http://bioinformatics.ca. Accessed: 2016-03-11


**Bioinformatic Methods, Approaches and Analysis Tools**

## **A Bioinformatics Method for the Production of Antibody-Drug Conjugates Through Site-Specific Cysteine Conjugation**

Arianna Filntisi, Dimitrios Vlachakis and George K. Matsopoulos

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/62747

### **Abstract**

Antibody-drug conjugates (ADCs) have emerged as a promising class of targeted anticancer therapy, distinguished from traditional chemotherapeutic approaches by their potential to kill cancer cells with limited side effects. Site-specific conjugation is one of the current challenges in ADC development because it allows for controlled conjugation and the production of homogeneous ADCs. This chapter describes a computational method for generating antibody-drug conjugates as PDB files through site-specific cysteine conjugation, given the PDB files of a drug, a linker, and an antibody. The drug and linker are reconfigured using the rotation and translation functions of an affine transformation, which brings them into the appropriate positions for bonds to form between the three molecules. A hydrogen bond is used to connect the linker with the drug, and a disulfide bond to connect the linker with the antibody. Examples of conjugates produced with the presented method are demonstrated.

**Keywords:** bioinformatics, cancer, targeted therapy, antibody-drug conjugates, cysteines

### **1. Introduction**

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Antibodies are large proteins produced by the immune system against an invader substance called an antigen. As proteins, they consist of one or more chains of amino acid residues, two types of which are cysteines and lysines. Antibodies can also be manufactured and used as standalone therapeutic agents for a number of diseases including cancer. The largest part of the amino acid sequence of an antibody is conserved across all antibodies, with the exception of a small part called the hypervariable (HV) region. Due to the specialized and unique amino acid sequence of the HV region, an antibody can recognize and bind a specific antigen with higher affinity than other antigens, a property called specificity. The specificity and long half-lives of antibodies are valuable features for their therapeutic effect. However, the performance of manufactured antibodies as standalone therapeutic agents is impeded by their limited potency. In contrast, traditional chemotherapeutic drugs are more potent but have short half-lives and no specificity, targeting all cells indiscriminately. Antibody-drug conjugates (ADCs) are a class of targeted anticancer therapy (TAT) that combines the best qualities of antibodies and cytotoxins in one molecule. ADCs are therapeutic agents formed by three elements: an antibody whose target is a cancer-specific molecule; a cytotoxic agent, also called the drug; and a linker molecule that connects them. An ADC ideally delivers the conjugated cytotoxic molecule directly into cancer cells, limiting damage to healthy tissues [1–8].

The typical route of an ADC molecule after its injection into the bloodstream can be described briefly in the following steps. At first, the ADC circulates in the plasma and remains stable, while the drug remains inactive. When the target antigen is encountered, the antibody binds to it. Typically, the ADC-antigen complex then undergoes antigen-mediated or antigen-independent internalization by the cancer cell. Subsequently, the ADC is degraded into its parts through a process that depends on the type of linker used. ADCs containing a non-cleavable linker usually undergo lysosomal digestion: the antibody is degraded to its amino acids, and the final active drug complex is the drug with the linker and a single amino acid. ADCs containing a cleavable linker, on the other hand, are usually degraded through hydrolysis or enzymatic cleavage. At that stage, the drug is activated and kills the cell in a way that depends on the type of cytotoxic agent used. For example, auristatins and maytansinoids disrupt the microtubule assembly of the cell, while calicheamicins and duocarmycins target its DNA structure. An alternative ADC strategy is to target the endothelial cells of the tumor vessels in order to deprive the cancer cells of blood [6, 9–12].

Conjugation usually occurs at solvent-accessible reactive amino acids in the conserved regions of the antibody, leaving the hypervariable regions available for antigen binding. The reactive thiol groups of cysteine residues, made available for conjugation after reduction of the inter-chain disulfide bonds, as well as the amino groups of lysine residues, are typical conjugation sites. However, these methods by default produce ADCs with variable drug-to-antibody ratio (DAR) and conjugation sites, and therefore unstable pharmacokinetic properties. For example, in cysteine conjugation, ADCs with a DAR of eight have a significantly shorter circulation time compared to the unconjugated antibodies. ADCs with a DAR of four, on the other hand, have a longer circulation time than ADCs with double that DAR, but the same therapeutic effect. More homogeneous ADCs can be produced by site-specific conjugation methods, one of which is cysteine engineering. THIOMABs are engineered antibodies in which cysteines have been introduced into the amino acid sequence by substituting original residues. The ideal sites for residue substitution can be identified with phage display techniques. Eight residues on the light chain (LC-V205C, LC-S168C, LC-A153C, LC-S127C, LC-S121C, LC-S114C, LC-V110C, LC-V15C) and five residues on the heavy chain (HC-T116C, HC-S115C, HC-A114C, HC-S113C, HC-S112C) have been investigated as potential cysteine insertion sites (Kabat numbering). These amino acids were substituted with cysteines, generating THIOMABs, which were subsequently conjugated with biotin-maleimide. Considering the conjugation of two molecules per antibody as 100% conjugation efficiency, most THIOMABs demonstrated more than 90% conjugation efficiency, proving suitable for site-specific conjugation of thiol-reactive probes. THIOMAB-drug conjugates (TDCs) are the product of conjugating a drug to a THIOMAB [11, 13–28].


The process of drug discovery and development is rather strenuous, taking up to 15 years to bring a new drug from the early *in vitro* discovery stages to the time it is available as a treatment option. However, drug discovery can be facilitated by bioinformatic techniques. The three-dimensional structure of a molecule can be described computationally in various formats, a prevalent one being the Protein Data Bank (PDB) format [29–31]. Computational drug design and molecular mechanics methods provide a means to model molecules and assess their features. Even though there is a variety of general computer-aided drug design tools, the more specific field of computational ADC design is less evolved. JSDraw Antibody-Drug Conjugates is an editor tool that can be used for ADCs, although it is drawing-oriented and uses 2D coordinates. In addition, a mathematical model was recently developed to describe mechanistically the pharmacokinetic behavior and preclinical efficacy of THIOMAB-drug conjugates [32–41].
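Since the method operates directly on PDB files, it helps to recall that the PDB format is a fixed-column text format: each `ATOM`/`HETATM` record carries the atom name in columns 13–16 and the x, y, z coordinates in columns 31–38, 39–46, and 47–54. A minimal parsing sketch follows; the function name and example records are our own illustration, not part of the chapter's software.

```python
# Minimal sketch of reading atom coordinates from PDB-format text.
# Column ranges follow the wwPDB fixed-column layout (0-indexed slices).

def parse_atoms(pdb_text):
    """Return a list of (atom_name, x, y, z) from ATOM/HETATM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            name = line[12:16].strip()   # columns 13-16: atom name
            x = float(line[30:38])       # columns 31-38: x (angstroms)
            y = float(line[38:46])       # columns 39-46: y
            z = float(line[46:54])       # columns 47-54: z
            atoms.append((name, x, y, z))
    return atoms

# Two illustrative records: a backbone nitrogen and a cysteine sulfur.
example = (
    "ATOM      1  N   CYS A   1      11.104   6.134  -6.504  1.00  0.00           N\n"
    "ATOM      2  SG  CYS A   1      12.000   7.500  -5.000  1.00  0.00           S\n"
)
atoms = parse_atoms(example)
```

A real parser would also track chain identifiers and residue numbers, which the conjugation method needs in order to locate the engineered cysteines.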

An attempt to contribute to the study of ADC computational modeling was made in [42], which described the computational construction of ADCs through lysine conjugation. The three input PDB files of the antibody, the linker, and the drug were processed and merged into a single PDB file representing an ADC molecule. The linker and drug molecules were reconfigured and aligned with the selected lysine residue, and hydrogen bonds linked the three molecules [30, 31].

**Figure 1.** Scheme of the bonds formed between the drug, the linker, and the antibody molecules, according to the method described in this paper.

The present chapter aims to extend the above-mentioned technique to a more site-specific conjugation method, connecting the linker-drug complex to engineered cysteines instead of lysines. The molecule C12S has been used as a non-cleavable linker. The linker and drug form a hydrogen bond (H bond), while the linker-drug complex and the engineered cysteine form a disulfide bond (SS bond) (**Figure 1**). The configurational change of the linker and drug is accomplished using the rotation and translation functions of an affine transformation. Molecular modeling software was used for functions such as the visualization of molecules, the addition of hydrogen atoms to the antibody PDB files, the insertion of cysteine amino acids to the antibody sequence, as well as the generation of the linker PDB file.

### **2. Methods**

The main aspects of the cysteine conjugation method are analyzed here. First, the steps of the method are presented in detail. The Affine Transformation subsection gives the definition of the affine transformation and details its application to this project. The Rotation and Translation subsections focus on how the rotation and translation functions are applied to the reconfiguration of the molecules. The Axes and Distances subsections specify the axes of the molecules involved and the distances between them. Finally, the implementation of the method is described.

### **2.1. Method steps**

First, the antibody, the cytotoxic agent, and the linker that will form the conjugate are determined (step S1). The antibody PDB files we used were obtained from the RCSB database, and the drug PDB files from the NCI database. The linker PDB file was created with the molecular modeling software UCSF Chimera [36]. Given an antibody PDB file, the PDB file of an engineered antibody (THIOMAB) can be produced by replacing selected antibody residues with cysteines using molecular modeling software (step S2). The antibody was selected to have solvent-accessible residues outside its hypervariable region that are suitable for replacement with cysteines and for conjugation. The drug was selected to have at least one hydrogen (H) atom covalently bonded to an electronegative (EN) atom, so that it can form a hydrogen bond with the linker (**Figure 1**). The linker molecule was designed to be able to form a disulfide bond with a cysteine; for that purpose, a sulfur (S) atom was included. The linker was also designed to be long enough for the drug to be linked to the cysteine without colliding with nearby residues; for that purpose, 12 carbon (C) atoms were incorporated in a linear, rather than circular, layout.

Finally, the linker should theoretically be able to release the drug from the antibody inside the cancer cell. For the C12S molecule we used, there are three possible release routes. The hydrogen bond between the drug and linker could break. The disulfide bond between the linker and the cysteine could break due to disulfide exchange, a phenomenon associated with cleavable linkers. Alternatively, lysosomal digestion of the antibody could occur if the disulfide bond between the cysteine and the linker does not break, a mechanism associated with non-cleavable linkers. In that case, the final active drug complex would contain a single residue (the cysteine connected to the linker), the linker, and the drug.
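The drug-selection criterion above (at least one hydrogen covalently bonded to an electronegative atom) can be checked programmatically from a molecule's element and connectivity data. The sketch below is our own illustration, not the authors' code; the function name, the dict-based inputs, and the exact set of electronegative elements are assumptions.

```python
# Hedged sketch of the drug-selection check from step S1: find every
# hydrogen that is covalently bonded to an electronegative atom and could
# therefore donate a hydrogen bond to the linker. The EN set used here
# is an assumption, not taken from the chapter.
ELECTRONEGATIVE = {"N", "O", "F", "S", "CL"}

def hbond_donor_pairs(elements, bonds):
    """elements: {serial: element symbol}; bonds: iterable of (serial, serial) pairs."""
    pairs = []
    for a, b in bonds:
        for h, x in ((a, b), (b, a)):
            if elements[h] == "H" and elements[x] in ELECTRONEGATIVE:
                pairs.append((h, x))  # (hydrogen serial, EN partner serial)
    return pairs
```

A drug qualifies for the method if this list is non-empty; each pair is one candidate conjugation site, matching the remark in Section 2.1 that several final TDCs may be produced.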

Even though the PDB format defines more than 40 record types, three are crucial for representing three-dimensional molecular structure: the ATOM records, which contain information about the atoms of standard amino acids and nucleotides; the HETATM records, which contain information about additional "non-standard", non-polymer chemical components of the molecule; and the CONECT records, which specify the connectivity between the atoms of the molecule [29].
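As a rough illustration of how these three record types can be extracted, the following sketch reads PDB lines using the format's fixed-column layout (record name in columns 1–6, atom serial in 7–11, atom name in 13–16, and x, y, z in columns 31–38, 39–46, 47–54). The function name and the dict representation are our own; a production tool would use an established parser such as Biopython's Bio.PDB.

```python
# Minimal sketch (not the authors' implementation): split a PDB file's
# lines into the three record types the method relies on, using the
# fixed-column layout of the PDB format.
def parse_pdb_records(lines):
    atoms, hetatms, conects = [], [], []
    for line in lines:
        record = line[0:6].strip()
        if record in ("ATOM", "HETATM"):
            entry = {
                "serial": int(line[6:11]),      # columns 7-11
                "name": line[12:16].strip(),    # columns 13-16
                "x": float(line[30:38]),        # columns 31-38
                "y": float(line[38:46]),        # columns 39-46
                "z": float(line[46:54]),        # columns 47-54
            }
            (atoms if record == "ATOM" else hetatms).append(entry)
        elif record == "CONECT":
            # serial numbers follow in consecutive 5-character fields
            serials = [int(line[i:i + 5])
                       for i in range(6, len(line.rstrip()), 5)
                       if line[i:i + 5].strip()]
            conects.append(serials)
    return atoms, hetatms, conects
```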


**Figure 2.** Flowchart of the cysteine conjugation method described in this paper. S1. Determine the three molecules (antibody, drug, linker) that will build the antibody-drug conjugate. S1a. Select the antibody PDB file from the RCSB database (e.g., ab.pdb) according to the criteria described in Section 2.1. S1b. Select the anticancer drug PDB file from the NCI database (e.g., drug.pdb). S1c. Determine the linker PDB file (e.g., linker.pdb). S2. Engineer the antibody (ab.pdb) to create a THIOMAB (e.g., thiomab.pdb), using molecular modeling software. S2a. Select one (or more) native residue(s) of the antibody (ab.pdb) to be replaced by a cysteine residue. The residue should be solvent-accessible, on the surface of the antibody, in order to be accessible to the linker-drug complex, as well as suitable for cysteine replacement and site-specific conjugation. S2b. Replace the selected residue(s) with a cysteine residue. S2c. Add hydrogen atoms to the antibody, since the antibody PDB files from the RCSB database are provided without them. S2d. Save the engineered antibody (THIOMAB) to a new file (thiomab.pdb). S3. Load the necessary PDB records from the PDB files of the engineered antibody (thiomab.pdb), the linker (linker.pdb), and the drug (drug.pdb). Those include (i) the PDB ATOM, HETATM, and CONECT records of the engineered antibody; (ii) the PDB HETATM and CONECT records of the linker; (iii) the PDB HETATM and CONECT records of the drug (the drug and linker files do not contain PDB ATOM records). S4. Prepare the reconfiguration of the drug. S4a. Calculate the axis of the drug as specified in Eqs. (16–18). Briefly, the drug axis is the line connecting the drug atoms participating in the hydrogen bond with the linker, which are a hydrogen (H) atom and the electronegative (EN) atom covalently bonded to it. S4b. Calculate the axis of the linker as specified in Eqs. (19–21). Briefly, the linker axis is the line connecting the two most distant non-hydrogen (NH) atoms of the linker, since that line is quite representative of the shape of the linker. Given the chemical formula of the linker used (C12S), those were always the sulfur (S) atom and its most distant carbon (C) atom. S4c. Find the atoms that will define the distance between the drug and linker according to Eqs. (30–32). Briefly, the distance between the drug and linker is the distance between the atoms participating in the hydrogen bond between the two molecules, which are a hydrogen drug atom and an electronegative linker atom. S5. Reconfigure the drug. S5a. Obtain the coordinates of the drug atoms from the drug PDB HETATM records (drug.pdb). S5b. Rotate the drug using the affine transformation (Eqs. (1–15)). Apply the affine transformation to the drug for *a* = 1, *Tx* = *Ty* = *Tz* = 0 and the values of the variables *rx*, *ry*, *rz* in [−180°, +180°] that satisfy a specific condition, as defined in Eq. (28). Briefly, the condition is the minimization of the absolute value of the angle between the axes of the drug and linker. S5c. Update (optional) the PDB drug records with the modified coordinates from step S5b. S5d. Translate the drug using the affine transformation (Eqs. (1–15)). Apply the affine transformation to the drug for *a* = 1, *rx* = *ry* = *rz* = 0 and the values of the variables *Tx*, *Ty*, *Tz* in [−D, +D] that satisfy a specific condition, defined in Eq. (36). Briefly, the condition is the minimization of the absolute value of the difference between the distance of the two molecules (defined in step S4c) and the nominal length of the hydrogen bond. S5e. Update the PDB records of the drug with the modified coordinates. S5f. Save the updated PDB drug records in a new PDB file (e.g., drug\_rt.pdb). S6. Merge the PDB files of the linker and the reconfigured drug to produce the PDB file of the linker-drug complex. S6a. Renumber the sequence numbers of the PDB records of the reconfigured drug, since they will be placed after the PDB records of the linker in step S6c. S6b. Form computationally a hydrogen bond between the drug and linker by updating the PDB CONECT records of the drug and linker. S6c. Save the necessary data to a new file that will represent the linker-drug complex (e.g., linkerdrug.pdb). Those data include (i) the PDB HETATM records of the linker (linker.pdb; from step S1c); (ii) the renumbered PDB HETATM records of the reconfigured drug (from step S6a); (iii) the updated PDB CONECT records of the linker and drug, including the one that represents the new hydrogen bond (from step S6b). S6d. Load the records of the linker-drug complex (from step S6c) into appropriate structures, to use in steps S7–S9. S7. Prepare the reconfiguration of the linker-drug complex. S7a. Calculate the axis of the linker-drug complex as specified in Eqs. (22–24). Briefly, the axis of the linker-drug complex is the line connecting its two most distant non-hydrogen (NH) atoms. S7b. Calculate the axis of the selected engineered cysteine as specified in Eqs. (25–27). Briefly, the axis of the cysteine residue is the line connecting the alpha carbon (Cα) atom and the sulfur (S) atom of the cysteine side chain. S7c. Find the atoms that will determine the distance between the linker-drug complex and the cysteine residue, as specified in Eqs. (33–35). Briefly, the distance between the linker-drug complex and the cysteine is the distance between the atoms participating in the disulfide bond between the two molecules. S8. Reconfigure the linker-drug complex. S8a. Get the atomic coordinates of the linker-drug complex from its PDB HETATM records (linkerdrug.pdb). S8b. Rotate the linker-drug complex using the affine transformation (Eqs. (1–15)). Apply the affine transformation to the linker-drug complex for *a* = 1, *Tx* = *Ty* = *Tz* = 0 and the values of the variables *rx*, *ry*, *rz* in [−180°, +180°] that satisfy a specific condition, as defined in Eq. (29). Briefly, the condition is the minimization of the absolute value of the angle between the axes of the linker-drug complex and the cysteine residue. S8c. Update (optional) the PDB records of the linker-drug complex with the modified coordinates (from step S8b). S8d. Translate the linker-drug complex using the affine transformation (Eqs. (1–15)). Apply the affine transformation to the linker-drug complex for *a* = 1, *rx* = *ry* = *rz* = 0, and the values of the variables *Tx*, *Ty*, *Tz* that satisfy a specific condition, as defined in Eq. (37). Briefly, the condition is the minimization of the absolute value of the difference between the distance of the two molecules and the nominal length of the disulfide bond. S8e. Update the PDB records of the linker-drug complex with the modified coordinates. S8f. Save the updated PDB records of the linker-drug complex in a new PDB file (e.g., linkerdrug\_rt.pdb). S9. Merge the PDB files of the THIOMAB and the reconfigured linker-drug complex to produce the PDB file of the THIOMAB-drug conjugate. S9a. Renumber the PDB records of the linker-drug complex (from step S8), since they will be placed after the antibody records in step S9c. S9b. Form computationally a disulfide bond between the cysteine and the linker-drug complex, by updating the PDB CONECT records of the antibody and linker. S9c. Save the necessary data to a new file that will represent the final THIOMAB-drug conjugate (e.g., tdc.pdb). Those data include (i) the PDB ATOM and HETATM records of the antibody (thiomab.pdb) (from step S2); (ii) the renumbered PDB HETATM records of the reconfigured linker-drug complex (from step S9a); (iii) the updated PDB CONECT records of the antibody and the linker-drug complex, including the one that contains the disulfide bond (from steps S9a, S9b).

Once the input files of the engineered antibody, linker, and drug are available, the necessary PDB records are loaded into instances of suitable data structures (step S3). The remaining process can be divided into two basic stages: the synthesis of a linker-drug complex from the linker and drug (steps S4–S6), and the synthesis of an antibody-drug conjugate from the linker-drug complex and the engineered antibody (steps S7–S9). More specifically, while the linker remains fixed, the drug undergoes rotation and translation and is brought into a specific position near the linker (step S5). The two molecules are then linked through a hydrogen bond, and their PDB files are combined to form a linker-drug complex (step S6). Next, while the antibody remains fixed, the linker-drug complex undergoes rotation and translation (step S8). The two molecules are linked through a disulfide bond, and their PDB files are combined to produce the final THIOMAB-drug conjugate (step S9). Since a drug molecule can contain more than one pair of atoms able to form a hydrogen bond with the linker, it is possible either to select the drug conjugation site or to produce more than one final TDC. The computational method proposed in this chapter is described step by step (S1–S9) in **Figure 2**.
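The merge steps (S6 and S9) amount to renumbering one molecule's atom serials so they follow the other's, then recording the newly formed bond as an extra connectivity pair. A minimal sketch under our own simplified record representation (dicts with a "serial" key, bonds as serial pairs); the offset-by-maximum-serial renumbering is our assumption about how the chapter's renumbering could be done:

```python
# Illustrative sketch of the merge step (S6/S9): shift the moving
# molecule's serial numbers past the fixed molecule's, shift its bonds
# accordingly, and add the new inter-molecular bond (the hydrogen bond
# in S6, the disulfide bond in S9) as one extra connectivity pair.
def merge_molecules(fixed_atoms, moving_atoms, fixed_bonds, moving_bonds,
                    bond_fixed_serial, bond_moving_serial):
    offset = max(a["serial"] for a in fixed_atoms)
    renumbered = [dict(a, serial=a["serial"] + offset) for a in moving_atoms]
    shifted_bonds = [(i + offset, j + offset) for i, j in moving_bonds]
    new_bond = (bond_fixed_serial, bond_moving_serial + offset)
    return (fixed_atoms + renumbered,
            list(fixed_bonds) + shifted_bonds + [new_bond])
```

Writing the merged atom list and bond list back out as HETATM and CONECT lines then yields the combined PDB file (linkerdrug.pdb or tdc.pdb).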

### **2.2. Affine transformation**


The modification of atomic coordinates serves the purpose of altering the relative positions between the different molecules. However, it should not disrupt the initial lines and distances between the atoms of a single molecule, since that would interfere with the validity of its chemical structure. To that end, an affine transformation was employed for the reconfiguration of a molecule [42–45]. According to [45], the transformation of a rigid body in three-dimensional space can be described with the following equations:

$$x' = R_{00}\,x + R_{01}\,y + R_{02}\,z + R_{03} \tag{1}$$

$$y' = R_{10}\,x + R_{11}\,y + R_{12}\,z + R_{13} \tag{2}$$

$$z' = R_{20}\,x + R_{21}\,y + R_{22}\,z + R_{23} \tag{3}$$

$$R_{00} = a \cos r_z \cos r_y \tag{4}$$

$$R_{01} = a \left(\cos r_z \sin r_y \sin r_x - \sin r_z \cos r_x\right) \tag{5}$$

$$R_{02} = a \left(\cos r_z \sin r_y \cos r_x + \sin r_z \sin r_x\right) \tag{6}$$

$$R_{03} = T_x \tag{7}$$

$$R_{10} = a \sin r_z \cos r_y \tag{8}$$

$$R_{11} = a \left(\sin r_z \sin r_y \sin r_x + \cos r_z \cos r_x\right) \tag{9}$$

$$R_{12} = a \left(\sin r_z \sin r_y \cos r_x - \cos r_z \sin r_x\right) \tag{10}$$

$$R_{13} = T_y \tag{11}$$

$$R_{20} = -a \sin r_y \tag{12}$$

$$R_{21} = a \cos r_y \sin r_x \tag{13}$$

$$R_{22} = a \cos r_y \cos r_x \tag{14}$$

$$R_{23} = T_z \tag{15}$$

where *x*, *y*, *z* are the initial coordinates of a point of the rigid body, *x*′, *y*′, *z*′ are the coordinates of the same point after the transformation, *a* is the scaling factor, *rx*, *ry*, *rz* are the rotation angles of the body around the *x*, *y*, *z* axes, respectively, and *Tx*, *Ty*, *Tz* are the distances by which each point is translated in the *x*, *y*, *z* directions, respectively. From Eqs. (1–15), we can deduce that applying the values *a* = 1, *rx* = *ry* = *rz* = *Tx* = *Ty* = *Tz* = 0 yields *x*′ = *x*, *y*′ = *y*, *z*′ = *z*.

The term point used above refers not to a set of constant coordinates but to an element of the body with variable coordinates. In the context of molecular reconfiguration, each molecule was treated as a rigid body. Each atom of a molecule was regarded as a single point, since an atom is assigned one set of three-dimensional Cartesian coordinates according to the PDB format. The Cartesian coordinates of an atom correspond to the variables *x*, *y*, *z*, *x*′, *y*′, *z*′ of Eqs. (1–15). The scaling factor *a* was always set to 1, because scaling the molecules would disrupt their structure.
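Under these conventions, Eqs. (1–15) translate directly into code. The following sketch is our own transcription (the function name and argument order are assumptions); it applies the transformation to a list of atomic coordinates, with angles in radians:

```python
import math

# Direct transcription of Eqs. (1)-(15): a uniform scale a, rotations
# rx, ry, rz about the x, y, z axes, and a translation (Tx, Ty, Tz).
# With all parameters at their defaults the transform is the identity.
def affine_transform(points, a=1.0, rx=0.0, ry=0.0, rz=0.0,
                     Tx=0.0, Ty=0.0, Tz=0.0):
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    # rows of the 3x4 matrix R from Eqs. (4)-(15)
    R = [
        [a * cz * cy, a * (cz * sy * sx - sz * cx), a * (cz * sy * cx + sz * sx), Tx],
        [a * sz * cy, a * (sz * sy * sx + cz * cx), a * (sz * sy * cx - cz * sx), Ty],
        [-a * sy,     a * cy * sx,                  a * cy * cx,                  Tz],
    ]
    return [(R[0][0] * x + R[0][1] * y + R[0][2] * z + R[0][3],
             R[1][0] * x + R[1][1] * y + R[1][2] * z + R[1][3],
             R[2][0] * x + R[2][1] * y + R[2][2] * z + R[2][3])
            for x, y, z in points]
```

Because every atom of a molecule is passed through the same rigid-body transform, intra-molecular distances are preserved, which is exactly the property the text requires.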

### **2.3. Rotation**


According to the method presented in this chapter, to rotate a molecule A with respect to a fixed molecule B, the affine transformation is applied to A without the translation and scaling functions, i.e., with the parameter values *a* = 1, *Tx* = *Ty* = *Tz* = 0. The values of *rx*, *ry*, *rz* that satisfy a certain condition (defined in Eqs. (28, 29)) are searched for in the interval [−180°, +180°]. Finally, the affine transformation is applied to molecule A for the *rx*, *ry*, *rz* values found. According to **Figure 2**, rotation occurs on two occasions in our method. First, the drug adopts the role of molecule A and the linker the role of molecule B (step S5b). Later, the linker-drug complex adopts the role of molecule A and the cysteine the role of B (step S8b).
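A naive way to realize this rotation step is a brute-force grid search over (*rx*, *ry*, *rz*), keeping the combination that minimizes the angle between the axis of molecule A and the axis of molecule B. The sketch below is illustrative only; the 15° grid step is our assumption, since the chapter does not state the granularity of its search:

```python
import itertools
import math

# Rotation of Eqs. (1)-(15) with a = 1 and Tx = Ty = Tz = 0,
# applied to a single direction vector v = (x, y, z).
def rotate(v, rx, ry, rz):
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    x, y, z = v
    return (cz * cy * x + (cz * sy * sx - sz * cx) * y + (cz * sy * cx + sz * sx) * z,
            sz * cy * x + (sz * sy * sx + cz * cx) * y + (sz * sy * cx - cz * sx) * z,
            -sy * x + cy * sx * y + cy * cx * z)

def angle_between(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

# Coarse grid search in [-180 deg, +180 deg] for the (rx, ry, rz) that
# minimizes the angle between the rotated axis of A and the axis of B,
# mirroring the condition of Eqs. (28, 29).
def best_rotation(axis_a, axis_b, step_deg=15):
    grid = [math.radians(d) for d in range(-180, 181, step_deg)]
    return min(itertools.product(grid, repeat=3),
               key=lambda r: abs(angle_between(rotate(axis_a, *r), axis_b)))
```

A finer step, or a coarse-to-fine refinement, trades run time for alignment accuracy.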

### **2.4. Translation**

According to the method presented in this chapter, to translate a molecule A with respect to a fixed molecule B, the affine transformation is applied to A without the rotation and scaling functions, i.e., with the parameter values *a* = 1, *rx* = *ry* = *rz* = 0. The values of *Tx*, *Ty*, *Tz* that satisfy a certain condition (Eqs. (36, 37)) are searched for in the interval [−D, +D]. Usually, the value *D* = 200 was sufficient to complete the search successfully; otherwise, a larger D value was applied. Finally, the affine transformation is applied to molecule A for the *Tx*, *Ty*, *Tz* values found. According to **Figure 2**, translation occurs on two occasions in our method. First, the drug adopts the role of the molecule A undergoing translation, and the linker the role of the fixed molecule B (step S5d). Later, the linker-drug complex adopts the role of the molecule A undergoing translation, and the cysteine the role of the fixed molecule B (step S8d).
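As an aside, when only the two bonding atoms matter, the minimization in Eqs. (36, 37) admits a closed-form solution: slide the moving molecule along the line connecting the two atoms until their separation equals the target bond length. The sketch below is our own shortcut, not the interval search described above; the coordinates and bond length used in it are illustrative.

```python
import math

# Closed-form version of the translation condition |d - L| = 0:
# compute the (Tx, Ty, Tz) that places moving_atom exactly bond_length
# away from fixed_atom, moving along the line connecting the two atoms.
def translation_to_bond_length(moving_atom, fixed_atom, bond_length):
    dx = fixed_atom[0] - moving_atom[0]
    dy = fixed_atom[1] - moving_atom[1]
    dz = fixed_atom[2] - moving_atom[2]
    d = math.sqrt(dx * dx + dy * dy + dz * dz)
    # move toward the fixed atom, stopping bond_length short of it
    scale = (d - bond_length) / d
    return (dx * scale, dy * scale, dz * scale)
```

Applying the returned (*Tx*, *Ty*, *Tz*) to every atom of molecule A translates it rigidly; a grid search as in the chapter additionally allows collision constraints to be folded into the condition.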

### **2.5. Axes**

As mentioned in Section 2.3, the rotation of a molecule A (i.e., the drug or linker-drug complex) in relation to its corresponding fixed molecule B (i.e., the linker or antibody, respectively) is performed for *a* = 1, *Tx* = *Ty* = *Tz* = 0 and the values of the variables *rx*, *ry*, *rz* that satisfy a certain condition. That condition is the minimization of the absolute value of the angle between the axes of the two molecules (Eqs. (28, 29)).

In the context of this project, the axis of each molecule has been defined by taking into account the desired final layout of the three molecules. In the lysine [34] and cysteine conjugation methods, the same kind of bond connects the linker with the drug. Therefore, in both cases, the drug axis is defined by the drug atoms participating in the hydrogen bond with the linker, which are a hydrogen (H) atom and the electronegative (EN) atom covalently bonded to it [Eqs. (16–18); **Figure 2**, step S4a]. The linker axis connects the two most distant non-hydrogen linker atoms, because that line is quite representative of the linker shape [Eqs. (19–21); **Figure 2**, step S4b]. Given the chemical structure C12S of the linker used in cysteine conjugation, those two atoms are always the sulfur (S) atom and its most distant carbon (C) atom. On a similar note, the axis of the linker-drug complex connects the two most distant non-hydrogen atoms of the complex [Eqs. (22–24); **Figure 2**, step S7a]. The cysteine axis has been defined as the line through the alpha carbon (Cα) and sulfur (S) atoms of the cysteine side chain [Eqs. (25–27); **Figure 2**, step S7b]. This line was chosen for its direction towards the exterior of the antibody, which is the desired direction for the linker-drug complex. In the lysine conjugation of [34], the linker C15N was used; therefore, the linker axis was defined by the nitrogen (N) atom and its most distant carbon (C) atom, and the lysine axis went through the alpha carbon (Cα) and nitrogen (N) atoms of the lysine side chain.
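The "two most distant non-hydrogen atoms" rule used for the linker and linker-drug axes can be computed by a pairwise distance scan. This sketch uses our own (element, coordinates) representation of an atom:

```python
import itertools
import math

# Axis of a molecule as used in steps S4b/S7a: the line through its two
# most distant non-hydrogen (heavy) atoms. Atoms are (element, (x, y, z))
# tuples in this simplification.
def molecule_axis(atoms):
    heavy = [(e, p) for e, p in atoms if e != "H"]
    (e1, p1), (e2, p2) = max(
        itertools.combinations(heavy, 2),
        key=lambda pair: math.dist(pair[0][1], pair[1][1]))
    return p1, p2  # the axis is the line through these two points
```

The O(n²) scan is negligible for a small linker or linker-drug complex; for C12S the endpoints are always the sulfur atom and its most distant carbon, as stated above.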

$$p_{d1} = (x, y, z)_{H \text{ atom of drug}} = (x_{Hd}, y_{Hd}, z_{Hd}) \tag{16}$$

$$p_{d2} = (x, y, z)_{EN \text{ atom of drug}} = (x_{ENd}, y_{ENd}, z_{ENd}) \tag{17}$$

$$\begin{aligned} L_d: (x, y, z) &= p_{d1} + t_d * (p_{d1} - p_{d2}) = (x_{Hd}, y_{Hd}, z_{Hd}) + \\ & t_d * (x_{Hd} - x_{ENd}, y_{Hd} - y_{ENd}, z_{Hd} - z_{ENd}), \quad -\infty < t_d < +\infty \end{aligned} \tag{18}$$

$$p_{l1} = (x, y, z)_{NH \text{ atom of linker}} = (x_{NHl}, y_{NHl}, z_{NHl}) \tag{19}$$

$$p_{l2} = (x, y, z)_{S \text{ atom of linker}} = (x_{Sl}, y_{Sl}, z_{Sl}) \tag{20}$$

$$\begin{aligned} L_l: (x, y, z) &= p_{l1} + t_l * (p_{l1} - p_{l2}) = (x_{NHl}, y_{NHl}, z_{NHl}) + \\ & t_l * (x_{NHl} - x_{Sl}, y_{NHl} - y_{Sl}, z_{NHl} - z_{Sl}), \quad -\infty < t_l < +\infty \end{aligned} \tag{21}$$

$$p_{ld1} = (x, y, z)_{NH \text{ atom of linker-drug}} = (x_{NHld}, y_{NHld}, z_{NHld}) \tag{22}$$

$$p_{ld2} = (x, y, z)_{S \text{ atom of linker-drug}} = (x_{Sld}, y_{Sld}, z_{Sld}) \tag{23}$$

$$\begin{aligned} L_{ld}: (x, y, z) &= p_{ld1} + t_{ld} * (p_{ld1} - p_{ld2}) = (x_{NHld}, y_{NHld}, z_{NHld}) + \\ & t_{ld} * (x_{NHld} - x_{Sld}, y_{NHld} - y_{Sld}, z_{NHld} - z_{Sld}), \quad -\infty < t_{ld} < +\infty \end{aligned} \tag{24}$$

$$p_{cys1} = (x, y, z)_{S \text{ atom of cysteine}} = (x_{Sc}, y_{Sc}, z_{Sc}) \tag{25}$$

A Bioinformatics Method for the Production of Antibody-Drug Conjugates Through Site-Specific Cysteine Conjugation http://dx.doi.org/10.5772/62747 41

$$p_{cys2} = (x, y, z)_{C\alpha \text{ atom of cysteine}} = (x_{Cc}, y_{Cc}, z_{Cc}) \tag{26}$$

$$\begin{aligned} L_{cys}: (x, y, z) &= p_{cys1} + t_{cys} * (p_{cys1} - p_{cys2}) = (x_{Sc}, y_{Sc}, z_{Sc}) + \\ & t_{cys} * (x_{Sc} - x_{Cc}, y_{Sc} - y_{Cc}, z_{Sc} - z_{Cc}), \quad -\infty < t_{cys} < +\infty \end{aligned} \tag{27}$$

$$(r_x, r_y, r_z): \angle (L_d, L_l) = 0^{\circ} \tag{28}$$

$$(r_x, r_y, r_z): \angle (L_{ld}, L_{cys}) = 0^{\circ} \tag{29}$$
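As a minimal illustration of the criterion in Eqs. (28) and (29), the angle between two axes can be computed from their direction vectors; helper names here are ours, not the authors'.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Sketch of the rotation criterion in Eqs. (28)-(29): each axis L is a line
// through two atom positions p1, p2 with direction p1 - p2, and the rotation
// angles rx, ry, rz are chosen so that the angle between the two axis
// directions is driven to 0 degrees.
using Vec3 = std::array<double, 3>;

Vec3 direction(const Vec3& p1, const Vec3& p2) {
    return {p1[0] - p2[0], p1[1] - p2[1], p1[2] - p2[2]};
}

// Angle between two directed axes, in degrees, in the range [0, 180].
double axisAngleDeg(const Vec3& d1, const Vec3& d2) {
    const double dot = d1[0] * d2[0] + d1[1] * d2[1] + d1[2] * d2[2];
    const double n1 = std::sqrt(d1[0] * d1[0] + d1[1] * d1[1] + d1[2] * d1[2]);
    const double n2 = std::sqrt(d2[0] * d2[0] + d2[1] * d2[1] + d2[2] * d2[2]);
    double c = dot / (n1 * n2);
    if (c > 1.0) c = 1.0;     // clamp floating-point rounding noise
    if (c < -1.0) c = -1.0;
    const double pi = std::acos(-1.0);
    return std::acos(c) * 180.0 / pi;
}
```

In the rotation step, *rx*, *ry*, *rz* are varied until this angle is approximately zero.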

#### **2.6. Distances**


40 Bioinformatics - Updated Features and Applications


As mentioned in Section 2.4, the translation of a molecule A (i.e., the drug or the linker-drug complex) in relation to its corresponding fixed molecule B (i.e., the linker or the antibody, respectively) is performed for *a* = 1, *rx* = *ry* = *rz* = 0, and the values of the variables *Tx*, *Ty*, *Tz* that satisfy a certain condition. That condition is the minimization of the absolute value of the difference between the distance of the two molecules and a fixed value *l* (Eqs. (36) and (37)). The definitions of the distance between two molecules and of the fixed value are given below.

Taking into consideration the desired connectivity between the three molecules, in the context of this project, the distance between two molecules was defined as the distance between the atoms that will participate in the intended intermolecular bond (hydrogen or disulfide bond, **Figure 1**). Consequently, the fixed value *l* was defined as the standard length of that intermolecular bond (*lHb* = 1.5–2.5 Å for the hydrogen bond, *lSSb* = 2.05 Å for the disulfide bond).

In more detail, in both lysine and cysteine conjugation, the distance between the drug and the linker was defined by the drug hydrogen (H) atom and the linker electronegative (EN) atom participating in the hydrogen bond between the linker and the drug [Eqs. (30–32); **Figure 2**, step S4c]. Regarding the distance between the linker-drug complex and the conjugation residue, in lysine conjugation it was defined by the linker nitrogen (N) atom and the lysine hydrogen (H) atom bonded to the side chain nitrogen atom. In cysteine conjugation, however, it has been defined as the distance between the linker sulfur (S) atom and the sulfur atom of the cysteine side chain [Eqs. (33–35); **Figure 2**, step S7c]. The translation of the drug in relation to the linker is performed for the values of the variables *Tx*, *Ty*, *Tz* for which the distance between the two molecules takes the standard length *lHb* = 1.5–2.5 Å of the hydrogen bond (Eq. (36)). The translation of the linker-drug complex in relation to the cysteine is performed for the values of the variables *Tx*, *Ty*, *Tz* for which the distance between the two molecules takes the standard length *lSSb* = 2.05 Å of the disulfide bond (Eq. (37)):

$$p_d = (x, y, z)_{H \text{ atom of drug}} = (x_{Hd}, y_{Hd}, z_{Hd}) = p_{d1} \tag{30}$$

$$p_{l1} = (x, y, z)_{EN \text{ atom of linker}} = (x_{ENl}, y_{ENl}, z_{ENl}) = (x_{Cl}, y_{Cl}, z_{Cl}) \tag{31}$$

$$D_{d-l} = \left\| p_d - p_{l1} \right\| \tag{32}$$

$$p_{l2} = (x, y, z)_{S \text{ atom of linker}} = (x_{Sl}, y_{Sl}, z_{Sl}) \tag{33}$$

$$p_{cys} = (x, y, z)_{S \text{ atom of cysteine}} = (x_{Scys}, y_{Scys}, z_{Scys}) \tag{34}$$

$$D_{cys-l} = \left\| p_{cys} - p_{l2} \right\| \tag{35}$$

$$(T_x, T_y, T_z): \left| D_{d-l} - l_{Hb} \right| = 0 \tag{36}$$

$$(T_x, T_y, T_z): \left| D_{cys-l} - l_{SSb} \right| = 0 \tag{37}$$

where the variables *Tx*, *Ty*, *Tz* of Eq. (36) refer to the translation of the drug (**Figure 2**, step S5d), while the variables *Tx*′, *Ty*′, *Tz*′ of Eq. (37) refer to the translation of the linker-drug complex (**Figure 2**, step S8d).
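Although the chapter describes a search over *Tx*, *Ty*, *Tz*, the satisfying translation can also be written in closed form once the axes are aligned. The following C++ sketch is our own shortcut, with illustrative names: it shifts the mobile molecule along the line joining the two bonding atoms until their separation equals the target bond length.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Closed-form sketch of the translation step of Eqs. (36)-(37): shifting the
// mobile molecule along the line that joins the two bonding atoms makes
// their separation exactly the target bond length l (e.g. lSSb = 2.05 A for
// the disulfide bond).
using Vec3 = std::array<double, 3>;

// Rigid translation (Tx, Ty, Tz) to apply to every atom of the mobile molecule.
Vec3 translationForBond(const Vec3& mobileAtom, const Vec3& fixedAtom, double l) {
    const Vec3 v = {fixedAtom[0] - mobileAtom[0],
                    fixedAtom[1] - mobileAtom[1],
                    fixedAtom[2] - mobileAtom[2]};
    const double d = std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    const double s = (d - l) / d;   // fraction of the current gap to close
    return {v[0] * s, v[1] * s, v[2] * s};
}
```

Applying the returned vector to every atom of the mobile molecule leaves the bonding atoms exactly *l* apart.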

#### **2.7. Implementation**

Steps S1 and S2 (see **Figure 2**), which involve the selection and preparation of the input files of the engineered antibody, the linker, and the drug, as well as the selection of the conjugation cysteine residue, are carried out manually. Steps S3–S9 have been developed as a C++ program and can therefore be executed automatically. To handle the necessary data management, a number of classes and functions were created for the PDB records used, for three-dimensional points and lines, and for the affine transformation.
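For illustration, the affine-transformation helper mentioned above might look like the following C++ sketch; the class and member names are ours, not taken from the authors' code.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Illustrative affine transformation: a point is scaled by a, rotated about
// the x-, y-, and z-axes by rx, ry, rz, and finally translated by
// (Tx, Ty, Tz).
using Vec3 = std::array<double, 3>;

struct Affine {
    double a = 1.0;                  // uniform scale (a = 1 throughout this chapter)
    double rx = 0, ry = 0, rz = 0;   // rotation angles in radians
    double Tx = 0, Ty = 0, Tz = 0;   // translation components

    Vec3 apply(const Vec3& p) const {
        double x = a * p[0], y = a * p[1], z = a * p[2];
        // rotation about the x-axis
        double y1 = y * std::cos(rx) - z * std::sin(rx);
        double z1 = y * std::sin(rx) + z * std::cos(rx);
        y = y1; z = z1;
        // rotation about the y-axis
        double x1 = x * std::cos(ry) + z * std::sin(ry);
        z1 = -x * std::sin(ry) + z * std::cos(ry);
        x = x1; z = z1;
        // rotation about the z-axis
        x1 = x * std::cos(rz) - y * std::sin(rz);
        y1 = x * std::sin(rz) + y * std::cos(rz);
        x = x1; y = y1;
        // translation
        return {x + Tx, y + Ty, z + Tz};
    }
};
```

With *a* = 1 and *rx* = *ry* = *rz* = 0 this reduces to the pure translation of Section 2.4, and with *Tx* = *Ty* = *Tz* = 0 to the pure rotation of Section 2.3.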

### **3. Results**

The method described in the previous section has been applied to various antibody and drug combinations, taking about 10–13 seconds for each TDC to be generated. In **Figure 3**, the basic stages of the production of a THIOMAB-drug conjugate are depicted. First, the drug and linker are formed in two independent PDB files. The axis of each molecule is depicted as a white line, and the initial angle between them is 98.2° (**Figure 3a**). The drug is rotated, and the angle between the two axes becomes 0.08° (**Figure 3b**). The initial distance between the atoms that will form the hydrogen bond (the C atom of the linker and the H atom of the drug, bonded covalently to an O drug atom) is 28.02 Å. The drug is translated and the distance between the two atoms becomes 1.51 Å (**Figure 3c**). The linker has remained fixed. The two molecules are merged into a linker-drug complex, whose new axis is depicted as a white line (**Figure 3d**). The residue valine of the antibody with PDB id 4GAG, with sequence number 206 on the light chain (LC-VAL206), is replaced by a cysteine, whose axis is calculated (**Figure 3e**). The initial angle between the axis of the linker-drug complex and the axis of the cysteine is 105.6°, and the distance between the sulfur atoms of the cysteine and the linker is 54.75 Å. The linker-drug complex is rotated and translated, so that the angle becomes 0.083° and the distance becomes 2.07 Å. Finally, the linker-drug complex is connected with the cysteine (**Figure 3f**).


**Figure 3.** The basic stages of the computational production of a TDC. (a) Left: The linker molecule C12S. Right: The drug with NCI sequence number 9, before its reconfiguration. Angle between the two axes: 98.2°. (b) The linker and the rotated drug. Angle between the two axes: 0.08°. Distance between the H atom of the drug and the C atom of the linker: 28.02 Å. (c) The linker and the translated drug. Distance between the H atom of the drug and the C atom of the linker: 1.51 Å. (d) The linker-drug complex. (e) The engineered cysteine that has replaced the residue LC-VAL206 on the antibody with PDB id 4GAG. Initial angle between the axes of the cysteine and the linker-drug complex: 105.6°. Initial distance between the two S atoms: 54.75 Å. (f) The linker-drug complex conjugated to the engineered cysteine. Angle between the axes of the cysteine and the linker-drug complex: 0.084°. Distance between the two S atoms: 2.07 Å.

More examples of produced TDCs are shown in **Figure 4**. The TDC composed of the antibody with PDB id 4GAJ, the linker molecule C12S, and the drug with sequence number 5 (nci\_5.pdb) is depicted in **Figure 4a**. A hydrogen bond connects the drug with the linker, and a disulfide bond connects the engineered cysteine on the antibody with the linker-drug complex. The cysteine has replaced the solvent-accessible amino acid alanine with residue sequence number 114 on the heavy chain of the antibody (HC-ALA114). The same TDC is depicted from a closer distance on the right of the figure. In **Figure 4b**, a similar TDC composed of the antibody with PDB id 4GAJ, the linker molecule C12S, and the drug with sequence number 7 (nci\_7.pdb) is depicted. The above technique can be used for the production of TDCs with more than one drug molecule per antibody. **Figure 4c** depicts a TDC with a DAR of 2, composed of the antibody with PDB id 4GAG, to which two drug molecules with sequence number 1 (nci\_1.pdb) have been conjugated. The engineered cysteines that serve as conjugation sites have replaced the solvent-accessible residues alanine with sequence number 114 on the heavy chain (HC-ALA114) and valine with sequence number 206 on the light chain (LC-VAL206). The conjugation areas of the same TDC are shown from a closer distance on the right of the figure. The linker and the drugs used in the examples are depicted in **Figure 5**.


**Figure 4.** (a) Left: TDC composed of the antibody with PDB id 4GAJ, the linker molecule C12S, and the drug with sequence number 5. The drug has been conjugated through the linker to the engineered cysteine that has replaced the residue HC-ALA114. Right: The same TDC from a closer distance. (b) Left: TDC composed of the antibody with PDB id 4GAJ, the linker molecule C12S, and the drug with sequence number 7. The drug has been conjugated through the linker to the engineered cysteine that has replaced the residue HC-ALA114. Right: The same TDC from a closer distance. (c) Left: TDC composed of the antibody with PDB id 4GAG, two linker molecules C12S, and two drug molecules with sequence number 1. The antibody conjugation sites are engineered cysteines that have replaced the residues HC-ALA114 and LC-VAL206. Right: The same TDC from a closer distance.

**Figure 5.** (a) The linker molecule C12S. (b) The cytotoxic drugs used in **Figures 3** and **4**. Top left: the drug with sequence number 1 (nci\_1.pdb). Top right: the drug with sequence number 5 (nci\_5.pdb). Bottom left: the drug with sequence number 7 (nci\_7.pdb). Bottom right: the drug with sequence number 9 (nci\_9.pdb).

### **4. Discussion**

This chapter has presented a computational method for the site-specific formulation of ADCs by conjugating the cytotoxin and the linker to engineered cysteines on the surface of the antibody. A hydrogen bond is employed to connect the linker with the drug, forming a linker-drug conjugate, which is then attached through a disulfide bond to the selected engineered cysteine. The change of the atomic coordinates of a molecule was implemented with an affine transformation. The generation of the linker molecule C12S, as well as the cysteine replacement, was achieved through molecular modeling software [36, 42].

The present chapter proposes a general way to construct ADC molecules computationally using data from established databases. The common antibodies used do not target cancer-specific molecules, as opposed to antibodies that are parts of real ADCs. The employment of antibodies with cancer-specific antigens was not feasible, since their three-dimensional structures are not available. However, considering that the conjugation occurs in the conserved regions of the antibody, the same techniques could be applied using the PDB file of any full antibody. In addition, conventional anticancer drugs were used, since the three-dimensional structures of the more potent drugs currently used in ADCs are not available. It is important to note that the results of the presented method are not proposed as real anticancer drugs. The presented method is an attempt to simulate ADCs using programming techniques and available data.

Site-specific conjugation is a substantial goal in ADC development, since it makes it possible to control the pharmacokinetic and therapeutic properties of the produced ADCs. Aside from the replacement of residues with cysteines, mentioned in Section 1, a similar concept is the substitution of cysteines with serines in order to reduce the number of inter-chain cysteines available for conjugation. Another technique is the introduction of non-native amino acids into the antibody sequence, such as selenocysteine, para-acetylphenylalanine, and para-azidophenylalanine, utilizing their ability to provide orthogonal conjugation chemistries that are not available from the original residues. Enzymes have been used to help form bonds between the antibody, the linker, and the drug, a method called enzymatic conjugation. An additional strategy is conjugation at the glycosylation sites of the antibody, leaving the amino acid sequence of the antibody and the cell culture conditions intact. For example, sialic acid moieties can be introduced enzymatically into the antibody and can then be oxidized to introduce aldehyde groups. The resulting antibodies can then form stable oxime bonds with drugs containing aldehyde-reactive aminooxy groups [22, 46–59].

A number of actions could be taken to improve and expand the work described in this chapter. Our future goals include the implementation of more site-specific conjugation methods; the use of antibody formats alternative to the full monoclonal antibody, such as scFvs, diabodies, and minibodies; and the use of other cleavable and non-cleavable linker molecules. In addition, the automatic recognition of all amino acids available for conjugation in a given antibody would be of interest, as well as the estimation of the best drugs and linkers for a given conjugation method. Finally, the conditions and equations used could be modified to take other factors into consideration.

### **Acknowledgements**


Funding was received through an IKY Fellowship of Excellence for Postgraduate Studies in Greece – Siemens Program. The authors confirm that the funder had no influence over the study design, the content of the chapter, or the selection of this book.

### **Author details**

Arianna Filntisi1,2\*, Dimitrios Vlachakis2 and George K. Matsopoulos1

\*Address all correspondence to: arianna.filntisi@gmail.com

1 School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece

2 Computational Biology and Medicine Group, Biomedical Research Foundation of the Academy of Athens, Athens, Greece

### **References**


[1] National Cancer Institute at the National Institutes of Health. 2014. Available from: http://www.cancer.gov/ [Accessed: 2014–03]

[2] American Cancer Society. 2014. Available from: http://www.cancer.org/ [Accessed: 2014–03]

[3] Cancer Research Institute. 2014. Available from: http://www.cancerresearch.org/ [Accessed: 2014–02]

[4] Li J, Chen F, Cona MM, Feng Y, Himmelreich U, Oyen R, et al. A review on various targeted anticancer therapies. Target Oncology. 2012;7(1):69–85. DOI: 10.1007/s11523-012-0212-2

[5] Adler MJ, Dimitrov DS. Therapeutic antibodies against cancer. Hematol Oncol Clin North Am. 2012;26(3):447–481. DOI: 10.1016/j.hoc.2012.02.013

[6] Sapra P, Shor B. Monoclonal antibody-based therapies in cancer: advances and challenges. Pharmacol Ther. 2013;138(3):452–469. DOI: 10.1016/j.pharmthera.2013.03.004

[7] Goldmacher VS, Chittenden T, Chari RVJ, Kovtun YV, Lambert JM. Antibody-drug conjugates for targeted cancer therapy. In: Desai MC, editor. Annual Reports in Medicinal Chemistry, Volume 47. Academic Press; 2012. p. 349–366.

[8] Casi G, Neri D. Antibody-drug conjugates: basic concepts, examples and future perspectives. J Control Release. 2012;161(2):422–428. DOI: 10.1016/j.jconrel.2012.01.026

[9] Kovtun YV, Goldmacher VS. Cell killing by antibody-drug conjugates. Cancer Lett. 2007;255(2):232–240. DOI: 10.1016/j.canlet.2007.04.010

[10] Gerber H-P, Senter PD, Grewal IS. Antibody drug-conjugates targeting the tumor vasculature: current and future developments. MAbs. 2009;1(3):247–253. DOI: 10.4161/mabs.1.3.8515

[11] Wu AM, Senter PD. Arming antibodies: prospects and challenges for immunoconjugates. Nat Biotechnol. 2005;23(9):1137–1146. DOI: 10.1038/nbt1141

[12] Perez HL, Cardarelli PM, Deshpande S, Gangwar S, Schroeder GM, Vite GD, et al. Antibody-drug conjugates: current status and future directions. Drug Discov Today. 2014;19(7):869–881. DOI: 10.1016/j.drudis.2013.11.004

[13] Alley SC, Anderson KE. Analytical and bioanalytical technologies for characterizing antibody-drug conjugates. Curr Opin Chem Biol. 2013;17(3):406–411. DOI: 10.1016/j.cbpa.2013.03.022

[14] Iyer U, Kadambi VJ. Antibody drug conjugates-Trojan horses in the war on cancer. J Pharmacol Toxicol Methods. 2011;64(3):207–212. DOI: 10.1016/j.vascn.2011.07.005

[15] Carter PJ. Potent antibody therapeutics by design. Nat Rev Immunol. 2006;6(5):343–357. DOI: 10.1038/nri1837


[28] Boylan NJ, Zhou W, Proos RJ, Tolbert TJ, Wolfe JL, Laurence JS. Conjugation site heterogeneity causes variable electrostatic properties in Fc conjugates. Bioconjug Chem. 2013;24(6):1008–1016. DOI: 10.1021/bc4000564

[29] Worldwide PDB (wwPDB) organization. 2003 [Updated: 2016]. Available from: http://www.wwpdb.org/ [Accessed: 2015]

[30] RCSB Protein Data Bank. 2000 [Updated: 2016]. Available from: http://www.rcsb.org [Accessed: 2016]

[31] Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucl Acids Res. 2000;28(1):235–242. DOI: 10.1093/nar/28.1.235

[32] Chemical Computing Group Inc. Molecular Operating Environment (MOE). 2013. [Accessed: 2014]

[33] Filntisi A, Papangelopoulos N, Bencurova E, Kasampalidis I, Matsopoulos G, Vlachakis D, et al. State-of-the-art neural networks applications in biology. Int J Syst Biol Biomed Technol. 2013;2(4):63–85. DOI: 10.4018/ijsbbt.2013100105

[34] Filntisi A, Vlachakis D, Matsopoulos G, Kossida S. 3D structural bioinformatics of proteins and antibodies: state of the art perspectives and challenges. Int J Syst Biol Biomed Technol. 2013;2(3):67–74. DOI: 10.4018/ijsbbt.2013070105

[35] Vlachakis D, Tsagrasoulis D, Megalooikonomou V, Kossida S. Introducing Drugster: a comprehensive and fully integrated drug design, lead and structure optimization toolkit. Bioinformatics. 2013;29(1):126–128. DOI: 10.1093/bioinformatics/bts637

[36] Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, et al. UCSF Chimera-a visualization system for exploratory research and analysis. J Comput Chem. 2004;25(13):1605–1612. DOI: 10.1002/jcc.20084

[37] Toshimoto K, Wakayama N, Kusama M, Maeda K, Sugiyama Y, Akiyama Y. In silico prediction of major drug clearance pathways by support vector machines with feature-selected descriptors. Drug Metab Dispos. 2014;42(11):1811–1819. DOI: 10.1124/dmd.114.057893

[38] Mitra A, Kesisoglou F, Dogterom P. Application of absorption modeling to predict bioequivalence outcome of two batches of etoricoxib tablets. AAPS PharmSciTech. 2015;16(1):76–84. DOI: 10.1208/s12249-014-0194-8

[39] Vlachakis D, Kossida S. Antibody drug conjugate bioinformatics: drug delivery through the letterbox. Comput Math Methods Med. 2013;2013:282398. DOI: 10.1155/2013/282398

[40] Scilligence. JSDraw - Antibody-Drug Conjugate [Internet]. 2014 [Updated: 2015]. Available from: http://www.elncloud.com/jsdrawapp/jsdraw/ADC.htm [Accessed: 2016]

[41] Sukumaran S, Gadkar K, Zhang C, Bhakta S, Liu L, Xu K, et al. Mechanism-based pharmacokinetic/pharmacodynamic model for THIOMAB™ drug conjugates. Pharm Res. 2015;32(6):1884–1893. DOI: 10.1007/s11095-014-1582-1


[53] Kiick KL, Saxon E, Tirrell DA, Bertozzi CR. Incorporation of azides into recombinant proteins for chemoselective modification by the Staudinger ligation. Proc Natl Acad Sci U S A. 2002;99(1):19–24. DOI: 10.1073/pnas.012583299

[54] Hofer T, Skeffington LR, Chapman CM, Rader C. Molecularly defined antibody conjugation through a selenocysteine interface. Biochemistry. 2009;48(50):12047–12057. DOI: 10.1021/bi901744t

[55] Hofer T, Thomas JD, Burke TR, Rader C. An engineered selenocysteine defines a unique class of antibody derivatives. Proc Natl Acad Sci U S A. 2008;105(34):12451–12456. DOI: 10.1073/pnas.0800800105

[56] Li X, Yang J, Rader C. Antibody conjugation via one and two C-terminal selenocysteines. Methods. 2014;65(1):133–138. DOI: 10.1016/j.ymeth.2013.05.023

[57] Zimmerman ES, Heibeck TH, Gill A, Li X, Murray CJ, Madlansacay MR, et al. Production of site-specific antibody-drug conjugates using optimized non-natural amino acids in a cell-free expression system. Bioconjug Chem. 2014;25(2):351–361. DOI: 10.1021/bc400490z

[58] Axup JY, Bajjuri KM, Ritland M, Hutchins BM, Kim CH, Kazane SA, et al. Synthesis of site-specific antibody-drug conjugates using unnatural amino acids. Proc Natl Acad Sci U S A. 2012;109(40):16101–16106. DOI: 10.1073/pnas.1211023109

[59] Sochaj AM, Świderska KW, Otlewski J. Current methods for the synthesis of homogeneous antibody-drug conjugates. Biotechnol Adv. 2015;33(6):775–784. DOI: 10.1016/j.biotechadv.2015.05.001

### **Databases and Algorithms in Allergen Informatics**

Kiran Kadam, Sangeeta Sawant, V.K. Jayaraman and Urmila Kulkarni-Kale

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/63083

#### **Abstract**

Allergic diseases are considered one of the major health problems worldwide due to their increasing prevalence. Advancements in genomic, proteomic, and analytical techniques have resulted in considerable progress in the field of allergology, which has led to the accumulation of huge amounts of data. Allergen bioinformatics comprises allergen-related data resources and computational methods/tools, which deal with the efficient archival, management, and analysis of allergological data. Significant work has been done in the area of allergen bioinformatics that has proven pivotal for the development and progress of this field. In this chapter, we describe the current status of databases and algorithms encompassing the field of allergen bioinformatics by examining the work carried out thus far with respect to features such as allergens and allergenicity, allergen databases, algorithms/tools for allergen/allergenicity prediction, allergen epitope prediction, and allergenic cross-reactivity assessment. This chapter illustrates concepts and algorithms in allergen bioinformatics and outlines the key areas for potential development in the field of allergology.

**Keywords:** allergen databases, allergen epitope prediction, allergenicity, allergenic proteins, bioinformatics

### **1. Introduction**

The immune system is a very complex system comprising numerous biological molecules and processes, which combine to form the body's defense against infectious agents and other threats. Immunity is basically divided into two types: innate immunity and adaptive immunity [1]. Innate immunity, also referred to as natural, native, or nonspecific immunity, acts as a first line of defense against common harmful agents. The innate immune response provides immediate protection and involves a number of components such as monocytes, macrophages, neutrophils, cytokines, complement, and epithelial barriers. Adaptive or acquired immunity comprises highly specific immune responses that are elicited against particular pathogens or antigens. These immune responses are either cell mediated or antibody mediated (humoral) and executed by specialized lymphocytes or immunoglobulins, respectively. On certain occasions, the immune system produces immune responses that are harmful for the host organism. Autoimmunity denotes one such case, wherein the body elicits immune responses against its own cells and tissues (self-antigens), which leads to the development of autoimmune diseases. In some cases, the immune system produces inappropriate immune responses known as hypersensitivity, which have deleterious effects on the host organism. Hypersensitivity reactions are categorized into four groups based on the type of immune response and the effector mechanism involved. These are (i) immediate hypersensitivity (type I), (ii) antibody-mediated hypersensitivity (type II), (iii) immune complex–mediated hypersensitivity (type III), and (iv) cell-mediated hypersensitivity (type IV) [2].

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Allergic reactions are type I hypersensitivity reactions, characterized by the induction of a specific class of antibodies known as immunoglobulin E (IgE). These reactions are elicited against a specific type of antigen commonly referred to as allergens. An allergic reaction involves specialized cells and specific molecules of the immune system [3]. IgE antibodies induced by allergens upon allergic sensitization bind to effector cells such as basophils and mast cells via specific Fc receptors present on the surfaces of those cells. Subsequent exposure to the allergen causes cross-linking of membrane-bound IgE on these effector cells, which leads to their degranulation and the release of pharmacologically active agents such as histamine. These pharmacological mediators are responsible for the clinical manifestations of allergic reactions in affected individuals. Immunogenicity in general refers to the potential of an antigen to elicit an immune response, while in the case of allergens, allergenicity is considered a reflection of this potential. Allergenicity indicates the capability of an allergen to induce clinical symptoms of allergy as well as to induce and bind IgE antibodies [4]. The prevalence of allergic reactions has increased significantly in the last few years, especially in developing countries [5]. This has resulted in a considerable increase in disease burden as well as in the economic costs associated with these diseases. Therefore, the study of allergic diseases has gained tremendous importance, as they represent one of the major health problems in urban and rural regions.

The field of allergy research has progressed rapidly in the last few years [6]. Recent advances in genomic, proteomic, and analytical methods have given rise to large amounts of data relevant to allergens. These data can be correlated with the pathology of various allergic diseases based on experimental, clinical, and epidemiological data for allergic reactions. The continuous growth of data calls for its efficient archival, management, and analysis. This has led to the development of the field of allergen informatics, which comprises allergen-specific databases/resources and computational methods/tools. Allergen informatics constitutes an important branch of immunoinformatics [7]. In this chapter, we review the existing status of allergen informatics with respect to important aspects such as allergens and allergenicity, allergen databases, algorithms/tools for allergen/allergenicity prediction, allergen epitope prediction, and allergenic cross-reactivity assessment (**Figure 1**).

**Figure 1.** Topics covered for allergen informatics in the current review.

### **2. Allergens and allergenicity**


Allergens are the most critical component of an allergic reaction, although IgE antibodies, Fc receptors, mast cells, and basophils, as well as pharmacological mediators such as histamine and heparin, also play very significant roles. Allergens are ubiquitous substances that arise from a variety of sources such as foods, plants, animals, or the environment. An allergen can either be a chemical substance (e.g., penicillin) or a protein (e.g., albumin, profilin). The majority of allergens are proteins or glycoproteins that possess high water solubility. Several biochemical and structural features of allergens, such as stability, hydrophobicity, and ligand-binding domains, are known to contribute to their allergenicity [8]. However, common molecular and structural features of allergens that are responsible for allergenicity have not yet been conclusively identified.

Allergens are given a unique, unambiguous, and systematic nomenclature that has been developed and maintained by the World Health Organization (WHO) and the International Union of Immunological Societies' (IUIS) Allergen Nomenclature Sub-committee [9, 10]. The nomenclature is based on the Linnaean system, and an allergen that satisfies certain biochemical and immunological criteria is included in the WHO/IUIS nomenclature. An allergen name consists of an abbreviation of the scientific name of the allergen source organism: the first three to four letters denote the genus, the subsequent one to two letters represent the species, and these are followed by an Arabic numeral that denotes the order of identification. For instance, Der p 1 represents the first allergen to be characterized from the house dust mite *Dermatophagoides pteronyssinus*. An allergen may possess isoallergens or isoforms/variants, which are considered multiple molecular forms of the same allergen. The WHO/IUIS nomenclature defines an isoallergen as an allergen belonging to a single species, with a similar molecular size and identical biological function, and possessing ≥67% amino acid sequence identity, while a variant or isoform corresponds to allergen sequences that differ by only a limited number of amino acid substitutions [11]. It is very important to archive and study data on isoallergens and isoforms/variants in a differentiated manner, as it has been shown that variations in allergens significantly affect their allergenicity and cross-reactivity, as well as influence the recognition of epitopes by T cells and IgE [12]. An allergen can be considered a major or minor allergen based on the measure of its allergenicity: major allergens are those to which >50% of patients with an allergy to its source are sensitized, while minor allergens are recognized by a limited number of patients [13].
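The naming convention above is regular enough to handle programmatically when cleaning or cross-referencing allergen lists. The sketch below is a minimal, unofficial illustration (the function and pattern names are mine, not part of any WHO/IUIS tool); it splits a name such as `Der p 1` into genus abbreviation, species abbreviation, and identification number, and also accepts a four-digit isoallergen/variant suffix such as `Der p 1.0102`:

```python
import re

# Pattern following the WHO/IUIS conventions described above: a 3-4 letter
# genus abbreviation, a 1-2 letter species abbreviation, an Arabic numeral,
# and an optional isoallergen/variant suffix (e.g. ".0102").
ALLERGEN_NAME = re.compile(
    r"^(?P<genus>[A-Z][a-z]{2,3})\s+"   # e.g. "Der" (Dermatophagoides)
    r"(?P<species>[a-z]{1,2})\s+"       # e.g. "p" (pteronyssinus)
    r"(?P<number>\d+)"                  # order of identification
    r"(?:\.(?P<variant>\d{4}))?$"       # optional isoallergen/variant code
)

def parse_allergen_name(name: str) -> dict:
    """Split a WHO/IUIS-style allergen name into its components."""
    m = ALLERGEN_NAME.match(name.strip())
    if m is None:
        raise ValueError(f"not a WHO/IUIS-style allergen name: {name!r}")
    return {k: v for k, v in m.groupdict().items() if v is not None}

print(parse_allergen_name("Der p 1"))
# {'genus': 'Der', 'species': 'p', 'number': '1'}
print(parse_allergen_name("Der p 1.0102"))
# {'genus': 'Der', 'species': 'p', 'number': '1', 'variant': '0102'}
```

Such a parser is convenient when merging entries from databases that format names slightly differently (extra whitespace, appended variant codes).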

Allergens display important features such as epitopes and cross-reactivity that are critical for understanding allergic reactions and developing new approaches for the diagnosis and treatment of allergic diseases. An epitope, or antigenic determinant, is the immunologically active region of the allergen. An epitope can be an IgE-binding epitope or a T-cell epitope, depending on whether it interacts with an IgE antibody or a T-lymphocyte. An IgE epitope can be either sequential (linear), consisting of a contiguous stretch of amino acids, or conformational (discontinuous), comprising amino acids present at different loci in an antigen. An antibody is said to be cross-reactive when it recognizes and binds to multiple antigens.

### **2.1. IgE-binding epitopes**

IgE-binding epitopes are the IgE recognition sites in allergens that are involved in the specific interaction between allergens and IgE antibodies. Inferences drawn from allergen–antibody complexes and other important studies have shown that the majority of IgE-binding epitopes are conformational in nature [14]. IgE epitopes possess some defining structural and immunological features: they are more cross-reactive in nature and have higher intrinsic flexibility. These features make them distinct from other antibody epitopes and contribute significantly to allergenicity [15, 16]. Identification and in-depth analysis of IgE-binding epitopes has the potential to contribute immensely to the accurate diagnosis and allergen-specific immunotherapy of allergies, especially food allergy [17, 18]. Large amounts of data on allergen epitopes are generated by employing strategies based on the use of overlapping synthetic peptides, recombinant allergenic fragments, cocrystal structure complexes, etc. However, it is believed that insights obtained from the study of allergen–antibody complexes will be the most helpful in understanding the role these epitopes play in allergic reactions [19–21].

### **2.2. T-cell epitopes**


T-cell epitopes are the antigenic determinants of allergens that interact with T-lymphocytes via specific T-cell receptors. T-cell epitopes of allergens have been shown to be very important for the modulation of the allergic response, thereby contributing to the symptoms associated with allergic diseases [22]. Considering their fundamental role in the allergic response, they have enormous potential in the development of allergy vaccines as well as of newer strategies in allergen immunotherapy [23, 24]. Recent findings indicate that the T-cell epitope repertoire of allergens is more diverse than that of IgE epitopes and can be very useful in allergen-specific immunotherapy [25]. An analysis of available epitope data has shown that T-cell epitopes occur more commonly in airborne allergens than in food allergens [26].

### **2.3. Cross-reactivity**

Cross-reactivity denotes a clinically and immunologically critical phenomenon displayed by allergens from various sources and is the cause of pollen-food syndromes, such as the one seen in the case of birch and apple. Cross-reactivity is considered a property of antibodies, and it arises when an antibody or a subgroup of antibodies recognizes more than one allergen or epitope [27]. Two allergens are considered cross-reactive if they are recognized by a single antibody (or T-cell receptor). It has been stated that cross-reactivity among allergens at the level of B cells, T cells, and mast cells reflects clinical sensitivities and contributes very significantly to the regulation of allergic sensitization [28].

Cross-reactivity is predominantly an antibody-defined phenomenon, and IgE antibodies have been shown to be more cross-reactive in nature. The affinity of antibodies toward the allergen is known to play an important role in cross-reactivity. However, the properties of the allergenic protein are also very important, and shared features at the level of both the primary and tertiary structures of the cross-reactive proteins are found to be responsible for cross-reactivity [4]. Similarity at the sequence level is an important indicator, and cross-reactivity seems to require more than 70% sequence identity. In addition, other factors such as the host immune response against the allergen, the dosage of the allergen, and the mode of exposure also contribute to the clinical relevance of allergic cross-reactivity. Inferences drawn from studying a large number of allergens have led to the conclusion that structural similarity among proteins from diverse sources is the molecular basis of allergic cross-reactivity [29]. Considering the role it plays in the development of allergic symptoms, a detailed analysis of cross-reactivity has the potential to contribute to the development of new strategies for the diagnosis and therapy of allergic diseases.
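The >70% sequence-identity rule of thumb mentioned above can be illustrated with a simple pairwise comparison. The sketch below is a toy implementation: it computes percent identity over a basic Needleman–Wunsch global alignment with flat match/mismatch/gap scores, whereas real allergenicity pipelines use substitution matrices (e.g., BLOSUM) and dedicated tools such as FASTA or BLAST. The function name and score values are illustrative choices, not from the chapter:

```python
def percent_identity(a: str, b: str, match=1, mismatch=-1, gap=-2) -> float:
    """Percent identity over a simple Needleman-Wunsch global alignment.

    Illustrative only: production tools use substitution matrices
    (e.g. BLOSUM62) and affine gap penalties.
    """
    n, m = len(a), len(b)
    # Fill the dynamic-programming score matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back, counting identical positions over the alignment length.
    i, j, ident, aligned = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            ident += a[i - 1] == b[j - 1]
            i, j, aligned = i - 1, j - 1, aligned + 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i, aligned = i - 1, aligned + 1
        else:
            j, aligned = j - 1, aligned + 1
    return 100.0 * ident / aligned

# Two toy peptide sequences differing at a single position:
pid = percent_identity("MKVLATALLG", "MKVLSTALLG")
print(f"{pid:.0f}% identity")  # 90% identity
```

A screen based on this number alone would flag the pair above as potentially cross-reactive by the >70% heuristic; in practice such a flag is only a starting point for structural and clinical assessment.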

### **3. Allergen databases**

The last few years have witnessed substantial technological advances in the fields of genomics and proteomics, along with tremendous improvements in analytical methods. This has led to significant progress in the area of allergy research. As a result, there has been a steady and continuous increase in the number of characterized protein allergens over the last few years. Efficient storage and management of data has become very important because of this incessant accumulation of molecular and clinical data on allergens. Allergen databases therefore represent crucial resources for basic allergy research, as they archive the available allergen knowledge.



| Database | Developed by (URL) | Type of data archived | Computational tools (if any) | Updates |
|---|---|---|---|---|
| IUIS Allergen [36] | WHO/IUIS Allergen Nomenclature Subcommittee (http://www.allergen.org) | Sequence (isoallergens/isoforms), structure, allergenicity | – | Updated continuously |
| Allergome [39] | Centre for Clinical and Experimental Allergology, Italy (http://www.allergome.org) | Sequence (isoallergens/isoforms), structure, clinical, epidemiological, cross-reactivity, etc. | – | Updated continuously |
| Structural Database of Allergenic Proteins (SDAP) [41] | Sealy Centre for Structural Biology, University of Texas, USA (https://fermi.utmb.edu) | Sequence, structure, structural models, IgE epitopes | Yes | 2013 |
| Allergen Database for Food Safety (ADFS) [44] | National Institute of Health, Japan (http://allergen.nihs.go.jp/ADFS/) | Sequence, structure, IgE epitopes, small molecule allergens | Yes | 2016 |
| AllergenOnline [46] | Food Allergy Research and Resource Program (FARRP) (http://www.allergenonline.org) | Sequence, allergenicity | Yes | 2016 |
| AllFam [48] | Department of Pathophysiology and Allergy Research, Medical University of Vienna, Austria (http://www.meduniwien.ac.at/allfam) | Allergen family data, cross-link to Pfam database | – | 2011 |
| AllergenPro [53] | The National Agricultural | Sequence, IgE epitopes | Yes | 2015 |

**Table 1.** Summary of major allergen databases and their relevant features.

Many allergen-specific databases have been developed in the past few years, although they differ from each other with respect to their objectives, the type of data archived, the accessibility of contents, and the level of annotation and applications [30]. In addition to dedicated allergen databases, primary bioinformatics databases also document significant data on allergens. Examples include GenBank/GenPept [31, 32], UniProtKB [33], and the Protein Data Bank (PDB) [34], which archive sequence and structure data on allergens along with annotation. A summary of allergen-specific databases is provided in **Table 1**. The existing allergen databases are described in the following sections.

### **3.1. IUIS Allergen database**

The IUIS Allergen Nomenclature Sub-Committee, under the auspices of the WHO, provides the systematic nomenclature of allergenic proteins and has developed and maintains the Allergen database [35, 36]. The database archives all WHO/IUIS-recognized allergens along with their isoallergens and isoforms (variants). In order to maintain a consistent nomenclature for newly discovered allergens, researchers are required to submit newly described allergens to the Allergen Nomenclature Sub-Committee before submitting their manuscript to a journal for publication.

Each allergen in this database is provided with annotation that includes the biochemical name, molecular weight, information on its allergenicity, references, etc. Additionally, sequence data for allergens and isoallergens/isoforms are stored in the database, along with cross-references to GenBank [31], GenPept [32], and UniProtKB [33], as well as to the PDB [34], for nucleotide sequences, protein sequences, and 3D structure data, respectively. The Allergen database can be searched using the allergen name, biochemical name, allergen source organism, taxonomic group, etc., as search criteria. The database is updated continuously, with specific names assigned to newly discovered allergens and isoallergens/variants [37]. Although it documents the majority of characterized allergens, the Allergen database is not comprehensive, because a large number of allergens reported in the literature are not recognized by IUIS-Allergen. The database does not archive data on allergen epitopes and cross-reactivity.

### **3.2. Allergome database**

Allergome [38] is an extensive repository of information on allergen molecules causing IgE-mediated (allergic, atopic) diseases [39]. The database comprises comprehensive data on WHO/IUIS-approved allergens along with other non-recognized allergens. These allergenic molecules are selected and curated from the published literature and web-based resources. It also contains data on allergenic sources, categorized by whether or not they possess identified molecules. Allergome documents information on allergens and isoallergens/isoforms along with their sequences. Cross-links to sequence and structure databases such as UniProtKB [33] and PDB [34] are also provided.

Allergome can be searched using basic and advanced search options. The basic search employs numerous criteria such as allergen name, biochemical name, and source organism, while the advanced search enables the user to search using specific attributes. Each allergen molecule is represented by a monograph covering three parts: basic information, data on the native form, and data on its recombinant form. The most important and unique feature of the Allergome platform is the presence of several support modules that archive specific aspects of allergen data. Two important modules are RefArray, for easy access to references stored in Allergome, and Real Time Monitoring of IgE sensitization (ReTiME), for real-time collection and storage of IgE sensitization data; a number of other utilities are also available. Allergome is updated regularly, and allergen data curated from the literature are documented.

### **3.3. Structural database of allergenic proteins**

The Structural Database of Allergenic Proteins (SDAP) [40] is an allergen database that prominently deals with the structural aspects of allergens [41]. It houses comprehensive, cross-referenced sequence data on allergens, IgE-binding epitopes, and 3D structures and models of allergens. Each allergen in SDAP is provided with cross-links to primary databases such as UniProtKB [33] and PDB [34], as well as to important resources such as the NCBI Taxonomy Browser [42] and PubMed [42] for literature references. SDAP also functions as a web server that integrates various computational tools supporting structural-biology studies of allergens and their epitopes. It employs an algorithm based on the conserved properties of amino acid side chains to detect regions associated with allergenicity in novel sequences. The database includes a number of tools that can be used to assess the potential cross-reactivity of allergens and to help screen for IgE epitopes in novel proteins. The last update of the database was carried out on February 25, 2013. SDAP does not archive complete data for allergens that are not recognized by IUIS, and data on allergen cross-reactivity are also not documented.
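To give a flavor of side-chain-property-based screening of the kind SDAP performs, the sketch below compares peptide windows by a physicochemical property profile. It is a deliberate simplification: SDAP's actual property-distance measure combines several quantitative descriptors per residue, whereas this toy version uses a single well-known scale (Kyte–Doolittle hydropathy); the function names, toy sequences, and the cutoff are my own illustrative choices:

```python
# Kyte-Doolittle hydropathy scale (a single residue property; SDAP's real
# measure combines several quantitative descriptors per residue).
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}

def property_distance(pep1: str, pep2: str) -> float:
    """Euclidean distance between the hydropathy profiles of two
    equal-length peptides; a small distance means the peptides are
    physicochemically similar even if not identical."""
    if len(pep1) != len(pep2):
        raise ValueError("peptides must be the same length")
    return sum((KD[a] - KD[b]) ** 2 for a, b in zip(pep1, pep2)) ** 0.5

def scan_for_similar(epitope: str, protein: str, cutoff: float):
    """Slide the epitope along a protein and yield windows whose
    property profile lies within `cutoff` of the epitope's."""
    k = len(epitope)
    for i in range(len(protein) - k + 1):
        window = protein[i:i + k]
        d = property_distance(epitope, window)
        if d <= cutoff:
            yield i, window, d

# Toy query: find regions of a protein resembling a known epitope.
for pos, win, d in scan_for_similar("LIVK", "MKLIVRAGLLVK", cutoff=2.0):
    print(pos, win, round(d, 2))
```

A small property distance flags windows that are physicochemically similar to a known epitope even when the residues differ (e.g., conservative substitutions), which is the intuition behind such property-based screens.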

### **3.4. Allergen database for food safety**


The Allergen Database for Food Safety (ADFS) [43] was developed as a project of the Division of Biochemistry and Immunochemistry of the National Institute of Health Sciences (Japan). The aim of the database is to archive allergenic proteins and their IgE epitopes, with a special emphasis on food allergens and food safety [44]. Allergens archived in ADFS are grouped into eight categories: pollen, mite, animal, fungus, insect, food, latex, and others. Each allergen entry is provided with the primary database accession numbers of its genes and with 3D structure information. The database is also equipped with a homology-based sequence search tool for the evaluation of allergenicity. One of the most distinctive features of ADFS is the archival of data on small-molecule, nonprotein (chemical) allergens. The database does not archive data on allergen cross-reactivity.

### **3.5. AllergenOnline database**

AllergenOnline [45] is a well-curated allergen database that documents a peer-reviewed allergen list compiled from various resources such as IUIS-Allergen, PubMed, scientific publications, and other allergen databases. The database was developed within the Food Allergy Research and Resource Program (FARRP) at the University of Nebraska [46]. For each allergen, data on the source organism, common name, IUIS official nomenclature, protein length, and class of allergen (food allergen, contact allergen, etc.), as well as a link to the NCBI protein (GenPept) [32] database, are provided. AllergenOnline also provides a utility for sequence-based searches for allergens, which includes alignments by FASTA and an eight-amino-acid short-sequence identity search. This utility can be very useful in the identification of proteins that may present a potential risk of allergenic cross-reactivity. AllergenOnline is updated every year, and the last update, which resulted in version 16 of the database, was reported on January 27, 2016. It does not archive data on allergen epitopes or on allergenic cross-reactivity.

### **3.6. AllFam database**

AllFam [47] represents a very important resource for allergens as it classifies allergens into protein families [48]. This work has shown that allergens are distributed among relatively few protein families and possess a limited number of biochemical functions. The structural classification of allergens in AllFam is performed by using family information from Pfam [49] and the Structural Classification of Proteins (SCOP) database [50], while the biochemical functions of allergens were extracted from the Gene Ontology annotation database [51]. The database provides the option of browsing lists of allergen families based on allergen source (plants, animals, and fungi) and route of exposure (inhalation, ingestion, etc.), while searches for specific protein families can also be performed. Each allergen family in AllFam is linked to a family fact sheet that describes the biochemical properties of the family members as well as a list of key references related to the family. The last update of AllFam was reported on September 12, 2011. AllFam does not archive data on the molecular features of individual allergens, although cross-links to IUIS-Allergen and Allergome are provided for each documented allergen.

### **3.7. AllergenPro database**

AllergenPro [52] is a recently developed allergen database that archives data on allergen sequences, structures, and epitopes from various sources. It is an integrated database that provides information about allergens in foods, microorganisms, fungi, animals, and plants [53]. It provides a utility to search for allergens based on keywords as well as sequence. AllergenPro is also equipped with a computational tool for the prediction of allergenicity. Prediction is based on three approaches: a FAO/WHO guidelines (sequence)–based approach, a motif-based approach, and an epitope-based approach. The database was last updated on June 4, 2015. AllergenPro does not archive data on allergen cross-reactivity, and literature references for the documented allergens and epitopes are also not provided.

### **3.8. Archival of allergen epitope data**

As mentioned earlier, epitopes are a very important feature of allergens as they play a vital role in allergic diseases. Because the molecular characterization of allergens has risen immensely in recent years, the data on allergenic epitopes has also increased significantly. Therefore, it has become necessary to store and manage the epitope data for its efficient utilization.

Some of the existing allergen databases described above, such as SDAP, ADFS, and AllergenPro, are involved in the storage of allergen epitope data. There are a few databases dedicated to epitope data from all types of antigens, which also document information on allergen epitopes [54–56]. However, the allergy-associated epitope data stored in these databases may not be comprehensive. The Immune Epitope Database (IEDB) [57], which is a repository of immune epitope reactivity data, is also a major database of allergy-derived epitope data [58]. It archives extensive allergen epitope data along with the biological assays associated with them, including IgE-binding as well as T-cell epitopes curated and compiled from allergy-related references. IEDB is also equipped with several strategies for efficient searching and visualization of data on allergy-related epitopes [59]. Therefore, it represents a very useful and user-friendly platform for the community of allergists to access and retrieve allergy-related epitope data. In a study involving classification of all the epitope-specific literature in various immunological domains, it is stated that IEDB comprises relatively fewer references for allergy-derived epitopes as compared to cancer and infectious diseases [60]. This indicates that there is considerable scope for more in-depth archival of allergen epitope data from the literature. Another study, a meta-analysis of the allergy-associated epitope data in IEDB, has indicated that relatively less data is archived for allergen T-cell epitopes as compared to IgE epitopes [26].

### *3.8.1. AllerBase database*

Observations from the study of the existing allergen databases indicate that they archive significant data on various aspects of allergens and allergenicity, although the level of completeness of the data differs considerably across allergen features. AllerBase [61] is a recently developed comprehensive database of allergens and allergen features that addresses some of the limitations associated with the existing allergen databases [Kadam et al. 2016, unpublished]. The database comprises extensive data on experimentally validated allergens and allergen-specific features such as IgE-binding epitopes, IgE cross-reactivity, IgE antibodies, and evidence for the experimental validation of allergens. AllerBase provides basic and advanced search utilities, along with a browse option, to retrieve the desired allergen data. The Completeness Index, which represents the availability of data for the various features of each allergen, and a structure visualization utility are important features of the database. AllerBase also provides cross-references to several immunological and allergen databases and represents a notable instance of the integration of allergen data from a number of resources.

### **4. Computational prediction of allergens/allergenicity**


Allergens mainly comprise commonly occurring proteins in foods, pollens, and other biological entities in the environment. It has become necessary to assess the potential allergenicity of these proteins considering the health hazards associated with allergic reactions to them. In recent years, genetic engineering and food processing methods have been routinely employed for modifying existing proteins or introducing new ones. Analysis of the allergenicity of such proteins/products, along with newly introduced biopharmaceuticals, is absolutely essential in order to avoid the transfer of an allergenic molecule. Computational assessment or prediction of allergenicity represents the major approach to test for allergenicity, and numerous bioinformatics tools/methods have been employed successfully for this purpose [62]. The majority of these methods utilize the amino acid sequence of allergens along with its different features, while very few approaches use structural information. **Table 2** lists the computational tools/servers available for the prediction of allergens/allergenicity. In the following sections, the prominent approaches used for the computational assessment or prediction of allergens/allergenicity are described briefly.




| No. | Method (URL) | Approach used | Efficiency |
|---|---|---|---|
| 6 | WebAllergen (http://weballergen.bii.a-star.edu.sg/) [73] | Sequence motifs | – |
| 7 | AlgPred (http://www.imtech.res.in/raghava/algpred/) [75] | Sequence features and SVM, sequence motifs, epitopes, allergen representative peptides | Accuracy = 85%, SE = 88%, SP = 81% |
| 8 | AllerTOP (http://www.ddg-pharmfac.net/AllerTOP) [83] | Sequence based descriptors, auto and cross-covariance, machine learning | Accuracy = 85.3%, SE = 82.5%, SP = 88.1% |
| 9 | EVALLER (http://bioinformatics.bmc.uu.se/evaller.html) [86] | DFLAP algorithm and SVM | – |
| 10 | AllerHunter (http://tiger.dbs.nus.edu.sg/AllerHunter/) [89] | Iterative pairwise sequence similarity and SVM | *A*ROC = 0.928, accuracy = 95.3%, SE = 83.4%, SP = 96.4% |
| 11 | APPEL (http://jing.cz3.nus.edu.sg/cgi-bin/APPEL) [91] | Sequence based features, physicochemical descriptors, and SVM | MCC = 0.95, SE = 93%, SP = 99.9% |
| 12 | SORTALLER (http://sortaller.gzhmu.edu.cn/) [93] | AFFP dataset, normalized BLAST *E*-values and SVM | MCC = 0.97, SE = 98.6%, SP = 98.4% |
| 13 | PREAL (http://gmobl.sjtu.edu.cn/PREAL/index.php) [96] | Biochemical and sequence features, subcellular locations, mRMR, SVM | Accuracy = 93.42% |
| 14 | Allerdictor (http://allerdictor.vbi.vt.edu/) [99] | Sequences as text documents, Naive Bayes classifier and SVM | – |
| 15 | proAP (http://gmobl.sjtu.edu.cn/proAP/main.html) [100] | Integration of methods based on FAO/WHO guidelines, sequence motifs and SVM | – |
| 16 | AllergenFP (http://ddg-pharmfac.net/AllergenFP/) [103] | Auto and cross-covariance, descriptor-based fingerprints | Accuracy = 88%, MCC = 0.759 |

*A*ROC: area under the receiver-operating curve; SE: sensitivity; SP: specificity; MCC: Matthews correlation coefficient.

**Table 2.** List of computational tools/servers for allergen/allergenicity prediction.

### **4.1. Sequence similarity-based approaches**

One of the first studies dealing with the analysis of allergenicity was put forth by Metcalfe et al. [63], who proposed a decision tree–based approach for the allergenicity assessment of foods derived from genetically modified crops. The first computational approach for the assessment of allergenicity was provided by the "Codex Alimentarius Commission" of FAO/WHO [64, 65]. It stated that a protein can be regarded as an allergen if it contains an exact match of at least six contiguous amino acids, or shows more than 35% similarity over a window of 80 amino acids, when compared with the sequence of a known allergen. This approach has been widely used to predict allergenicity, and a number of web servers for allergen prediction are based on it. Allermatch [66], AllerTool [67], and AllergenPro [53] are some of the prominent web servers that employ these FAO/WHO guidelines for allergen prediction. Additionally, some of the major allergen databases, such as SDAP [41] and AllergenOnline [46], also utilize this strategy for allergenicity prediction. A recent study by Verma et al. [68] has shown that the sequence similarity–based approach gives substantially better results when used in combination with other bioinformatics methods. However, results obtained in several studies indicate that approaches based on these guidelines are not highly efficient at identifying allergenic proteins and often lead to false or irrelevant allergenicity estimations [69–71]. As a result of these observations, it became necessary to discover and employ other strategies for the prediction of allergenicity.
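The FAO/WHO rule amounts to two simple sequence checks, which can be sketched as follows (a minimal illustration assuming ungapped window comparison; production servers such as Allermatch use FASTA alignments for the 80-residue windows, and sequences shorter than 80 residues are compared whole):

```python
# Sketch of the FAO/WHO Codex allergenicity screen: flag a query when it
# shares a 6-mer exact match with a known allergen, or exceeds 35% identity
# over an (ungapped, for simplicity) 80-residue window.

def has_six_mer_match(query: str, allergen: str) -> bool:
    """Exact match of any 6 contiguous residues of the query in the allergen."""
    kmers = {allergen[i:i + 6] for i in range(len(allergen) - 5)}
    return any(query[i:i + 6] in kmers for i in range(len(query) - 5))

def max_window_identity(query: str, allergen: str, window: int = 80) -> float:
    """Best percent identity over aligned windows (whole sequence if shorter)."""
    best = 0.0
    for qi in range(max(1, len(query) - window + 1)):
        q = query[qi:qi + window]
        for ai in range(max(1, len(allergen) - window + 1)):
            a = allergen[ai:ai + window]
            n = min(len(q), len(a))
            best = max(best, sum(q[k] == a[k] for k in range(n)) / n * 100)
    return best

def fao_who_flag(query: str, allergen: str) -> bool:
    """Potentially allergenic per the 6-mer or >35%/80-aa criteria."""
    return has_six_mer_match(query, allergen) or max_window_identity(query, allergen) > 35.0
```

In practice a query is screened against every sequence in an allergen database, not a single allergen as here.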

### **4.2. Motif-based approaches**

In a study carried out by Stadler and Stadler [71], it was observed that the use of sequence motifs, which represent the secondary structures of proteins, performs significantly better than the approach based on the FAO/WHO guidelines. This method employs MEME motifs of a length of 50 residues for the prediction of allergenicity by using pairwise sequence alignment with a certain threshold. WebAllergen [72] is a web server for the prediction of allergenic proteins that is likewise based on specific detectable allergenic motifs in known allergens [73]. Furthermore, a study carried out by Kong et al. showed that an approach based on a search for multiple motifs is more specific and efficient than the conventional single-motif search [74]. AlgPred [75] and AllergenPro [53] are important web servers for allergen prediction in which one of the prediction approaches is based on allergen-derived motifs. A recent study that employed computational approaches to compare allergens and metazoan parasite proteins stated that significant sequence and structure similarity exists between parasite proteins and allergenic proteins [76]. The analysis was carried out using sequence and structural motifs in allergens, and a workflow was developed for the computational analysis of parasite proteins.
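As an illustration of the motif idea (not the published MEME-based implementation), a motif screen reduces to scanning a query against a library of allergen-derived patterns; the two regular-expression motifs below are invented purely for the example:

```python
# Toy motif-based allergen screen: the query is flagged when it matches any
# pattern from a (hypothetical) library of allergen-derived motifs.
import re

ALLERGEN_MOTIFS = [            # invented motifs, for illustration only
    r"C.{2}C.{10,20}C.{2}C",   # a cysteine-spacing pattern
    r"G[AG]PP[AG]",            # a short proline-rich fragment
]

def matches_allergen_motif(seq: str) -> bool:
    """True if any motif from the library occurs in the sequence."""
    return any(re.search(m, seq) for m in ALLERGEN_MOTIFS)
```

Real motif methods derive their patterns statistically (e.g., with MEME) from curated allergen sets rather than hand-writing them.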

### **4.3. Machine learning–based approaches**

Recent years have witnessed tremendous increase in the application of machine learning methods for solving biological problems. Machine learning–based approaches have been widely used for predicting various aspects of protein function [77]. These methods are also employed routinely for the development of algorithms to predict allergenicity of novel proteins.

Although the Support Vector Machine (SVM) is the most commonly used machine learning method for allergen prediction, other methods have also been frequently employed. One of the earliest methods was developed by Zorzet et al. [78], who utilized a k-Nearest-Neighbor (kNN) classification algorithm for the prediction of allergenicity, while a Bayesian classifier was employed by Soeria-Atmadja et al. [79] for the same purpose. An approach based on the combination of a hidden Markov model (HMM) and conserved motifs in allergens was also used to successfully predict protein allergenicity [80]. Dimitrov et al. [81] developed two artificial neural network (ANN)-based algorithms for allergenicity prediction, which utilize descriptors derived from amino acids that denote their structural and physicochemical properties. AllerTOP [82] is an online bioinformatics tool for the computational prediction of allergens [83]. This algorithm employs descriptors that denote the chemical properties of amino acids in allergen sequences and auto- and cross-covariance transformation, along with several machine learning methods for classification: random forest, multilayer perceptron, logistic regression, decision tree, naïve Bayes, and kNN.

There are a number of web-based tools/servers that use SVM for the classification/prediction of allergens. AlgPred [84] is one of the earlier web servers developed for the prediction of allergenic proteins [75]. It employs SVM with amino acid and dipeptide composition as allergen features to achieve accuracies of 85.02 and 84.00%, respectively. EVALLER [85] is another web server created for the *in silico* determination of potential allergenicity with very good efficiency [86]. It performs detection by filtered length–adjusted allergen peptides (DFLAP) algorithm and SVM. The AllerTool [87] web server also applies an SVM-based algorithm for the prediction of allergenicity and provides a sensitivity and specificity of 86.00% [67]. AllerHunter [88] is an important web-based computational system for allergenicity assessment that uses a scheme based on iterative pairwise sequence similarity encoding along with SVM [89]. The method is very efficient, with a sensitivity of 83.4% and a specificity of 96.4%.
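The amino acid composition features behind such composition-based predictors can be sketched as below; a dependency-free nearest-centroid classifier stands in for the SVM, and the training sequences and labels are invented purely to make the example runnable:

```python
# 20-dimensional amino acid composition features, as used by composition-based
# allergen predictors; a nearest-centroid classifier stands in for the SVM.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Fraction of each of the 20 amino acids in the sequence."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def dist2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def predict(query, allergen_seqs, background_seqs):
    """1 if the query's composition is closer to the allergen centroid, else 0."""
    q = aa_composition(query)
    c_pos = centroid([aa_composition(s) for s in allergen_seqs])
    c_neg = centroid([aa_composition(s) for s in background_seqs])
    return 1 if dist2(q, c_pos) < dist2(q, c_neg) else 0
```

A real predictor would feed the same feature vectors to a trained SVM and use curated allergen/non-allergen datasets.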

The web-based tool APPEL [90] was developed for the prediction of allergenic proteins; it employs physicochemical and structural features derived from the allergen sequence in combination with SVM [91]. Zhang et al. developed an online allergen prediction tool titled SORTALLER [92], which is based on an allergen family featured peptide (AFFP) dataset and employs SVM as a classifier [93]. An algorithm developed by Mohabatkar et al. [94] for the prediction of allergenic proteins utilizes pseudo-amino acid composition (PseAAC) along with SVM and provides an accuracy of 91.19%. PREAL [95] is a web-based tool that performs allergen prediction by using SVM along with feature selection methods such as maximum relevance minimum redundancy (mRMR) and incremental feature selection (IFS) [96]. A combination of a hydrophobicity amino acid index and the discrete Fourier transform, along with an SVM classifier, is employed for highly efficient prediction of allergenicity in a signal-processing bioinformatics approach [97]. Allerdictor [98] is a web server that specializes in large-scale allergen discovery. It models protein sequences as text documents and employs SVM text classification for allergen prediction [99].

### **4.4. Other approaches**


A study by Wang et al. evaluated sequence-, motif-, and SVM-based approaches for the computational prediction of allergens and also performed parameter optimization to obtain better performance [100]. The resulting methods from this study are integrated and made available as a web application titled proAP [101]. AllergenFP [102] is a recently developed web server for allergenicity prediction that utilizes an alignment-free, descriptor-based fingerprint approach [103]. The descriptors used here are important properties of amino acids such as size, hydrophobicity, relative abundance, and helix- and beta-strand-forming propensities. In a structure-based approach proposed by Bragin et al. [104], information derived from protein 3D structure is used to represent the protein surface as patches designated as discontinuous peptides. Prediction of allergenic proteins based on this approach was observed to give better accuracy. Vijayakumar and Lakshmi developed a fuzzy inference system–based algorithm for allergenicity prediction that utilizes five different modules [105]. These modules consist of a machine learning classifier, motif search, sequence similarity, the FAO/WHO evaluation scheme, etc. FuzzyApp [106], a web server based on a fuzzy rule–based system, was then developed for the prediction of allergenicity [107]. Jiang et al. performed an analysis of food allergens using a computational model that simulates gastric fluid digestion [108]. This study stated that food allergens can be classified as alimentary canal-sensitized and nonalimentary canal-sensitized allergens based on their digestibility in simulated gastric fluid.
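The integrative schemes above (proAP, the fuzzy-rule systems) share a common shape: several independent modules each cast a vote on the query, and the votes are combined into one call. A toy sketch with hypothetical stand-in modules (none of these are the published implementations):

```python
# Majority-vote combination of independent allergenicity modules; each module
# below is an invented stand-in for a real predictor (FAO/WHO rule, motif
# search, trained classifier).

def module_fao_who(seq):       # stand-in: flags a poly-glutamine run
    return "QQQQQQ" in seq

def module_motif(seq):         # stand-in: flags a hypothetical motif
    return "GAPP" in seq

def module_ml(seq):            # stand-in for a trained machine learning model
    return seq.count("C") / max(len(seq), 1) > 0.05

def combined_prediction(seq, modules=(module_fao_who, module_motif, module_ml)):
    """Allergen call when a majority of modules agree."""
    votes = sum(bool(m(seq)) for m in modules)
    return votes >= 2
```

Fuzzy-rule systems generalize this by weighting graded module outputs instead of counting binary votes.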

### **5. Computational prediction of allergen epitopes**

Epitopes represent distinctive amino acid residues on antigens and are important determinants of an immune response. Identification of epitopes is considered a key aspect of designing highly effective multiple-subunit vaccines and developing efficient diagnostic and therapeutic methods against allergens. Although experimental methods have been very useful for the identification of epitopes, their usefulness is restricted by their time- and cost-intensive nature and their inability to deal with large-scale elucidation of epitopes. Hence, computational approaches are considered a very beneficial alternative as they are cost- and time-effective.

A large number of highly efficient algorithms and tools have been developed over the years for the computational prediction of epitopes. These methods deal with the prediction of both B-cell and T-cell epitopes, as well as sequential (linear) and discontinuous (conformational) epitopes. Based on the information (data) utilized for performing prediction, the methodologies can be grouped into sequence-based and structure-based approaches. Many sequence-based linear epitope prediction methods for B cells have been developed and used for a long time, and the majority of them are propensity scale– and machine learning–based methods [109, 110]. Some of the major tools/servers that deal with the prediction of linear B-cell epitopes are listed in **Table 3**.
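The propensity-scale idea can be sketched with a sliding-window score; here the negated Kyte-Doolittle hydropathy index stands in for hydrophilicity scales such as Parker's, and the window length and threshold are illustrative choices:

```python
# Propensity-scale linear B-cell epitope sketch: score each residue window by
# mean hydrophilicity (negated Kyte-Doolittle hydropathy) and report windows
# above a threshold as candidate epitope starts.

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}

def window_scores(seq, window=7):
    """Mean hydrophilicity (-hydropathy) for each window start position."""
    return [
        sum(-KYTE_DOOLITTLE[a] for a in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]

def candidate_epitopes(seq, window=7, threshold=1.5):
    """Start positions of windows whose mean score exceeds the threshold."""
    return [i for i, s in enumerate(window_scores(seq, window)) if s > threshold]
```

Charged, surface-exposed stretches (K/D/E-rich) score high; hydrophobic, likely buried stretches (I/L/V-rich) score low, which is the intuition behind such scales.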


| No. | Method (URL) | Approach used | Efficiency |
|---|---|---|---|
| 1 | ABCPred (http://www.imtech.res.in/raghava/abcpred/) [111] | Fixed length epitope patterns, recurrent ANNs | Accuracy = 65.93%, SE = 67.14%, SP = 64.71% |
| 2 | APCpred (http://ccb.bmi.ac.cn/APCpred/) [112] | Amino acid anchoring pair composition (APC) and SVM | *A*ROC = 0.809, accuracy = 74.5% |
| 3 | BCPreds (http://ailab.cs.iastate.edu/bcpreds/) [113] | SVM classifiers with string kernels | *A*ROC = 0.758 |
| 4 | BcePred (http://www.imtech.res.in/raghava/bcepred/) [114] | Physicochemical properties of epitope residues | Accuracy = 58.7% |
| 5 | BepiPred (http://www.cbs.dtu.dk/services/BepiPred/) [115] | Parker's hydrophilicity scale and HMM | – |
| 6 | BEST (http://biomine.ece.ualberta.ca/BEST/) [116] | Antigen sequence features, SVM | *A*ROC = 0.85 |
| 7 | Bayesb (http://www.immunopred.org/bayesb/index.html) [117] | Bayes feature extraction and SVM | *A*ROC = 0.84, accuracy = 72.94% |
| 8 | COBEpro (http://scratch.proteomics.ics.uci.edu) [118] | Antigen fragment score and SVM | *A*ROC = 0.829 |
| 9 | EPMLR (http://www.bioinfo.tsinghua.edu.cn/epitope/EPMLR/) [119] | Sequence features and multiple linear regression (MLR) | *A*ROC = 0.728, SE = 81.8%, SP = 64.1% |
| 10 | IEDB Analysis Resource (http://tools.iedb.org/bcell/) [120] | A collection of tools based on various methods | – |
| 11 | LBtope (http://www.imtech.res.in/raghava/lbtope/) [121] | Large datasets of epitopes, KNN, SVM | Accuracy = 86% |
| 12 | SVMTriP (http://sysbio.unl.edu/SVMTriP/) [122] | Tri-peptide similarity, propensity scores and SVM | *A*ROC = 0.702, SE = 80.1%, SP = 55.2% |

*A*ROC: area under the receiver-operating curve; SE: sensitivity; SP: specificity; MCC: Matthews correlation coefficient.

**Table 3.** List of major tools/servers for sequential (linear) epitope prediction.

A number of methods that utilize the 3D structure of antigens for discontinuous epitope prediction have also been developed. These methods use different approaches for prediction, such as solvent accessibility of surface residues [123, 124], solvent accessibility with propensity scores [125], and propensity scores with the packing density of amino acids [126]. An account of the major tools/servers involved in conformational epitope prediction is provided in **Table 4**.


*A*ROC: area under the receiver-operating curve; SE: sensitivity; SP: specificity; MCC: Matthews correlation coefficient.

**Table 4.** List of major tools/servers for conformational (discontinuous) epitope prediction.

Studies have shown that the analysis of antigen–antibody complex structures is very useful for the characterization of conformational epitopes [134]. A dedicated resource titled AgAbDb [135] that archives the interactions derived from antigen–antibody complexes is available, which can be very useful for the analysis of epitopes [20, 21]. Several algorithms have also been developed for the prediction of T-cell epitopes in antigens. These methodologies deal with the prediction of peptides that possess the ability to interact with specific major histocompatibility complex (MHC) molecules [136]. Machine learning–based approaches are very commonly employed for this purpose and are found to be very efficient [137]. The details of epitope prediction methods/tools for B cells and T cells have been reviewed elsewhere [138, 139]. Some of the important tools/servers that perform the prediction of T-cell epitopes are listed in **Table 5**. Recently, it has been shown that epitope prediction can be performed over the whole proteome by integrating multiple epitope prediction methods [149]. Antibody-specific epitope prediction has emerged as a significant alternative to the traditional antibody-independent epitope prediction methods [150].


**Table 5.** List of major tools/servers for T-cell epitope prediction.
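Many matrix-based peptide-MHC binding predictors behind such tools score each candidate 9-mer by summing per-position residue weights; the anchor-position weights below are invented for illustration (real tools derive their matrices or machine learning models from measured binding data):

```python
# Toy position-weight scoring of 9-mer peptides for MHC binding: sum the
# per-position residue weights of a (hypothetical) matrix and rank all 9-mers
# of a protein by score.

ANCHOR_WEIGHTS = {               # invented matrix: only anchor positions scored
    1: {"L": 2.0, "M": 1.5},     # position-2 anchor (0-based index 1)
    8: {"V": 2.0, "L": 1.5},     # C-terminal anchor (0-based index 8)
}

def score_9mer(pep):
    """Sum of per-position residue weights for a 9-mer peptide."""
    return sum(ANCHOR_WEIGHTS.get(i, {}).get(a, 0.0) for i, a in enumerate(pep))

def best_epitopes(seq, top=3):
    """Rank all 9-mers of a protein by predicted binding score."""
    mers = [seq[i:i + 9] for i in range(len(seq) - 8)]
    return sorted(mers, key=score_9mer, reverse=True)[:top]
```

Integrated pipelines extend this single step with proteasomal cleavage and TAP transport terms, as described for several of the listed methods.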

Epitopes represent critical components of allergens from the perspective of allergic reactions and the development of new diagnosis and treatment strategies. Therefore, the computational prediction of these epitopes in allergens is of immense importance. Due to the limitations associated with the detailed archival of allergen epitope data and the highly heterogeneous nature of the data, the number of tools available for allergen epitope prediction is far smaller than the number of tools available for allergen/allergenicity prediction. Therefore, the general epitope prediction methods listed above can be employed for epitope prediction studies in allergens. Kleter and Peijnenburg [70] developed a strategy to screen for potential linear IgE epitopes using sequence comparison with a minimal length of six amino acids. The approach was moderately effective, and it showed that further verification of the IgE binding of epitopes by experimental tests is necessary. AlgPred [84], developed by Saha and Raghava, is one of the first and major tools for the computational assessment of IgE epitopes [75]. Here, a database of known IgE epitopes was created and used to accurately predict allergenic proteins. AllerPred is an SVM-based computational system for the assessment of overlapping continuous and discontinuous B-cell epitope binding patterns in allergenic proteins [151]. This approach has been used successfully to predict the allergenicity of novel proteins. Dall'Antonia et al. [152] developed a software tool titled Surface comparison–based Prediction of Allergenic Discontinuous Epitopes (SPADE). The algorithm consists of a structure-based comparison of allergen surfaces and IgE cross-reactivity data and is able to predict IgE epitopes from three important allergen families. A recent work by Lollier et al. [153] on meta-analysis of IgE-binding epitopes provided some important findings regarding these epitopes. They computed the fraction of allergen amino acids that are involved in epitopes and modeled a relationship between the rising number of literature references and the amino acid fractions to assess the possibility of binary classification of epitopes and nonepitopes. A web-based tool, LocAllEpi [154], has also been developed for the visualization of allergen epitopes along the protein sequence and their structural features.

### **6. Computational prediction of allergenic cross-reactivity**

of the important tools/servers that perform the prediction of T-cell epitopes are listed in **Table 5**. Recently, it has been shown that epitope prediction can be performed over the whole proteome by integrating multiple epitope prediction methods [149]. Antibody-specific epitope prediction has emerged as a significant alternative to the traditional antibody-independent

**No. Method (URL) Approach used Efficiency**

SVM, ANN

residues

Epitopes represent critical components of allergens from the perspective of allergic reactions and development of new diagnosis and treatment strategies. Therefore, the computational prediction of these epitopes in allergens is of immense importance. Due to limitations associ‐ ated with detailed archival of allergen epitope data and highly heterogeneous nature of the data, the number of tools available for allergen epitope prediction are far less, especially when compared with the number of tools available for allergen/allergenicity prediction. Therefore, the general epitope prediction methods listed above can be employed for epitope prediction

integrated approach

QSAR approach based on proteochemometrics

Integrated method employing proteasomal cleavage, TAP transport efficiency, and MHC class I binding affinity

A method for all HLA class II molecules based on peptidebinding MHC environment

Based on specificity-determining

Scoring system based on position of

Algorithm based on HLA-DR binding

Combination of methods based on proteasomal cleavage, TAP transport and MHC binding

residue in the epitopes

pocket similarity

Cytotoxic T-lymphocyte epitopes,

Multi-step algorithm that employs

Accuracy = 75.2%

Accuracy = 60%

Accuracy = 89%

*A*ROC = 0.95

*A*ROC = 0.807

–

–

–

–

epitope prediction methods [150].

70 Bioinformatics - Updated Features and Applications

1 CTLPred (http://www.imtech.res. in/raghava/ctlpred/index.html) [140]

2 EpiJen (http://www.ddg-pharmfac. net/epijen/EpiJen/EpiJen.htm) [141]

3 EpiTOP (http://www.pharmfac. net/EpiTOP/) [142]

4 NetCTLpan (http://www.cbs.dtu. dk/services/NetCTLpan/) [143]

6 PREDIVAC (http://predivac. biosci.uq.edu.au/) [145]

5 NetMHCIIpan-3.0 (http://www.cbs.dtu.dk/ services/NetMHCIIpan-3.0/) [144]

7 SYFPEITHI (http://www.syfpeithi.de/bin/ MHCServer.dll/EpitopePrediction.htm) [146]

8 TEPITOPEpan (http://www.biokdd.fudan. edu.cn/Service/TEPITOPEpan/) [147]

9 WAPP (http://abi.inf.uni-tuebingen.de/

*A*ROC: area under the receiver-operating curve.

**Table 5.** List of major tools/servers for T-cell epitope prediction.

Services/WAPP/) [148]
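The six-residue sequence comparison of Kleter and Peijnenburg reduces to a simple k-mer lookup. The sketch below is a minimal illustration of that idea, not their actual implementation; the function name and the epitope/query sequences are hypothetical, and a real screen would draw known IgE-binding epitopes from a curated database:

```python
def find_shared_hexamers(query, epitopes, k=6):
    """Report positions where `query` shares a contiguous k-mer
    (default six residues) with any known IgE-binding epitope."""
    # Index every k-mer occurring in the known epitope sequences.
    epitope_kmers = {ep[i:i + k] for ep in epitopes
                     for i in range(len(ep) - k + 1)}
    hits = []
    for i in range(len(query) - k + 1):
        kmer = query[i:i + k]
        if kmer in epitope_kmers:
            hits.append((i, kmer))  # 0-based start position in query
    return hits

# Hypothetical sequences, for illustration only.
known_epitopes = ["GDSTRTQ", "AQQWLHK"]
query_protein = "MKTAQQWLHKEEGDSTRTQV"
print(find_shared_hexamers(query_protein, known_epitopes))
# -> [(3, 'AQQWLH'), (4, 'QQWLHK'), (12, 'GDSTRT'), (13, 'DSTRTQ')]
```

As the original study found, such exact six-mer matches are sensitive but unspecific, which is why experimental verification of predicted IgE binding remains necessary.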

Cross-reactivity plays an important role in allergic reactions in both immunological and clinical contexts. Therefore, the computational prediction of allergenic cross-reactivity is considered of substantial significance. In most cases, the prediction of cross-reactivity in allergens goes hand in hand with the prediction of allergenicity, mainly because the antigenic determinants that contribute to cross-reactivity in allergens are also responsible for their allergenicity. As a result, many of the tools/algorithms developed for the prediction of allergens/allergenicity also perform cross-reactivity prediction.

The criteria defined by FAO/WHO experts, mentioned earlier, help to identify cross-reactivity in allergens [155]. AllerTool [87] is a web server that performs cross-reactivity prediction based on amino acid sequence and WHO/FAO guidelines [67]. It also provides a graphical representation of published and predicted cross-reactivity patterns of allergens. Stadler and Stadler [71] developed a sequence-based approach and showed that a motif-based strategy provides better results for the computational assessment of cross-reactivity than the FAO/WHO guidelines. SDAP [40], the specialized allergen database described earlier, also comprises a sequence-based tool for the identification of cross-reactivity among allergens [41]. AllerHunter [88] is an SVM-based web server for the efficient assessment of allergenic cross-reactivity in proteins [89]. A recently developed fuzzy inference system–based algorithm for allergenicity prediction can also predict cross-reactivity in allergens [105].
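The FAO/WHO sequence criteria boil down to two checks: greater than 35% identity to a known allergen over a sliding window of 80 amino acids, or a stretch of six contiguous identical residues. Below is a minimal sketch of the windowed-identity check; it compares ungapped sequences position by position for simplicity, whereas the guidelines call for a proper alignment (e.g., with FASTA), and all names and sequences here are illustrative:

```python
def max_window_identity(query, allergen, window=80):
    """Highest fractional identity over any fixed window of an
    ungapped, position-by-position pairing of the two sequences.
    (The FAO/WHO procedure aligns first, e.g. with FASTA; this
    simplified sketch skips the alignment step.)"""
    length = min(len(query), len(allergen))
    if length == 0:
        return 0.0
    w = min(window, length)  # short sequences use their full length
    best = 0.0
    for start in range(length - w + 1):
        matches = sum(query[start + j] == allergen[start + j] for j in range(w))
        best = max(best, matches / w)
    return best

def flags_cross_reactivity(query, allergen, threshold=0.35):
    """True when the pair exceeds the FAO/WHO 35% identity criterion."""
    return max_window_identity(query, allergen) > threshold

# Toy sequences (70% identical over their full length).
print(flags_cross_reactivity("MKVLATGQAL", "MKVLSTGQTT"))  # True
```

The companion six-contiguous-identical-residue criterion is an exact substring match and needs no windowing.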

### **7. Future perspectives and challenges**

Allergy represents a serious problem, as allergic diseases affect millions of people worldwide. Advances in genomic, proteomic, and analytical techniques have generated large amounts of data related to allergy and allergens. Archiving and analyzing these data is a major challenge in allergen bioinformatics. Data integration is one of the key obstacles to the efficient and useful storage of allergy-associated data, mainly because the data are heterogeneous, deriving from sources as varied as molecular data from the experimental characterization of allergic reactions and clinical and epidemiological data from patients and populations. Bioinformatics resources and tools have an important role to play in overcoming this problem. Given the ever-expanding volume of data, it is vital to develop databases and resources that integrate information from different sources, including the literature, and provide rapid access to it. Analysis of such data can yield important insights into allergic reactions. Structural features of allergens contribute significantly to their allergenicity, and this knowledge can be employed to develop more efficient methods for allergen/allergenicity and allergic cross-reactivity prediction. Recent advances in epitope prediction focus on antibody-specific approaches [150]. Applying such approaches to the prediction of IgE-binding epitopes will be extremely important for developing newer and more effective strategies for the diagnosis and treatment of allergic diseases. Allergen immunotherapy (AIT), an individualized, allergen-based treatment approach, has been considered a prototype of precision medicine or personalized medicine [156]. Bioinformatics has the potential to play an important role in the development of novel approaches in AIT and to further enrich the field of allergen informatics. This will surely aid in gaining a better understanding of allergic diseases and positively influence upcoming research in the field.

### **Acknowledgements**

The work was supported under the Senior Research Fellowship (SRF) granted to KK by the University Grants Commission (UGC), New Delhi, India. UKK and SS would like to acknowledge the Centre of Excellence grant from the Department of Biotechnology (DBT), New Delhi, India. The authors would also like to acknowledge the Bioinformatics Centre, Savitribai Phule Pune University, for providing the infrastructure and resources.

### **Author details**

Kiran Kadam1, Sangeeta Sawant1, V.K. Jayaraman2 and Urmila Kulkarni-Kale1\*

\*Address all correspondence to: urmila@bioinfo.net.in

1 Bioinformatics Centre, Savitribai Phule Pune University, Pune, India

2 Shiv Nadar University, Gautam Buddha Nagar, Uttar Pradesh, India

### **References**



[14] Pomés A. Relevant B cell epitopes in allergic disease. Int Arch Allergy Immunol. 2010;152:1–11. DOI: 10.1159/000260078

[15] Bufe A. Significance of IgE-binding epitopes in allergic disease. J Allergy Clin Immunol. 2001;107:219–221. DOI: 10.1067/mai.2001.112850

[16] Aalberse RC, Crameri R. IgE-binding epitopes: a reappraisal. Allergy. 2011;66:1261–1274. DOI: 10.1111/j.1398-9995.2011.02656.x

[17] Bannon GA, Ogawa T. Evaluation of available IgE-binding epitope data and its utility in bioinformatics. Mol Nutr Food Res. 2006;50:638–644. DOI: 10.1002/mnfr.200500276

[18] Matsuo H, Yokooji T, Taogoshi T. Common food allergens and their IgE-binding epitopes. Allergol Int. 2015;64:332–343. DOI: 10.1016/j.alit.2015.06.009

[19] Meno KH. Allergen structures and epitopes. Allergy. 2011;66:19–21. DOI: 10.1111/j.1398-9995.2011.02625.x

[20] Ghate AD, Bhagwat BU, Bhosle SG, Gadepalli SM, Kulkarni-Kale UD. Characterization of antibody-binding sites on proteins: development of a knowledgebase and its applications in improving epitope prediction. Protein Pept Lett. 2007;14:531–535. DOI: 10.2174/092986607780989921

[21] Kulkarni-Kale U, Raskar-Renuse S, Natekar-Kalantre G, Saxena SA. Antigen–antibody interaction database (AgAbDb): a compendium of antigen–antibody interactions. In: De RK, Tomar N, editors. Immunoinformatics. New York: Springer; 2014. pp. 149–164. DOI: 10.1007/978-1-4939-1115-8\_8

[22] Cavkaytar O, Akdis CA, Akdis M. Modulation of immune responses by immunotherapy in allergic diseases. Curr Opin Pharmacol. 2014;17:30–37. DOI: 10.1016/j.coph.2014.07.003

[23] Larché M. T cell epitope-based allergy vaccines. In: Valenta R, Coffman R, editors. Vaccines against Allergies. Berlin, Heidelberg: Springer; 2011. pp. 107–119. DOI: 10.1007/82\_2011\_131

[24] Prickett SR, Rolland JM, O'Hehir RE. Immunoregulatory T cell epitope peptides: the new frontier in allergy therapy. Clin Exp Allergy. 2015;45:1015–1026. DOI: 10.1111/cea.12554

[25] Frazier A, Schulten V, Hinz D, Oseroff C, Sidney J, Peters B, Sette A. Allergy-associated T cell epitope repertoires are surprisingly diverse and include non-IgE reactive antigens. World Allergy Organ J. 2014;7:26. DOI: 10.1186/1939-4551-7-26

[26] Vaughan K, Greenbaum J, Kim Y, Vita R, Chung J, Peters B, Broide D, Goodman R, Grey H, Sette A. Towards defining molecular determinants recognized by adaptive immunity in allergic disease: an inventory of the available data. J Allergy (Cairo). 2011;2010:628026. DOI: 10.1155/2010/628026

[27] Aalberse RC. Assessment of allergen cross-reactivity. Clin Mol Allergy. 2007;5:2. DOI: 10.1186/1476-7961-5-2


[41] Ivanciuc O, Schein C, Braun W. SDAP: database and computational tools for allergenic proteins. Nucl Acids Res. 2003;31:359–362. DOI: 10.1093/nar/gkg010

[42] NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucl Acids Res. 2015;43:D6–D17. DOI: 10.1093/nar/gku1130

[43] Allergen Database for Food Safety (ADFS). Available from: http://allergen.nihs.go.jp/ADFS/ [Accessed: 2016-06-01].

[44] Nakamura R, Teshima R, Takagi K, Sawada J. Development of Allergen Database for Food Safety (ADFS): an integrated database to search allergens and predict allergenicity. Kokuritsu Iyakuhin Shokuhin Eisei Kenkyujo hokoku. 2004;123:32–36.

[45] AllergenOnline. Available from: http://www.allergenonline.org [Accessed: 2016-03-07].

[46] Goodman RE, Ebisawa M, Ferreira F, Sampson HA, van Ree R, Vieths S, et al. AllergenOnline: a peer-reviewed, curated allergen database to assess novel food proteins for potential cross-reactivity. Mol Nutr Food Res. 2016;60:1183–98. DOI: 10.1002/mnfr.201500769

[47] AllFam. Available from: http://www.meduniwien.ac.at/allfam [Accessed: 2016-03-07].

[48] Radauer C, Bublin M, Wagner S, Breiteneder H. Allergens are distributed into few protein families and possess a restricted number of biochemical functions. J Allergy Clin Immunol. 2008;121:847–852. DOI: 10.1016/j.jaci.2008.01.025

[49] Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, Sonnhammer EL, Tate J, Punta M. Pfam: the protein families database. Nucl Acids Res. 2013;42:D222–D230. DOI: 10.1093/nar/gkt1223

[50] Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995;247:536–540. DOI: 10.1016/S0022-2836(05)80134-2

[51] Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucl Acids Res. 2004;32:D258–D261. DOI: 10.1093/nar/gkh036

[52] AllergenPro. Available from: http://nabic.rda.go.kr/allergen [Accessed: 2016-06-01].

[53] Kim CK, Seol YJ, Lee DJ, Jeong IS, Yoon UH, Lee JY, Lee GS, Park DS. AllergenPro: an integrated database for allergenicity analysis and prediction. Bioinformation. 2014;10:378. DOI: 10.6026/97320630010378

[54] Saha S, Bhasin M, Raghava GP. Bcipep: a database of B-cell epitopes. BMC Genomics. 2005;6:79. DOI: 10.1186/1471-2164-6-79

[55] Huang J, Honda W. CED: a conformational epitope database. BMC Immunol. 2006;7:7. DOI: 10.1186/1471-2172-7-7

[56] Sharma OP, Das AA, Krishna R, Suresh Kumar M, Mathur PP. Structural epitope database (SEDB): a web-based database for the epitope, and its intermolecular interaction along with the tertiary structure information. J Proteomics Bioinform. 2012;5:84–89. DOI: 10.4172/jpb.1000217

[57] The Immune Epitope Database (IEDB). 2010. Available from: http://www.iedb.org [Accessed: 2016-03-07].


[68] Verma A, Misra A, Subash S, Das M, Dwivedi P. Computational allergenicity prediction of transgenic proteins expressed in genetically modified crops. Immunopharmacol Immunotoxicol. 2011;33:410–422. DOI: 10.3109/08923973.2010.523704

[69] Hileman RE, Silvanovich A, Goodman RE, Rice EA, Holleschak G, Astwood JD, Hefle SL. Bioinformatic methods for allergenicity assessment using a comprehensive allergen database. Int Arch Allergy Immunol. 2002;128:280–291. DOI: 10.1159/000063861

[70] Kleter GA, Peijnenburg AA. Screening of transgenic proteins expressed in transgenic food crops for the presence of short amino acid sequences identical to potential, IgE-binding linear epitopes of allergens. BMC Struct Biol. 2002;2:8. DOI: 10.1186/1472-6807-2-8

[71] Stadler MB, Stadler BM. Allergenicity prediction by protein sequence. FASEB J. 2003;17:1141–1143. DOI: 10.1096/fj.02-1052fje

[72] The web tool WebAllergen (http://weballergen.bii.a-star.edu.sg/) could not be accessed on 2016-06-01.

[73] Riaz T, Hor HL, Krishnan A, Tang F, Li KB. WebAllergen: a web server for predicting allergenic proteins. Bioinformatics. 2005;21:2570–2571. DOI: 10.1093/bioinformatics/bti356

[74] Kong W, Tan TS, Tham L, Choo KW. Improved prediction of allergenicity by combination of multiple sequence motifs. In Silico Biol. 2007;7:77–86.

[75] Saha S, Raghava GP. AlgPred: prediction of allergenic proteins and mapping of IgE epitopes. Nucl Acids Res. 2006;34:W202–W209. DOI: 10.1093/nar/gkl343

[76] Tyagi N, Farnell EJ, Fitzsimmons CM, Ryan S, Tukahebwa E, Maizels RM, Dunne DW, Thornton JM, Furnham N. Comparisons of allergenic and metazoan parasite proteins: allergy the price of immunity. PLoS Comput Biol. 2015;11:e1004546. DOI: 10.1371/journal.pcbi.1004546

[77] Kadam K, Sawant S, Kulkarni-Kale U, Jayaraman VK. Prediction of protein function based on machine learning methods: an overview. In: iConcept Press, editor. Genomics III - Methods, Techniques and Applications. Hong Kong: iConcept Press; 2014. pp. 125–162. ISBN: 978-1-922227-096

[78] Zorzet A, Gustafsson M, Hammerling U. Prediction of food protein allergenicity: a bioinformatic learning systems approach. In Silico Biol. 2002;2:525–534.

[79] Soeria-Atmadja D, Zorzet A, Gustafsson MG, Hammerling U. Statistical evaluation of local alignment features predicting allergenicity using supervised classification algorithms. Int Arch Allergy Immunol. 2004;133:101–112. DOI: 10.1159/000076382

[80] Li KB, Issac P, Krishnan A. Predicting allergenic proteins using wavelet transform. Bioinformatics. 2004;20:2572–2578. DOI: 10.1093/bioinformatics/bth286

[81] Dimitrov I, Naneva L, Bangov I, Doytchinova I. Allergenicity prediction by artificial neural networks. J Chemom. 2014;28:282–286. DOI: 10.1002/cem.2597


[96] Wang J, Zhang D, Li J. PREAL: prediction of allergenic protein by maximum Relevance Minimum Redundancy (mRMR) feature selection. BMC Syst Biol. 2013;7:S9. DOI: 10.1186/1752-0509-7-S5-S9

[97] Chrysostomou C, Seker H. Prediction of protein allergenicity based on signal-processing bioinformatics approach. In: Proceedings of the Engineering in Medicine and Biology Society (EMBC), 36th Annual International Conference of the IEEE 2014. Chicago: IEEE; 2014. pp. 808–811. DOI: 10.1109/EMBC.2014.6943714

[98] Allerdictor. Available from: http://allerdictor.vbi.vt.edu/ [Accessed: 2016-03-07].

[99] Dang HX, Lawrence CB. Allerdictor: fast allergen prediction using text classification techniques. Bioinformatics. 2014;30:1120–1128. DOI: 10.1093/bioinformatics/btu004

[100] Wang J, Yu Y, Zhao Y, Zhang D, Li J. Evaluation and integration of existing methods for computational prediction of allergens. BMC Bioinformatics. 2013;14:S1. DOI: 10.1186/1471-2105-14-S4-S1

[101] proAP. Available from: http://gmobl.sjtu.edu.cn/proAP/main.html [Accessed: 2016-03-07].

[102] AllergenFP. Available from: http://ddg-pharmfac.net/AllergenFP/ [Accessed: 2016-03-07].

[103] Dimitrov I, Naneva L, Doytchinova I, Bangov I. AllergenFP: allergenicity prediction by descriptor fingerprints. Bioinformatics. 2014;30:846–851. DOI: 10.1093/bioinformatics/btt619

[104] Bragin AO, Demenkov PS, Kolchanov NA, Ivanisenko VA. Accuracy of protein allergenicity prediction can be improved by taking into account data on allergenic protein discontinuous peptides. J Biomol Struct Dyn. 2013;31:59–64. DOI: 10.1080/07391102.2012.691362

[105] Vijayakumar S, Lakshmi PTV. A fuzzy inference system for predicting allergenicity and allergic cross-reactivity in proteins. In: Proceedings of the Bioinformatics and Biomedicine (BIBM), IEEE International Conference; 18–21 December 2013. Shanghai: IEEE; 2013. pp. 49–52. DOI: 10.1109/BIBM.2013.6732458

[106] FuzzyApp. Available from: http://fuzzyapp.bicpu.edu.in/fuzzyapp.php [Accessed: 2016-03-07].

[107] Saravanan V, Lakshmi PT. Fuzzy logic for personalized healthcare and diagnostics: FuzzyApp—a fuzzy logic based allergen-protein predictor. Omics. 2014;18:570–581. DOI: 10.1089/omi.2014.0021

[108] Jiang B, Qu H, Hu Y, Ni T, Lin Z. Computational analysis of the relationship between allergenicity and digestibility of allergenic proteins in simulated gastric fluid. BMC Bioinformatics. 2007;8:375. DOI: 10.1186/1471-2105-8-375

[109] Yang X, Yu X. An introduction to epitope prediction methods and software. Rev Med Virol. 2009;19:77–96.


[123] Kolaskar AS, Kulkarni-Kale U. Prediction of three-dimensional structure and mapping of conformational epitopes of envelope glycoprotein of Japanese encephalitis virus. Virology. 1999;261:31–42. DOI: 10.1006/viro.1999.9859

[124] Kulkarni-Kale U, Bhosle S, Kolaskar AS. CEP: a conformational epitope prediction server. Nucl Acids Res. 2005;33:W168–W171. DOI: 10.1093/nar/gki460

[125] Kringelum JV, Lundegaard C, Lund O, Nielsen M. Reliable B cell epitope predictions: impacts of method development and improved benchmarking. PLoS Comput Biol. 2012;8:e1002829. DOI: 10.1371/journal.pcbi.1002829

[126] Qi T, Qiu T, Zhang Q, Tang K, Fan Y, Qiu J, Wu D, Zhang W, Chen Y, Gao J, Zhu R. SEPPA 2.0—more refined server to predict spatial epitope considering species of immune host and subcellular localization of protein antigen. Nucl Acids Res. 2014;42:W59–W63. DOI: 10.1093/nar/gku395

[127] Sweredoski MJ, Baldi P. PEPITO: improved discontinuous B-cell epitope prediction using multiple distance thresholds and half sphere exposure. Bioinformatics. 2008;24:1459–1460. DOI: 10.1093/bioinformatics/btn199

[128] Giacò L, Amicosante M, Fraziano M, Gherardini PF, Ausiello G, Helmer-Citterich M, Colizzi V, Cabibbo A. B-Pred, a structure based B-cell epitopes prediction server. Adv Appl Bioinform Chem. 2012;5:11–21. DOI: 10.2147/AABC.S30620

[129] Ansari HR, Raghava GP. Identification of conformational B-cell epitopes in an antigen from its primary sequence. Immunome Res. 2010;6:6. DOI: 10.1186/1745-7580-6-6

[130] Ponomarenko J, Bui HH, Li W, Fusseder N, Bourne PE, Sette A, Peters B. ElliPro: a new structure-based tool for the prediction of antibody epitopes. BMC Bioinformatics. 2008;9:514. DOI: 10.1186/1471-2105-9-514

[131] Rubinstein ND, Mayrose I, Martz E, Pupko T. Epitopia: a web-server for predicting B-cell epitopes. BMC Bioinformatics. 2009;10:287. DOI: 10.1186/1471-2105-10-287

[132] Liang S, Zheng D, Standley DM, Yao B, Zacharias M, Zhang C. EPSVR and EPMeta: prediction of antigenic epitopes using support vector regression and multiple server results. BMC Bioinformatics. 2010;11:381. DOI: 10.1186/1471-2105-11-381

[133] Liang S, Zheng D, Zhang C, Zacharias M. Prediction of antigenic epitopes on protein surfaces by consensus scoring. BMC Bioinformatics. 2009;10:302. DOI: 10.1186/1471-2105-10-302

[134] Zheng W, Ruan J, Hu G, Wang K, Hanlon M, Gao J. Analysis of conformational B-cell epitopes in the antibody–antigen complex using the depth function and the convex hull. PLoS One. 2015;10:e0134835. DOI: 10.1371/journal.pone.0134835

[135] AgAbDb. Available from: http://bioinfo.net.in/AgAbDb.htm [Accessed: 2016-03-07].

[136] Weber CA, Mehta PJ, Ardito M, Moise L, Martin B, De Groot AS. T cell epitope: friend or foe? Immunogenicity of biologics in context. Adv Drug Deliv Rev. 2009;61:965–976. DOI: 10.1016/j.addr.2009.07.001


[150] Sela-Culang I, Ofran Y, Peters B. Antibody specific epitope prediction—emergence of a new paradigm. Curr Opin Virol. 2015;11:98–102. DOI: 10.1016/j.coviro.2015.03.012

[151] Tong JC, Tammi MT. Prediction of protein allergenicity using local description of amino acid sequence. Front Biosci. 2007;13:6072–6078. DOI: 10.2741/3138

[152] Dall'Antonia F, Gieras A, Devanaboyina SC, Valenta R, Keller W. Prediction of IgE-binding epitopes by means of allergen surface comparison and correlation to cross-reactivity. J Allergy Clin Immunol. 2011;128:872–879. DOI: 10.1016/j.jaci.2011.07.007

[153] Lollier V, Denery-Papini S, Brossard C, Tessier D. Meta-analysis of IgE-binding allergen epitopes. Clin Immunol. 2014;153:31–39. DOI: 10.1016/j.clim.2014.03.010

[154] LocAllEpi. Available from: http://wwwappli.nantes.inra.fr:8180/LocAllEpi/ [Accessed: 2016-03-07].

[155] Aalberse RC, Stadler BM. In silico predictability of allergenicity: from amino acid sequence via 3-D structure to allergenicity. Mol Nutr Food Res. 2006;50:625–627. DOI: 10.1002/mnfr.200500270

[156] Canonica GW, Bachert C, Hellings P, Ryan D, Valovirta E, Wickman M, De Beaumont O, Bousquet J. Allergen immunotherapy (AIT): a prototype of precision medicine. World Allergy Organ J. 2015;8:31. DOI: 10.1186/s40413-015-0079-7

### **Bioinformatics for Membrane Lipid Simulations: Models, Computational Methods, and Web Server Tools**

S. W. Leong, T. S. Lim and Y. S. Choong

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/62576

#### **Abstract**

Biological membranes are complex environments consisting of different types of lipids and membrane proteins. The structure of a lipid bilayer is typically difficult to study because the membrane liquid crystalline state is made up of many disordered lipid molecules, which makes it impossible to describe the membrane's properties through the conformation of any single lipid molecule. Molecular dynamics (MD) simulations have been used extensively to investigate the properties of membrane lipids, lipid vesicles, and membrane protein systems. All-atom membrane models can elucidate detailed contacts between membrane proteins and their surrounding lipids, while united-atom and coarse-grained descriptions have allowed larger models and longer timescales, up to the microsecond mark, to be probed. Additionally, membrane models with mixed phospholipid and lipopolysaccharide content have made it possible to build improved views of biological membranes. Here, we present an overview of lipid force fields commonly used by the biosimulation community, useful tools for membrane MD simulations, and recent advances in membrane simulations.

**Keywords:** all-atom (AT), coarse-grain (CG), force field, lipid bilayer, united-atom (UA)

### **1. Introduction**

The biological membrane is a selective barrier delineating the boundaries of cells and organizing cellular organelles into compartments. One of the major functions of the membrane is to regulate movements of water, ions, and certain constituents in and out of the cell or organelle. Singer and Nicolson described the membrane as a "fluid mosaic" model in which lipids and proteins freely diffuse in the membrane plane [1]. While the biological membrane actually contains 1.5- to 4-fold more protein than lipid by weight, it is usually described as phospholipids arranged in a bilayer.

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The lipid composition of a membrane is highly specific, with variation identified between membranes of different organisms, different cells of the same organism, and even different membranes of the same cell [2]. Generally, a lipid consists of a polar head group and a hydrophobic tail region, and a single cell may contain over a thousand different lipid species. Bilayers and other lipid structures such as micelles form as the hydrophobic tails orient away from the aqueous environment while the polar head groups interact favorably with water. Changes in bilayer structure can affect the function of the membrane and of membrane-embedded proteins and complexes. The surrounding environment can greatly affect the structure and function of proteins, much like the effects of water on water-soluble proteins. The membrane produces a complex environment for embedded proteins, and membrane thickness, fluidity, charge, curvature, and phase have all been demonstrated to play a role in determining protein structure and function [3].

The structure and properties of membranes are a complex matter and cannot be easily described by a single lipid molecule alone. Molecular dynamics (MD) simulations offer a viable alternative for studying the properties of membranes and lipid-forming structures such as vesicles. This review serves as an introduction to currently available membrane lipid force fields and recent advances in membrane simulations.

### **2. Force fields for lipid simulation**

In general, membrane lipid force fields fall into three classes: all-atom (AT), united-atom (UA), and coarse-grained (CG). **Figure 1** illustrates a lipid in the AT, UA, and CG representations.

**Figure 1.** Representation of 1,2-dipalmitoyl-sn-glycero-3-phosphocholine (DPPC) with (a) atomistic (all-atom; AT), (b) united-atom (UA), and (c) coarse-grain (CG) force fields as van der Waals spheres.
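All three representations are parameterized within essentially the same classical potential-energy function and differ mainly in which interaction sites exist and how the parameters are fitted. Omitting force-field-specific cross terms (e.g., the Urey-Bradley and CMAP corrections of CHARMM), the core bonded and nonbonded terms take the familiar form:

```latex
U(\mathbf{r}) = \sum_{\text{bonds}} k_b (b - b_0)^2
              + \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2
              + \sum_{\text{dihedrals}} k_\phi \left[ 1 + \cos(n\phi - \delta) \right]
              + \sum_{i<j} \left\{ 4\varepsilon_{ij}
                \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12}
                     - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right]
              + \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}} \right\}
```

In CG force fields such as MARTINI, the same Lennard-Jones and Coulomb terms act between bead sites rather than atoms, which is what makes the larger systems and longer timescales tractable.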

### **2.1. All-atom (AT) MD simulation: details and details**


AT MD simulation represents every atom in the system as a single interaction site. **Figure 2** shows an example AT simulation system of *Salmonella enterica ser.* Typhi TolC protein in POPE. To date, Chemistry at HARvard Macromolecular Mechanics (CHARMM) and Assisted Model Building with Energy Refinement (AMBER) are the only fully AT force field parameterizations available for lipids. Prior to the development of CHARMM36, CHARMM27r was widely used for membrane simulations [4, 5]. Simulations using CHARMM27r require a large positive surface tension (30–40 dyn/cm) to achieve the experimentally determined surface area per lipid (APL). However, theoretical considerations of self-assembled systems and macroscopic black lipid bilayers [6] indicate that the surface tension of bilayers is about zero to several dyn/cm even when undulations are taken into account [7]. Therefore, simulations of lipid bilayers using CHARMM27 and CHARMM27r shrink to a near gel phase state without the use of surface tension. Additionally, CHARMM27 and CHARMM27r also failed to reproduce the experimental deuterium order parameters, *SCD*, in the glycerol and upper chain regions [4]. A wide range of glycerophospholipids exhibit splitting at carbon 2 of the aliphatic chain and carbon 1 of the glycerols, but this observation could not be replicated when simulations were performed with CHARMM27 or CHARMM27r [8]. This may affect conclusions about interactions of lipids with surface-active agents drawn from MD simulations. Besides that, the area compressibility modulus, *KA*, was underestimated, the head group region was underhydrated, and the electron density in the bilayer midplane was underestimated, while the frequency dependence of the 13C NMR *T1* of the acyl chains near the head group was overestimated.
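The deuterium order parameter *SCD* discussed above is defined from the angle θ between each C-H (or C-D) bond vector and the bilayer normal, *SCD* = ⟨(3 cos²θ − 1)/2⟩. A minimal sketch of how it might be computed from sampled bond vectors (the function name and input vectors are illustrative, not taken from any particular simulation package):

```python
import numpy as np

def deuterium_order_parameter(ch_vectors, normal=(0.0, 0.0, 1.0)):
    """S_CD = <(3*cos^2(theta) - 1) / 2>, averaged over bonds and frames.

    ch_vectors : (N, 3) array of C-H bond vectors sampled from a trajectory.
    normal     : bilayer normal (the z-axis by convention).
    """
    v = np.asarray(ch_vectors, dtype=float)
    n = np.asarray(normal, dtype=float)
    n /= np.linalg.norm(n)
    cos_theta = (v @ n) / np.linalg.norm(v, axis=1)
    return float(np.mean(1.5 * cos_theta**2 - 0.5))

# A bond perfectly aligned with the normal gives S_CD = 1;
# one perpendicular to the normal gives S_CD = -0.5.
print(deuterium_order_parameter([[0, 0, 1]]))   # 1.0
print(deuterium_order_parameter([[1, 0, 0]]))   # -0.5
```

In practice the average runs over many lipids and frames per carbon position, producing the per-carbon profiles (and the carbon-2 splitting) that the force-field comparisons above are judged against.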

**Figure 2.** All-atom (AT) simulation of *Salmonella enterica ser*. Typhi TolC protein in POPE.

These major weaknesses of the CHARMM27r force field eventually motivated the development of CHARMM36. CHARMM36 corrected most of the existing problems with lipid force fields, most importantly allowing the simulation of lipid bilayers without the use of surface tension, and improved the reproduction of the experimental deuterium order parameters in the glycerol and upper chain regions of phospholipids [8]. A comparative study of lipid force fields by Piggot and colleagues found that CHARMM36 was the only force field that accurately reproduced the experimental order parameters of carbon 2 in both acyl chains of 1,2-dipalmitoyl-*sn*-glycero-3-phosphocholine (DPPC) and 1-palmitoyl-2-oleoyl-*sn*-glycero-3-phosphocholine (POPC) [9]. The other properties of the membrane were reasonably well reproduced. However, there are several limitations to consider when using CHARMM36. One is the approximate treatment of long-range Lennard-Jones (LJ) forces for bilayers. Simulations augmented with long-range LJ forces using the 3D-isotropic periodic sum/discrete fast Fourier transform (3D-IPS/DFFT) method may improve results for lipid monolayers, but this increases surface tension, making the approach less suited for bilayer simulations. The recommendation for bilayer simulations is to use particle-mesh Ewald (PME) with *rc* = 10 or 12 Å and no long-range correction for the LJ term, but these settings will cause underestimation of the surface tension of lipid monolayers.
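In GROMACS notation, the bilayer recommendation above (PME electrostatics, an LJ cutoff of 10–12 Å, no long-range LJ dispersion correction) might look like the following .mdp fragment. This is an illustrative sketch, not an official CHARMM36 input file; cutoff and switching values should be checked against the force field's own documentation.

```ini
; Illustrative non-bonded settings for a bilayer run (sketch only)
coulombtype   = PME          ; particle-mesh Ewald electrostatics
rcoulomb      = 1.2          ; nm (12 Angstrom real-space cutoff)
vdw-modifier  = Force-switch ; smooth LJ truncation
rvdw-switch   = 1.0          ; nm
rvdw          = 1.2          ; nm
DispCorr      = no           ; no long-range LJ correction, per the text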

AMBER force fields were generally less used for membrane protein simulation due to the lack of a specific parameter set for lipids. However, the general AMBER force field (GAFF), which was originally parameterized for the simulation of arbitrary organic molecules with pre-existing AMBER force fields, has been shown to reproduce lipid bilayer parameters satisfactorily [10]. GAFF has been tested on a range of lipids, including 1,2-dimyristoyl-d54-*sn*-glycero-3-phosphocholine (DMPC), 1,2-dioleoyl-*sn*-glycero-3-phosphocholine (DOPC), 1,2-dilauroyl-*sn*-glycero-3-phosphocholine (DLPC), DPPC, POPC, and 1-palmitoyl-2-oleoyl-*sn*-glycero-3-phosphoethanolamine (POPE) [11, 12]. The use of surface tension is necessary to achieve the correct APL for POPC [10], DOPC [13], and DMPC [12]. To better fit GAFF for lipid molecules, Dickson and colleagues set out to re-parameterize the LJ terms for acyl chain carbons and hydrogens, as the initial GAFF LJ terms were developed for proteins, nucleic acids, and small organic molecules [11].

In addition, torsion parameters were also re-optimized using high-level quantum chemical data with "Paramfit." This strategy has allowed tensionless simulation of the lipids in the isothermal-isobaric (NPT) ensemble, achieving a high level of agreement with experimental data, with volume per lipid (VL) and thickness values within 5% of experiment [11]. However, the APL for POPE lipids is lower than expected and the thickness value was overestimated by 10% [11]. The force field has also been shown to reproduce the data on large lipid bilayers containing 288 lipids, with little change in the APL, VL, and bilayer thickness [11]. Comparison with the CHARMM27 and Berger force fields also demonstrated GAFF's ability to reproduce the order asymmetry found at the beginning of lipid acyl chains [13]. Prior to the development of CHARMM36, GAFF was the only force field capable of capturing the carbon-2 deuterium order parameter splitting [13], although it has also produced lower APL and higher deuterium order parameters compared to experimental data [10, 12, 13]. Even so, the compatibility of GAFF with previous AMBER force fields makes it a suitable choice for the simultaneous simulation of membranes, proteins, and organic molecules [13].


The recently developed Lipid14 force field, with updated Lipid11 head group and tail group charges and parameters, enables proper tensionless simulation of lipid bilayers in AMBER [14]. Lipid14 LJ and torsion parameters were modified to reproduce the experimental density, *ρ*, and heat of vaporization, *∆Hvap*, of alkanes of different chain lengths by fitting the CH2-CH2-CH2-CH2 torsion to *ab initio* data using "Paramfit" and altering the LJ and torsion terms simultaneously. The Lipid14 force field has a modular nature that allows new lipid species to be added by constructing them from head group and tail group "building blocks." Testing of Lipid14 on DLPC, DMPC, DPPC, DOPC, POPC, and POPE showed that the APL for all simulations was within 3% of the experimental value, with the exception of POPE. Its APL was closer to the older experimental value of 56.6 Å² [15] than to the more current APL of 59–60 Å² [16]. The VL was found to be within 5% of the experimental value, which was acceptable, although it may be considered a slight underestimation [14].
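The APL values being compared above come from a simple geometric definition: the lateral box area divided by the number of lipids per leaflet. A sketch, assuming a planar bilayer in the xy-plane with equally populated leaflets (the function name and the numbers are illustrative):

```python
def area_per_lipid(box_x, box_y, n_lipids):
    """APL = lateral box area / lipids per leaflet (assumes equal leaflets)."""
    if n_lipids % 2:
        raise ValueError("expected an even total lipid count across two leaflets")
    return (box_x * box_y) / (n_lipids // 2)

# e.g., 72 lipids (36 per leaflet) in a 47.6 x 47.6 Angstrom box:
print(round(area_per_lipid(47.6, 47.6, 72), 1))  # 62.9 (Angstrom^2)
```

In an NPT run this quantity fluctuates frame to frame, so reported APLs are time averages over the equilibrated portion of the trajectory.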

Furthermore, the isothermal area compressibility modulus, *KA*, falls close to experiment, again with the exception of POPE, which has a higher *KA* with a large standard deviation. The authors suggested that implementation of other barostats in AMBER may improve *KA* values, as the Berendsen method for pressure control is not ideal for simulations in which volume fluctuations are an important parameter capable of influencing the outcome of the simulation [14]. The Luzzati thickness, *DB*, which is calculated using the z-dimension of the simulation box and the integral probability distribution of the water density along the z-axis, was slightly underestimated for saturated lipids, which implies more water penetration into the hydrophobic region of the bilayer of these lipids. Lipid14 was also able to reproduce the experimental order parameter trend, including the splitting of the sn-2 chain from the sn-1 chain and the drop at the carbon-9 and carbon-10 positions of POPC and POPE lipids, where the *cis* double bond is located. Results from GPU repeats and CPU runs were also consistent.
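The area compressibility modulus discussed here is typically estimated from the equilibrium fluctuations of the lateral box area in an NPT run, *KA* = *kBT*⟨A⟩/var(A). A sketch with synthetic numbers (the area series below is invented for illustration and is far too short for a real estimate):

```python
import numpy as np

KB = 1.380649e-23  # Boltzmann constant, J/K

def area_compressibility_modulus(areas_m2, temperature_k):
    """K_A = kB * T * <A> / var(A), in N/m, from sampled lateral box areas (m^2)."""
    a = np.asarray(areas_m2, dtype=float)
    return KB * temperature_k * a.mean() / a.var()

# Synthetic area series; a real estimate would use thousands of frames.
areas = [1.00e-17, 1.02e-17, 0.98e-17]
print(area_compressibility_modulus(areas, 303.15))
```

Because the estimate divides by the sampled variance, under-converged trajectories and overly stiff barostats both bias *KA*, which is consistent with the barostat caveat raised in the text.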

Recent analyses applying full AT force fields have reached timescales of up to tens of nanoseconds, with ambitious simulations pushing the microsecond mark, e.g., to characterize the interaction of the multiple sclerosis synthetic biomarker CSF114(Glc) with the membrane bilayer [17], to probe the huntingtin Htt17 membrane anchor on a POPC bilayer [18], to study the mixing of lipids [19], and to examine the membrane-binding mechanism of the yeast Osh4 peripheral membrane protein [20]. Even though microsecond simulations have been reported, the accessibility of this approach is limited, as many biologically relevant phenomena may require sampling times beyond the microsecond timescale and investigators may not have access to intensive computational resources. Given that lipid reorganization is quite slow, a properly equilibrated membrane may require 20–40 ns of simulation time. Convergence of membrane protein simulations also remains a question, as 100 ns of simulation may still be insufficient to fully describe, e.g., rhodopsin loop fluctuations in a membrane [21]. Indeed, since all atomic-level interactions are retained and time steps for integrating the Newtonian motions are in the femtosecond range, AT simulation can be a time-consuming and computationally expensive practice. The issue of convergence has been addressed in part by running several long simulations and by using non-equilibrium sampling methods or steered MD to calculate the observables of interest. A series of extended simulations (~100 ns) on the voltage sensor domain of potassium channels revealed the importance of lipid phosphates in accommodating the significantly charged S4 helix [22]. Using umbrella sampling, the potential of mean force has been calculated for permeation and for the effect of a potassium ion on a second ion passing through the gramicidin A channel [23].

Regardless, AT simulations still provide the highest level of detail and reliability when it comes to quantitative prediction of properties such as motional timescales or interaction strengths. Such detail has recently provided insight into protein-membrane interactions, such as the rearrangement of amino acid side chains and local bilayer deformation due to hydrophobic mismatch, and how the hydrophobic membrane layer accommodates the charged arginine side chain of outer membrane phospholipase A [24, 25].

### **2.2. United-atom (UA) lipid models: the best of both worlds?**

The UA representation of lipids simplifies the carbon tails of the lipid by combining each aliphatic carbon and its hydrogen atoms into a single particle. Because the non-polar hydrogen atoms are treated implicitly, the number of interaction sites per lipid can be reduced by two-thirds. The computational cost for simulations of such membrane systems becomes relatively cheap, as about 60% of the pairwise interactions in the membrane are eliminated. The model lipid DPPC can be represented by 50 particles in a UA force field but requires 130 interaction sites in an AT force field. Since limited physical information can be collected from explicit acyl chain hydrogens, it is often desirable to combine a UA force field for the membrane with an AT protein force field for membrane protein simulation [26, 27].
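The cost argument above follows from the roughly quadratic growth of pairwise interactions with the number of sites. A back-of-the-envelope check using the DPPC site counts from the text (this counts unique pairs within a single lipid; real neighbor-list costs also depend on cutoffs and system density):

```python
def pairwise_interactions(n_sites):
    """Number of unique site pairs: n * (n - 1) / 2."""
    return n_sites * (n_sites - 1) // 2

at_pairs = pairwise_interactions(130)  # all-atom DPPC
ua_pairs = pairwise_interactions(50)   # united-atom DPPC
print(at_pairs, ua_pairs)  # 8385 1225
print(f"{1 - ua_pairs / at_pairs:.0%} fewer intralipid pairs")
```

The quadratic scaling is why even a modest per-lipid site reduction translates into a large saving once cutoff-based neighbor lists are built over the whole membrane.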

The UA lipid models parameterized by Berger et al. (1997) are among the most popular lipid force fields and were originally developed by Essex and colleagues [28] from the Optimized Potentials for Liquid Simulations (OPLS) UA force field. Bonded parameters of the Berger lipids were obtained from the GROMOS87 force field (note: GROMOS is the GROningen Molecular Simulation package), the acyl chains used Ryckaert-Bellemans dihedral parameters, whereas the van der Waals terms were from OPLS and the atomic partial charges were from Chiu and colleagues' calculations [29]. Berger and colleagues further optimized the LJ parameters of the lipid tails based on thermodynamic data for pentadecane [30]. Berger lipid parameters are recommended if one desires maximal sampling, owing to their fast diffusion and good simulation efficiency [9]. For membrane protein simulations, Berger lipids are commonly used with OPLS and GROMOS [27]. They have also been demonstrated to be compatible with AMBER99SB and can give marginally better free energy calculations than the widely used OPLS/Berger combination [31]. While simulations of membrane proteins using such hybrid parameters have been validated by various groups [32–34], the combination of different force fields requires care. Protein-lipid interactions in Berger-OPLS and Berger-GROMOS have been found to be overestimated and result in drastic changes in lipid properties upon protein insertion [13, 27].

A CHARMM UA representation of the acyl chains is also available and is compatible with simulations using AT CHARMM protein force fields [26]. The CHARMM UA model was derived from the C27 phospholipids, with the explicit hydrogen atoms of the acyl chains replaced by a UA representation. The force field, called C27-UA, retains the accuracy of its AT counterpart and provides a practical alternative when used in simulations of proteins and other compounds described by C27. C27-UA was parameterized by fitting to experimental data and to AT simulations of liquid phase model systems (pentadecane for saturated hydrocarbon chains, *cis*-5-decene for monounsaturated chains, and methyl hexanoate for the ester region). Simulation of a POPC bilayer with C27-UA was comparable to C27 and reproduced experimental NMR and X-ray diffraction data, including the electron density profile and carbon-deuterium order parameter. The free energy profiles for transfer of ethane, methanol, and water across a water-dodecane interface were identical in C27-UA and C27 simulations, suggesting that the force field is capable of simulating mixed systems containing UA lipids together with AT proteins and organic molecules described by the standard CHARMM force field. However, it also retains C27's requirement of a positive surface tension to reproduce the experimental APL.


Several GROMOS parameter sets are used in membrane simulations, such as 43A1-S3 [35], 53A6 [36], and Kukol's modification of the 53A6 parameters [37]. 43A1-S3 is an extension and modification of the 43A1 force field designed to improve the properties of lipid membranes in simulations. 43A1-S3 employs charges from Chiu et al. [29], while the van der Waals and dihedral parameters were modified from 43A1 to improve hydrocarbon and choline head group dynamics [38]. GROMOS 43A1-S3 accurately reproduces many properties of DPPC bilayers, including the APL, lipid diffusion, and the order parameter [39]. The 53A6 parameterization also uses Chiu et al.'s charges [29] and LJ parameters for the choline methyls and phosphate ester oxygen atoms, thus providing good agreement for the APL, density profile, and order parameter for saturated and unsaturated acyl chain PC lipids [40, 41]. Even with the improvements introduced, the isothermal area compressibility modulus was still overestimated. Meanwhile, Kukol reparameterized 53A6 following reports that the force field failed to reproduce DPPC satisfactorily [37]. He used Chiu's charges and increased the van der Waals radii of the carbonyl carbons of DPPC, DMPC, POPC, and POPG. He also employed non-standard GROMOS dihedral parameters for the double bond in the unsaturated lipids. This resulted in fairly good agreement with experimentally determined properties, with the exception of the order parameters for the *sn*-2 oleoyl chains. A comparative force field study by Piggot et al. recommended against using 43A1-S3 and Kukol's 53A6 POPC parameters for POPC membranes [9], because 43A1-S3 could not reliably reproduce the drop in order parameter at carbon 10 of the oleoyl chains, whereas Kukol's POPC parameters showed several disagreements with experimental values in terms of membrane thickness and order parameters [9].

With advances in computational power, large-scale simulation projects using a UA representation of the membrane region have been able to reach the microsecond mark. Access to such timescales allowed the observation of structural rearrangements and changes in the hydrogen bonding pattern of an integral Kv1.2 channel embedded in a hyperpolarized membrane [42]. On a shorter timescale, the CorA magnesium transport channel was observed to undergo conformational changes to a putative open state within 110 ns [43].

### **2.3. Coarse-grained (CG): the need for speed**

CG simulations are widely used to investigate phenomena occurring on timescales not accessible to AT simulation. In a CG simulation, 3–4 heavy (non-hydrogen) atoms are grouped together and represented by a single particle. For example, a DMPC lipid consisting of 130 atoms can be represented by 12 interaction sites [44]. The choline moiety is modeled with a single positively charged particle, the phosphate group with a negatively charged particle, and the glycerol linkages with two nonpolar particles, while the lipid chains are modeled with four apolar particles each [45].
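The 12-site mapping described above can be written down as a simple lookup table. A sketch using MARTINI-style bead names (NC3, PO4, GL1/GL2, CnA/CnB); treat the exact names and type labels as illustrative rather than an official topology file:

```python
# Illustrative 12-bead CG mapping for a DMPC-like phosphatidylcholine lipid:
# choline (+1), phosphate (-1), two glycerol beads, and two 4-bead tails.
CG_BEADS = {
    "NC3": {"charge": +1, "kind": "choline"},
    "PO4": {"charge": -1, "kind": "phosphate"},
    "GL1": {"charge": 0, "kind": "glycerol"},
    "GL2": {"charge": 0, "kind": "glycerol"},
    **{f"C{i}{chain}": {"charge": 0, "kind": "tail"}
       for chain in "AB" for i in range(1, 5)},
}

print(len(CG_BEADS))                                # 12 interaction sites
print(sum(b["charge"] for b in CG_BEADS.values()))  # 0 (zwitterionic overall)
```

Such a table is essentially what a CG topology encodes: which atoms collapse into which bead, and the charge and interaction type each bead carries.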

Early CG approaches were typically parameterized based on comparison to AT simulations using inverse Monte Carlo schemes [46–48] or force matching [49], which aim to reproduce the structural details of a particular system at a particular state. Instead, the MARTINI CG force field (note: MARTINI is a CG force field developed by Marrink and coworkers [44]) calibrated the building blocks of biomolecules against thermodynamic data, particularly oil/water partitioning coefficients, so that there is no need to re-parameterize the model each time. A consistent, atomic-level-compatible CG force field is beneficial for multi-scale applications. In MARTINI, an average of four heavy atoms is represented by a single interaction site, with the exception of ring structures, for which 2 or 3 ring atoms are mapped to a CG bead. The beads are distinguished into four main types: polar, nonpolar, apolar, and charged. MARTINI was able to reproduce the properties of lipid bilayers at a semiquantitative level, including the APL, the distribution of groups across the membrane, and the bending and area compression moduli.

MARTINI 2.0 further improved the stress profile across the lipid bilayer and its tendency to form pores. Additionally, the free energies of lipid desorption and lipid flip-flop across the bilayer agreed well with AT simulations, and the condensing effect of cholesterol on the APL could be reproduced. MARTINI achieved 5- to 10-fold faster sampling of the configurational space of liquid hydrocarbons and of the lipid tails inside a bilayer compared to AT force fields [44]. Among other applications, MARTINI has enabled simulations of vesicle formation and fusion [50, 51], phase transitions of lipid bilayers [52], and the structure and dynamics of membrane-protein assemblies [53, 54]. MARTINI has also been supplemented with an implicit solvent force field, named Dry MARTINI, which provides a 1–2 order of magnitude speed-up and is expected to find application in simulations of large membranes containing millions of lipids [55].

The ELBA force field (an acronym for "electrostatics-based", by Orsi and Essex [56]) offers another alternative for CG simulations of lipids and lipid bilayers. The ELBA model differs from other CG methods in two respects. Notably, LJ interactions are treated using standard Lorentz-Berthelot mixing rules, similar to AT force fields. This is possible because of the explicit treatment of lipid electrostatics and water dipoles, whereby a relative dielectric constant of unity (ɛ<sup>r</sup> = 1) is used to model their interactions. In addition, a realistic dipolar water model based on a simple soft sticky dipole potential, i.e., the Stockmayer potential, is used in ELBA, which helps provide a correct diffusion coefficient for lipids in the liquid phase, a quantity often overestimated by other CG models. ELBA also shows 15- and 200-fold speed improvements over UA and AT models, respectively. Validation has been performed on DOPC, DOPE, and gel phase DSPC, whereby ELBA satisfactorily reproduced several fundamental experimental properties, including the APL, volume per lipid, curvature elastic constants, electrostatic potential distribution, and lipid diffusion coefficient. At the time of its development, ELBA was only available in the authors' in-house software, and it has since been implemented in LAMMPS [57]. However, there is as yet no compatible parameterization for proteins and other organic molecules, potentially limiting its use in mixed systems.
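The Stockmayer potential mentioned above combines a Lennard-Jones term with a point dipole-dipole interaction. A reduced-units sketch of that pair energy (the function name and parameters are illustrative, and ELBA's actual functional form differs in detail, e.g., in its shifting and charge-dipole terms):

```python
import numpy as np

def stockmayer_energy(r_vec, mu1, mu2, epsilon=1.0, sigma=1.0):
    """Lennard-Jones + point dipole-dipole pair energy in reduced units.

    U = 4*eps*((sigma/r)**12 - (sigma/r)**6)
        + (mu1 . mu2 - 3*(mu1 . rhat)*(mu2 . rhat)) / r**3
    """
    r_vec = np.asarray(r_vec, dtype=float)
    r = np.linalg.norm(r_vec)
    rhat = r_vec / r
    lj = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    dd = (np.dot(mu1, mu2) - 3.0 * np.dot(mu1, rhat) * np.dot(mu2, rhat)) / r**3
    return lj + dd

# Head-to-tail dipoles aligned along the separation axis are attractive:
print(stockmayer_energy([2.0, 0, 0], [1.0, 0, 0], [1.0, 0, 0]))  # negative
```

The orientation dependence of the dipole term is what lets a single-site water bead reproduce dielectric behavior that isotropic CG water models miss.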


CG simulations are being widely used to investigate phenomenon occurring in timescales not accessible by AT simulation. In a CG simulation, 3–4 heavy atoms (non-H) are grouped together and represented by a single particle. For example, a DMPC lipid consisting of 130 atoms can be represented by 12 interaction sites [44]. The choline moiety is modeled with a single positively charged particle, the phosphate group with a negatively charged particle, the glycerol linkages with two nonpolar particles, while the lipid chains are modeled with 4 apolar

Early CG approaches were typically parameterized based on comparison to AT simulations by using inverted Monte Carlo schemes [46–48] or force matching [49], which aims to repro‐ duce the structural details at a particular state for a particular system. Instead, the MARTINI (note: MARTINI is a CG force field developed by Marrink and coworkers [44]) CG force field calibrated the building blocks of biomolecules against thermodynamic data, particularly the oil/water partitioning coefficients so that there is no need to re-parameterize the model each time. A consistent, atomic level compatible CG force field will be beneficial for multi-scale applications. In MARTINI, an average of four heavy atoms were represented by a single interaction site, with the exception of ring structures which has 2 or 3 ring atoms mapped to a CG bead. The beads are then distinguished into four main types, i.e., polar, nonpolar, apolar, and charged. MARTINI was able to reproduce the properties of lipid bilayers on a semiquan‐ titative level. These are including the APL, the distribution of groups across the membrane

MARTINI 2.0 further improved the stress profile across the lipid bilayer and its tendency to form pores. Additionally, the free energy of lipid desorption and lipid flip-flop across the bilayer agreed well with AT simulations and the condensing effect of cholesterol on the APL can be reproduced. MARTINI achieved 5- to 10-fold faster sampling of the configurational space of liquid hydrocarbons and the lipid tails inside a bilayer compared to AT force fields [44]. Among others, MARTINI has allowed applications in simulations of vesicle formation and fusion [50, 51], phase transition of lipid bilayers [52], and the structure and dynamics of membrane-protein assemblies [53, 54]. Currently, MARTINI has also been supplemented with an implicit solvent force field, named Dry MARTINI, which provided 1–2 order speed up and is expected to find application in simulations of large membranes containing millions of lipids

The ELBA force field (an acronym for "electrostatics-based") by Orsi and Essex [56] offers another alternative for CG simulations of lipids and lipid bilayers. The ELBA model differs from other CG methods in two notable respects. First, LJ interactions are treated using standard Lorentz-Berthelot mixing rules, as in AT force fields. This is possible because lipid electrostatics and water dipoles are treated explicitly, so a relative dielectric constant of unity (ɛ<sub>r</sub> = 1) can be used to model their interactions. Second, a realistic dipolar water model based on a simple soft sticky dipole potential, the Stockmayer potential, is used in ELBA; this helped reproduce a correct diffusion coefficient of lipids in the liquid phase, a quantity often overestimated by other CG models. ELBA also shows 15- and 200-fold speed improvements over UA and AT models, respectively. CG models for proteins are also available and have been adopted into protein-bilayer and peptide-bilayer systems [58, 59]. The advantage of employing CG models is the gain in speed and accessible system size. Because CG force fields have fewer degrees of freedom and remove high-frequency motions such as hydrogen bond vibrations, the integration time step can be pushed up to 20–40 fs, increasing the accessible timescale by roughly 100-fold. It is therefore possible to access phenomena occurring at timescales not reachable by classical atomistic simulations, for example membrane protein aggregation, as demonstrated by the formation and domain-specific distribution of Ras proteins in plasma membranes [60]. CG simulation has also been used to self-assemble lipid bilayers around membrane proteins, providing information on protein positioning in the bilayer [61]. Comparison of the membrane positioning of six proteins in the CG approach with available experimental data showed high qualitative similarity, even though there are discrepancies arising from the different lipid compositions used in the simulations and experiments [61].
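The roughly 100-fold gain quoted above can be rationalized as a product of independent factors. The sketch below is a back-of-envelope estimate only; all numbers (time steps, 4:1 mapping, smoothness factor) are illustrative assumptions, not values taken from any specific force field paper.

```python
# Back-of-envelope estimate of CG speedup over AT; all inputs are
# illustrative assumptions, not published benchmark numbers.

def cg_speedup(dt_at_fs=2.0, dt_cg_fs=30.0, mapping=4.0, smoothness=3.0):
    """Rough multiplicative speedup of a CG simulation over an AT one.

    dt ratio:  larger time step once high-frequency motions are removed.
    mapping:   ~4 heavy atoms per CG bead -> fewer pairwise force terms.
    smoothness: faster effective dynamics on the smoothed CG landscape.
    """
    timestep_gain = dt_cg_fs / dt_at_fs   # e.g. 30 fs / 2 fs = 15x
    particle_gain = mapping               # fewer force evaluations
    return timestep_gain * particle_gain * smoothness

print(round(cg_speedup()))  # ~180x with these assumed numbers
```

With these assumed inputs the estimate lands in the same order of magnitude as the ~100-fold figure cited in the text.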

Nevertheless, owing to the simplified representation, some loss of detail is inevitable. Two approaches have been employed to probe detailed interactions while simulating in CG membrane models: mixed AT-CG simulation and multiscaling. To elucidate small molecules, membrane peptides, and proteins at high resolution, some studies have used mixed CG-AT systems in which the ligand and protein are simulated atomistically within a CG membrane [62, 63]. Such mixed descriptions can be likened to quantum mechanical/molecular mechanical (QM/MM) simulations, which have achieved considerable success in the past decade (see [64, 65] for reviews of QM/MM with proteins). A major consideration in building a mixed AT-CG system is the parameterization and definition of the interactions at the AT-CG interface. As many CG force fields are derived from AT simulations, force-matching procedures can be used to derive an effective pairwise CG force field from AT simulations. In the AT-CG approach, an all-atom simulation of the whole system is first carried out; the system is subsequently divided into AT and CG parts, with the most interesting part of the system retaining atomistic detail. The effective AT-CG force field is then obtained by treating the AT and CG parts equally in the force-matching procedure [63].
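The core of force matching is a least-squares fit: choose coefficients of a basis expansion of the pairwise CG force so that it best reproduces reference forces from the atomistic trajectory. The sketch below uses entirely synthetic data (an LJ-like reference force and a Gaussian basis) to show the shape of that fit; a real workflow would take the reference forces and distances from AT simulation output.

```python
import numpy as np

# Toy force matching: fit f(r) = sum_k c_k * b_k(r), linear in Gaussian
# basis functions b_k, to "reference" forces. Data here are synthetic;
# a real workflow uses forces sampled from an atomistic trajectory.

rng = np.random.default_rng(0)
r = rng.uniform(0.9, 1.5, size=200)          # sampled bead-bead distances (nm)
f_ref = 24.0 * (2.0 / r**13 - 1.0 / r**7)    # pretend AT-derived forces (LJ-like)

# Design matrix B[i, k] = b_k(r_i), Gaussians centred along r.
centres = np.linspace(0.9, 1.5, 12)
B = np.exp(-((r[:, None] - centres[None, :]) ** 2) / (2 * 0.06**2))

coef, *_ = np.linalg.lstsq(B, f_ref, rcond=None)  # min ||B c - f_ref||^2
f_fit = B @ coef
rms_err = np.sqrt(np.mean((f_fit - f_ref) ** 2))
print(f"RMS force mismatch: {rms_err:.3g}")
```

Production force-matching codes solve the same kind of overdetermined linear system, only with spline bases and millions of sampled configurations.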

CHARMM-GUI PACE CG Builder [66] was developed for modeling large and complex biological systems. It utilizes a mixed UA/CG scheme in which the PACE force field is used for the protein in UA form and the MARTINI CG force field for the water, ions, and lipids. The number of atoms in a system can thus be reduced by a factor of 10. Analysis showed that in PACE/MARTINI hybrid simulations most proteins stayed within a root mean square deviation (RMSD) of less than 3 Å.
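The RMSD figure quoted above measures how far a structure has drifted from a reference conformation. A minimal version of that calculation (translation removed; a full treatment would also apply a Kabsch rotation before comparing) can be sketched as:

```python
import numpy as np

# Minimal RMSD between two N x 3 coordinate arrays after removing the
# translational offset. A full structural RMSD would also optimally
# rotate one structure onto the other (Kabsch algorithm) first.

def rmsd(a, b):
    a = a - a.mean(axis=0)   # centre both structures at the origin
    b = b - b.mean(axis=0)
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))

ref = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]])
moved = ref + np.array([10.0, -3.0, 2.0])   # pure translation of ref
print(rmsd(ref, moved))                     # 0.0: translation is removed
```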

On the other hand, several methods have been developed for converting a CG system back to an AT representation, through a fragment-based approach [67], simulated annealing [68], or force-matching [49]. These methods, termed the multiscale approach, allow CG simulations to be used to explore membrane-lipid interactions, which, after a sufficient equilibration period, are converted back to an AT model for detailed characterization [67].

### **3. Tools for membrane simulation setup and analysis**

### **3.1. Automated setup of membrane simulation systems**

The availability of many different force fields and parameters for a range of lipid molecules has made it easier to construct systems consisting of mixed lipids. The lipid converter tool is therefore useful for adapting a system between force fields [69]. It can be used on the command line by supplying a PDB or Gromacs coordinate file, and is also available as a web server [70]. The tool currently supports the Berger, GROMOS 43A1-S3, GROMOS 53A6, GROMOS 53A7, CHARMM36, OPLS-UA, and Lipid11 force fields. Stockholm lipids, which are compatible with the AMBER force field and use CHARMM nomenclature, are also supported by this extension. Moreover, lipid converter can be useful for building non-conventional systems, as it can generate asymmetric lipid distributions and even label leaflets in curved systems such as vesicles.

To automate the building of heterogeneous membranes, the CHARMM-GUI Membrane Builder [71] is a useful tool for generating coordinates for membrane models and protein/membrane systems [72–74]. The Membrane Builder offers a selection of commonly used lipid models, in addition to cholesterol, which can be customized according to concentration, APL, hydration number, and thickness of the water layer [73].

Alternatively, the MemGen web server can automatically set up lipid membrane simulation systems without restrictions on force field, lipid type, or MD simulation software [75]. The user can upload one or more lipid structure files, as well as amphiphilic molecules such as alcohols or detergents. A compact representation of each lipid aligned along the *z*-axis is generated by building a GAFF topology for each lipid using ACPYPE, then applying simulated annealing with constant-force pulling on the head group and tail atoms, as well as position-restraining potentials, with Gromacs. The server subsequently hydrates the membrane with a number of water molecules, which can also be specified by the user. After the addition of counter ions or sodium chloride, a PDB file of the final structure is available for download. It must be noted, however, that MemGen produces highly ordered, unphysical configurations that require careful equilibration of at least 10 ns. In addition, it is unable to produce asymmetric bilayers with different lipid compositions in the two monolayers.

iMembrane is another useful web-based tool, which can predict the orientation of a membrane protein within the membrane [76]. Early approaches used a two-state membrane model or a simple hydrophobic slab to model the orientation of a membrane protein in the membrane. Scott and colleagues developed CG MD simulations of membrane proteins in the presence of membrane lipids that self-assemble into a lipid bilayer [61]. Using these simulation results, iMembrane can predict the orientation of proteins of homologous structure or sequence. For any input sequence or structure, BLAST is first performed against the CGDB database. Matches are subsequently realigned to the query using MUSCLE [77] for sequence realignment or MAMMOTH [78] for structure superposition. Residues in the query are then annotated as N (not in contact with the membrane), H (in contact with the polar head groups of the membrane lipids), or T (in contact with the lipid hydrophobic tails).
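The N/H/T labels above can be illustrated with a deliberately simplified rule: classify each residue by its depth along the bilayer normal. The thresholds, residue names, and z-coordinates below are hypothetical; iMembrane itself derives contacts from CG simulation data rather than from fixed cutoffs.

```python
# Hypothetical simplification of iMembrane-style annotation: classify a
# residue by its z-coordinate relative to a bilayer assumed centred at
# z = 0, with tails at |z| < 1.4 nm and head groups at 1.4 <= |z| < 2.0 nm.

def annotate(z, tail_half=1.4, head_outer=2.0):
    if abs(z) < tail_half:
        return "T"   # in contact with lipid hydrophobic tails
    if abs(z) < head_outer:
        return "H"   # in contact with polar head groups
    return "N"       # not in contact with the membrane

residues = {"R12": 2.5, "W40": 1.6, "L55": 0.3}   # invented example data
print({name: annotate(z) for name, z in residues.items()})
# {'R12': 'N', 'W40': 'H', 'L55': 'T'}
```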

### **3.2. Tools for membrane MD simulation analysis**

94 Bioinformatics - Updated Features and Applications


As MD simulations of membrane and membrane-protein systems became widespread, many groups began developing tools to allow more efficient analysis of the MD trajectories. APL is an important indicator of the membrane phase and the stability of a simulation. GridMAT-MD is a Perl program that can calculate the APL as well as the thickness of a membrane [79]. For bilayer thickness calculation, the user can define a reference atom (such as the lipid phosphate, P, atom); the program first uses the upper leaflet as a reference and assigns to each upper-leaflet lipid a paired lipid in the lower leaflet based on proximity in the *x*- and *y*-directions. The *z*-distance between the two points is then calculated. The program repeats the same step using the bottom leaflet as reference, and the two results are averaged and written to a generic ASCII .dat file. Meanwhile, APL calculation for lipid-only systems can be as simple as taking the box area divided by the number of lipids in the upper or lower leaflet. Calculating the APL for membrane-protein systems is not as simple, and GridMAT-MD solves the problem by assigning protein atoms found within the lipid head groups to grid points and then subtracting the total protein area from the area of the system. As of version 2.0, GridMAT-MD can calculate the bilayer thickness and APL for multiple ".pdb" or ".gro" files.
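The two lipid-only quantities described above can be sketched directly: APL as box area over lipid count, and thickness by pairing each upper-leaflet phosphate with its nearest lower-leaflet neighbour in the xy plane. The coordinates below are synthetic stand-ins for trajectory data; box size and leaflet z-positions are assumed values.

```python
import numpy as np

# GridMAT-MD-style quantities for a lipid-only bilayer, on synthetic data.

box_x, box_y = 6.4, 6.4                   # nm, assumed box dimensions in xy
n_per_leaflet = 64
apl = (box_x * box_y) / n_per_leaflet     # area per lipid, nm^2
print(f"APL = {apl:.3f} nm^2")            # 0.640 nm^2

rng = np.random.default_rng(1)
upper = np.column_stack([rng.uniform(0, 6.4, (n_per_leaflet, 2)),
                         rng.normal(2.0, 0.1, n_per_leaflet)])   # P atoms, z ~ +2 nm
lower = np.column_stack([rng.uniform(0, 6.4, (n_per_leaflet, 2)),
                         rng.normal(-2.0, 0.1, n_per_leaflet)])  # P atoms, z ~ -2 nm

# Pair each upper-leaflet P with the closest lower-leaflet P in x/y,
# then take the mean z-separation as the thickness.
d_xy = np.linalg.norm(upper[:, None, :2] - lower[None, :, :2], axis=2)
partner = d_xy.argmin(axis=1)
thickness = np.mean(upper[:, 2] - lower[partner, 2])
print(f"thickness ~ {thickness:.2f} nm")  # ~4 nm by construction
```

GridMAT-MD additionally averages the same procedure with the lower leaflet as reference, which the sketch omits.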

Mori and colleagues proposed a more sophisticated method for calculating the APL using Voronoi tessellation and Monte Carlo integration [80]. The center-of-mass coordinates of each lipid molecule, and the coordinates of protein atoms located between the maximum and minimum *z*-coordinates of the monolayer, are projected onto the *xy* plane. A two-dimensional Voronoi analysis is subsequently performed for the lipids only. For non-boundary lipids, the APL is the area of the Voronoi polygon in which the lipid center of mass is located. For boundary lipids, the APL is determined with a Monte Carlo integration method in which the lipid region is probed by randomly shooting pseudoparticles into the lipid's Voronoi polygon: the APL of the boundary lipid is the product of the area of the Voronoi polygon and the probability of a shot missing a protein atom. This method finds application in the analysis of membrane-protein systems and can differentiate between boundary and non-boundary lipids.
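The boundary-lipid correction above is a rejection-sampling estimate. The sketch below uses deliberately simple geometry, a square "Voronoi cell" and a single protein atom modeled as a disc, to show the polygon-area-times-miss-probability idea; cell area, atom position, and radius are assumed values.

```python
import numpy as np

# Monte Carlo estimate of a boundary lipid's APL: (Voronoi cell area) x
# (probability a random pseudoparticle misses every protein atom).
# Simplified geometry: a square cell and one protein atom as a disc.

rng = np.random.default_rng(7)
poly_area = 0.64                              # nm^2, assumed Voronoi cell area
side = np.sqrt(poly_area)                     # square cell centred at origin
atom_xy, atom_r = np.array([0.1, 0.1]), 0.25  # protein atom projected onto xy

pts = rng.uniform(-side / 2, side / 2, size=(200_000, 2))  # pseudoparticles
miss = np.linalg.norm(pts - atom_xy, axis=1) > atom_r      # missed the atom?
apl_boundary = poly_area * miss.mean()
print(f"boundary-lipid APL ~ {apl_boundary:.3f} nm^2")
```

For this geometry the disc lies entirely inside the cell, so the estimate converges to poly_area minus the disc area (about 0.44 nm²), which makes the Monte Carlo result easy to check analytically.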

Analysis of other membrane properties has also become easier with the development of Membrainy, an intelligent membrane analysis tool that can calculate various membrane-specific properties from planar bilayer trajectories [81]. These include the APL, order parameters, head group orientation, lipid mixing/demixing entropy, time evolution of the transmembrane voltage, 2D surface map generation, gel percentage, membrane thickness, detection of lipid flip-flop, and annular shell lipid analysis. While the program has been designed primarily for use with the Gromacs MD package, it is also compatible with PDB trajectories from other MD packages. It currently implements the CHARMM36, Berger/GROMOS87, and MARTINI v2.0 force fields, but is expandable to include other force fields and trajectory formats. Output graphs can be read by the Grace plotting software.
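One of the listed quantities, the order parameter, has a compact definition worth spelling out: S = ⟨(3 cos²θ − 1)/2⟩, where θ is the angle between a bond vector in the lipid tail and the bilayer normal. The bond vectors below are synthetic limiting cases, not trajectory data.

```python
import numpy as np

# Order parameter S = <(3 cos^2(theta) - 1) / 2>, with theta the angle
# between a tail bond vector and the bilayer normal (taken as the z axis).

def order_parameter(vecs, normal=np.array([0.0, 0.0, 1.0])):
    v = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors
    cos_t = v @ normal
    return float(np.mean(1.5 * cos_t**2 - 0.5))

aligned = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])  # along the normal
print(order_parameter(aligned))   # 1.0: perfectly ordered
perp = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])      # in the xy plane
print(order_parameter(perp))      # -0.5: perpendicular to the normal
```

Real tails fall between these limits; gel-phase lipids sit near the ordered end and fluid-phase lipids nearer zero.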

MEMBPLUGIN [82] is another tool for studying MD trajectories of membrane-protein and complex membrane structures. It is a plugin for the VMD (Visual Molecular Dynamics) package that measures biophysical properties of the simulated membranes.

### **4. Towards realistic bilayer simulations**

Beyond the improvements in computational power, force field development, and CG methodologies, representations of the membrane have continued to evolve towards greater realism. The biological membrane is a complex entity composed of numerous lipid species such as phosphatidylcholines, phosphatidylethanolamines, and phosphatidylserines. Other molecules such as cholesterol, sphingomyelins, and cardiolipins also play roles in regulating membrane structure and function. In the outer membrane of many species of Gram-negative bacteria, the presence of lipopolysaccharide (LPS) in the upper leaflet modulates the insertion, folding, and dynamics of outer membrane proteins within the membrane. Available tools for generating mixed membrane bilayers include the CHARMM-GUI Membrane Builder [73] and MemBuilder [83], which support a total of 32 and 18 different lipid types, respectively.

Straatsma and Soares first reported the simulation of the outer membrane protein OprF in an asymmetric outer membrane containing lipopolysaccharide and phospholipids, describing the saccharide components using GLYCAM parameters [84]. Holdbrook et al. performed simulations of the *Haemophilus influenzae* Hia autotransporter domain in LPS and a realistic outer membrane inner leaflet comprising 1-myristoyl-2-palmitoleoyl phosphatidylethanolamine (MPoPE) lipid [85]. Comparison with simulations of the autotransporter in a simpler, single-species DMPC lipid model showed that the DMPC membrane accurately replicated the thickness of the outer membrane and reproduced dynamics of the protein similar to those in the asymmetric LPS/MPoPE membrane [85]. The realistic bilayer, however, revealed a patch of positive lysine and arginine residues at the extracellular mouth of Hia that interact regularly with the phosphate and sugar groups of the LPS and are suggested to anchor Hia within the outer membrane [85].

### **5. Conclusion**

The continuous updating and improvement of atomistic force fields has expanded the types of lipid molecules that can be simulated and increased their accuracy to better match experimental data. Depending on the level of detail required, UA force fields are an excellent alternative that balances accuracy and speed, while CG force field approaches tackle sampling and size limitations efficiently. Ultimately, advances in computational power and hardware have extended the timescales and system sizes at which MD can be employed. In membrane simulations, the microsecond mark has been reached, and simulations are slowly becoming routine work that complements experimental results. In addition, various web server tools and useful analysis programs have been developed to aid membrane simulation analysis. Further advances in lipid force fields will make it possible to characterize membrane structures at greater temporal and spatial scales.

### **Acknowledgements**


This work was funded by the Malaysia Ministry of Higher Education Fundamental Research Grant Scheme (FRGS) (203/CIPPM/6711439) and the Higher Institutions Center of Excellence (HICoE) Grant. S. W. Leong would also like to thank the Malaysia Ministry of Higher Education for the MyBrain Science scholarship.

### **Author details**

S. W. Leong, T. S. Lim and Y. S. Choong\*

\*Address all correspondence to: yeesiew@usm.my

Institute for Research in Molecular Medicine, Universiti Sains Malaysia, Minden, Malaysia

### **References**


[6] Tien HT, Diana AL. Bimolecular lipid membranes: a review and a summary of some recent studies. Chem. Phys. Lipids 1968;2:55–101. doi: 10.1016/0009-3084(68)90035-2

[7] Marsh D. Renormalization of the tension and area expansion modulus in fluid membranes. Biophys. J. 1997;73:865–869. doi: 10.1016/S0006-3495(97)78119-0

[8] Klauda JB, Venable RM, Freites JA, O'Connor JW, Tobias DJ, Mondragon-Ramirez C, Vorobyov I, MacKerell AD, Pastor RW. Update of the CHARMM all-atom additive force field for lipids: validation on six lipid types. J. Phys. Chem. B 2010;114:7830–7843. doi: 10.1021/jp101759q

[9] Piggot TJ, Piñeiro Á, Khalid S. Molecular dynamics simulations of phosphatidylcholine membranes: a comparative force field study. J. Chem. Theory Comput. 2012;8:4593–4609. doi: 10.1021/ct3003157

[10] Jójárt B, Martinek TA. Performance of the general AMBER force field in modeling aqueous POPC membrane bilayers. J. Comput. Chem. 2007;28:2051–2058. doi: 10.1002/jcc.20675

[11] Dickson CJ, Rosso L, Betz RM, Walker RC, Gould IR. GAFFlipid: a General AMBER force field for the accurate molecular dynamics simulation of phospholipid. Soft Matter 2012;8:9617–9627. doi: 10.1039/c2sm26007g

[12] Rosso L, Gould IR. Structure and dynamics of phospholipid bilayers using recently developed general all-atom force fields. J. Comput. Chem. 2007;29:24–37. doi: 10.1002/jcc

[13] Siu SWI, Vácha R, Jungwirth P, Böckmann R. Biomolecular simulations of membranes: physical properties from different force fields. J. Chem. Phys. 2008;128:125103. doi: 10.1063/1.2897760

[14] Dickson CJ, Madej BD, Skjevik ÅA, Betz RM, Teigen K, Gould IR, Walker RC. Lipid14: the AMBER lipid force field. J. Chem. Theory Comput. 2014;10:865–879. doi: 10.1021/ct4010307

[15] Rand RP, Parsegian VA. Hydration forces between phospholipid bilayers. BBA Rev. Biomembranes 1989;988:351–376. doi: 10.1016/0304-4157(89)90010-5

[16] Rappolt M, Hickel A, Bringezu F, Lohner K. Mechanism of the lamellar/inverse hexagonal phase transition examined by high resolution X-ray diffraction. Biophys. J. 2003;84:3111–3122. doi: 10.1016/S0006-3495(03)70036-8

[17] Bruno A, Scrima M, Novellino E, D'Errico G, D'Ursi AM, Limongelli V. The glycan role in the glycopeptide immunogenicity revealed by atomistic simulations and spectroscopic experiments on the multiple sclerosis biomarker CSF114(Glc). Sci. Rep. 2015;5:9200. doi: 10.1038/srep09200

[18] Côté S, Binette V, Salnikov ES, Bechinger B, Mousseau N. Probing the huntingtin 1-17 membrane anchor on a phospholipid bilayer by using all-atom simulations. Biophys. J. 2015;108:1187–1198. doi: 10.1016/j.bpj.2015.02.001


[31] Cordomí A, Caltabiano G, Pardo L. Membrane protein simulations using AMBER force field and Berger lipid parameters. J. Chem. Theory Comput. 2012;8:948–958. doi: 10.1021/ct200491c

[32] Jerabek H, Pabst G, Rappolt M, Stockner T. Membrane-mediated effect on ion channels induced by the anesthetic drug ketamine. J. Am. Chem. Soc. 2010;132:7990–7997. doi: 10.1021/ja910843d

[33] Lemkul JA, Bevan DR. Characterization of interactions between PilA from *Pseudomonas aeruginosa* strain K and a model membrane. J. Phys. Chem. B 2011;115:8004–8008. doi: 10.1021/jp202217f

[34] Lensink MF, Govaerts C, Ruysschaert J-M. Identification of specific lipid-binding sites in integral membrane proteins. J. Biol. Chem. 2010;285:10519–10526. doi: 10.1074/jbc.M109.068890

[35] Braun AR, Sachs JN, Nagle JF. Comparing simulations of lipid bilayers to scattering data: the GROMOS 43A1-S3 force field. J. Phys. Chem. B 2013;117:5065–5072. doi: 10.1021/jp401718k

[36] Oostenbrink C, Villa A, Mark AE, Van Gunsteren WF. A biomolecular force field based on the free enthalpy of hydration and solvation: the GROMOS force-field parameter sets 53A5 and 53A6. J. Comput. Chem. 2004;25:1656–1676. doi: 10.1002/jcc.20090

[37] Kukol A. Lipid models for united-atom molecular dynamics simulations of proteins. J. Chem. Theory Comput. 2009;5:615–626. doi: 10.1021/ct8003468

[38] Scott WRP, Hünenberger PH, Tironi IG, Mark AE, Billeter SR, Fennen J, Torda AE, Huber T, Krüger P, van Gunsteren WF. The GROMOS biomolecular simulation program package. J. Phys. Chem. A 1999;103:3596–3607. doi: 10.1021/jp984217f

[39] Chiu SW, Pandit SA, Scott HL, Jakobsson E. An improved united atom force field for simulation of mixed lipid bilayers. J. Phys. Chem. B 2009;113:2748–2763. doi: 10.1021/jp807056c

[40] Poger D, Van Gunsteren WF, Mark AE. A new force field for simulating phosphatidylcholine bilayers. J. Comput. Chem. 2010;31:1117–1125. doi: 10.1002/jcc.21396

[41] Poger D, Mark AE. On the validation of molecular dynamics simulations of saturated and cis-monounsaturated phosphatidylcholine lipid bilayers: a comparison with experiment. J. Chem. Theory Comput. 2010;6:325–336. doi: 10.1021/ct900487a

[42] Bjelkmar P, Niemelä PS, Vattulainen I, Lindahl E. Conformational changes and slow dynamics through microsecond polarized atomistic molecular simulation of an integral Kv1.2 ion channel. PLoS Comput. Biol. 2009;5:e1000289. doi: 10.1371/journal.pcbi.1000289

[43] Chakrabarti N, Neale C, Payandeh J, Pai EF, Pomès R. An iris-like mechanism of pore dilation in the CorA magnesium transport system. Biophys. J. 2010;98:784–792. doi: 10.1016/j.bpj.2009.11.009


[56] Orsi M, Essex JW. The ELBA force field for coarse-grain modeling of lipid membranes. PLoS One 2011;6:e28637. doi: 10.1371/journal.pone.0028637

[57] Plimpton S. Fast parallel algorithms for short-range molecular dynamics. J. Comput. Phys. 1995;117:1–19. doi: 10.1006/jcph.1995.1039

[58] Monticelli L, Kandasamy SK, Periole X, Larson RG, Tieleman DP, Marrink SJ. The MARTINI coarse-grained force field: extension to proteins. J. Chem. Theory Comput. 2008;4:819–834. doi: 10.1021/ct700324x

[59] Spijker P, van Hoof B, Debertrand M, Markvoort AJ, Vaidehi N, Hilbers PJ. Coarse grained molecular dynamics simulations of transmembrane protein-lipid systems. Int. J. Mol. Sci. 2010;11:2393–2420. doi: 10.3390/ijms11062393

[60] Janosi L, Li Z, Hancock JF, Gorfe AA. Organization, dynamics, and segregation of Ras nanoclusters in membrane domains. Proc. Natl. Acad. Sci. USA 2012;109:8097–8102. doi: 10.1073/pnas.1200773109

[61] Scott KA, Bond PJ, Ivetac A, Chetwynd AP, Khalid S, Sansom MSP. Coarse-grained MD simulations of membrane protein-bilayer self-assembly. Structure 2008;16:621–630. doi: 10.1016/j.str.2008.01.014

[62] Orsi M, Essex JW. Permeability of drugs and hormones through a lipid bilayer: insights from dual-resolution molecular dynamics. Soft Matter 2010;6:3797. doi: 10.1039/c0sm00136h

[63] Shi Q, Izvekov S, Voth G. Mixed atomistic and coarse-grained molecular dynamics: simulation of a membrane-bound ion channel. J. Phys. Chem. B 2006;110:15045–15048. doi: 10.1021/jp062700h

[64] Murphy RB, Philipp DM, Friesner RA. A mixed quantum mechanics/molecular mechanics (QM/MM) method for large-scale modeling of chemistry in protein environments. J. Comput. Chem. 2000;21:1442–1457. doi: 10.1002/1096-987X(200012)21:16<1442::AID-JCC3>3.0.CO;2-O

[65] Senn HM, Thiel W. QM/MM studies of enzymes. Curr. Op. Chem. Biol. 2007;11:182–187. doi: 10.1016/j.cbpa.2007.01.684

[66] Qi Y, Cheng X, Han W, Jo S, Schulten K, Im W. CHARMM-GUI PACE CG Builder for solution, micelle, and bilayer coarse-grained simulations. J. Chem. Inf. Model. 2014;54:1003–1009. doi: 10.1021/ci500007n

[67] Stansfeld PJ, Sansom MSP. From coarse grained to atomistic: a serial multiscale approach to membrane protein simulations. J. Chem. Theory Comput. 2011;7:1157–1166. doi: 10.1021/ct100569y

[68] Rzepiela AJ, Schäfer LV, Goga N, Risselada HJ, De Vries AH, Marrink SJ. Reconstruction of atomistic details from coarse-grained structures. J. Comput. Chem. 2010;31:1333–1343. doi: 10.1002/jcc.21415

[69] Larsson P, Kasson PM. Lipid converter, a framework for lipid manipulations in molecular dynamics simulations. J. Membr. Biol. 2014;247:1137–1140. doi: 10.1007/s00232-014-9705-5


[82] Guixa-Gonzalez R, Rodriguez-Espigares I, Ramirez-Anguita JM, Carrio-Gaspar P, Martinez-Seara H, Giorgino T, Selent J. MEMBPLUGIN: studying membrane complexity in VMD. Bioinformatics 2014;30:1478–1480. doi: 10.1093/bioinformatics/btu037

[83] Ghahremanpour MM, Arab SS, Aghazadeh SB, Zhang J, van der Spoel D. MemBuilder: a web-based graphical interface to build heterogeneously mixed membrane bilayers for the GROMACS biomolecular simulation program. Bioinformatics 2013;30:439–441. doi: 10.1093/bioinformatics/btt680

[84] Straatsma TP, Soares TA. Characterization of the outer membrane protein OprF of *Pseudomonas aeruginosa* in a lipopolysaccharide membrane by computer simulation. Proteins 2009;74:475–488. doi: 10.1002/prot.22165

[85] Holdbrook DA, Piggot TJ, Sansom MSP, Khalid S. Stability and membrane interactions of an autotransport protein: MD simulations of the Hia translocator domain in a complex membrane environment. Biochim. Biophys. Acta 2013;1828:715–723. doi: 10.1016/j.bbamem.2012.09.002

### **Bioinformatics Approaches for Predicting Kinase–Substrate Relationships**

Daniel A. Bórquez and Christian González-Billault

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/63761

### **Abstract**

Protein phosphorylation, catalyzed by protein kinases, is the main posttranslational modification in eukaryotes, regulating essential aspects of cellular function. Using mass spectrometry techniques, profound knowledge has been achieved about the localization of phosphorylated residues at the proteomic scale. However, the protein kinases responsible for such modifications remain largely unknown. To fill this gap, many computational algorithms capable of predicting kinase–substrate relationships have been developed. The greatest difficulty for these approaches is modeling the complex nature of kinase–substrate specificity. The vast majority of predictors are based on the linear primary-sequence pattern that surrounds phosphorylation sites. In the intracellular environment, however, protein kinase specificity is influenced by contextual factors such as protein–protein interactions, substrate co-expression patterns, and subcellular localization. Only recently has the development of phosphorylation predictors begun to incorporate these variables, significantly improving the specificity of these methods. Accurate modeling of kinase–substrate relationships could be the greatest contribution of bioinformatics to understanding physiological cell signaling and its pathological impairment.

**Keywords:** protein kinases, phosphorylation, machine learning methods, docking sites

### **1. Introduction**

Protein kinases constitute the second largest family of enzymes, comprising 518 members in the human genome [1]. These enzymes catalyze the transfer of the γ-phosphate moiety of adenosine triphosphate (ATP) to the hydroxyl group of serine, threonine, or tyrosine residues present in substrate proteins. The transient nature of this modification (reversed by dephosphorylation reactions, catalyzed by protein phosphatases) makes it the cell's main molecular switch, regulating each aspect of protein function, including interactions, conformations, subcellular localization, enzymatic activity, and turnover. Protein phosphorylation is also the most widespread post-translational modification, affecting at least three-quarters of the proteome [2].

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The identification of phosphorylated sites (phosphosites) has undergone an explosion with the use of mass spectrometry techniques. The PhosphoSitePlus database [3] collects a large part of the information obtained in these studies, including the localization of 144,899 phosphorylated serines, 61,654 threonines, and 41,273 tyrosines, but only 12,180 (5%) of them have an annotated protein kinase responsible for the modification [3]. This is largely due to the expensive and time-consuming methodologies needed to identify kinase–substrate relationships (KSRs).
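The coverage gap those counts imply can be restated as simple arithmetic, using the figures quoted from PhosphoSitePlus [3]:

```python
# The annotation gap in PhosphoSitePlus, from the counts cited above:
# only ~5% of catalogued phosphosites have a known upstream kinase.
ser, thr, tyr = 144_899, 61_654, 41_273
total = ser + thr + tyr
annotated = 12_180
print(total)                        # 247826 phosphosites in total
print(f"{annotated / total:.1%}")   # 4.9%, i.e. roughly the quoted 5%
```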

This complex scenario has opened an important field for the development of computational strategies that label phosphorylation sites with the specific protein kinase(s) responsible for their modification at the whole-proteome scale, in an effort to reconstruct the underlying regulatory networks. These approaches must overcome several challenges, including the complexity of the regulatory networks themselves and the scarce information available about the molecular mechanisms that govern recognition between protein kinases and substrates.

The currently available tools for KSR prediction are listed in **Table 1**. Of note, most of these tools are based on classifiers designed to assign a phosphorylation site to a particular protein kinase considering only the sequence pattern surrounding the phosphorylation site, which provides an imperfect description of kinase–substrate specificity. In this chapter, we discuss the underlying biological rationale of these tools and their potential for improvement.




| Method name | Approach | Contextual information | Training data | Number of kinase families | References | Website |
|---|---|---|---|---|---|---|
| HeteSim | Heterogeneous information networks | Yes | Phospho.ELM | 210 | [6] | No web implementation available |
| SlapRLS | Supervised Laplacian regularized least squares | No | Phospho.ELM | 23 | [7] | No web implementation available |
| PUEL | Positive-unlabeled ensemble learning | Yes | PhosphoSitePlus + literature | 2 | [8] | https://github.com/PengyiYang/KSP-PUEL |
| No name | PSSM | No | Unnecessary | 492 | [30, 31] | No web implementation available |
| Predikin | PSSM | No | Unnecessary | All | [32]. Previous developments: [33–35] | http://predikin.biosci.uq.edu.au/ |
| ConDens | Conservation of local motif density | No | Unnecessary | All kinases with known motifs | [29] | http://www.moseslab.csb.utoronto.ca/andyl/ |
| GPS | BLOSUM62 similarity | No | Phospho.ELM + PhosphoSitePlus + Swiss-Prot | | [25] (GPS 2.1), [26] (GPS 2.0), [27, 28] (GPS) | |
| NetPhosK | ANN | No | | 17 | [39, 40] | http://www.cbs.dtu.dk/services/NetPhosK/ |
| Musite | SVM | No | Phospho.ELM + Swiss-Prot + PhosphoPep | 13 | [36, 37] | http://musite.sourceforge.net/ |
| Phos3D | SVM | No | Phospho.ELM | 6 | [38] | http://phos3d.mpimp-golm.mpg.de/ |
| CRPhos | Conditional random fields | No | Phospho.ELM | 18 | [41] | http://www.ptools.ua.ac.be/CRPhos/ |
| PhoScan | Log odds ratios | No | Phospho.ELM | 48 | [43] | http://bioinfo.au.tsinghua.edu.cn/phoscan/ |
| MetaPredPS | Meta predictor | No | | 15 | [42] | Web implementation no longer available |
| KinasePhos 2.0 | SVM | No | Phospho.ELM + Swiss-Prot | 71 | [44]. Previous development: [45] (KinasePhos 1.0) | http://kinasephos2.mbc.nctu.edu.tw/ |
| PPSP | Bayesian | No | Phospho.ELM | 68 | [46] | Web implementation no longer available |

**Table 1.** Computational methods for kinase–substrate relationship prediction.

### **2. Comparing prediction tools: data, metrics, and methods**

One of the most challenging problems in the field of prediction tools is establishing benchmarks between them, allowing a real assessment of each method. Each prediction tool requires for its testing (and often for training) a set of positive data (sites actually phosphorylated) and negative data (sites actually not phosphorylated). The sources of phosphorylated sites for most of the predictors are limited to a few databases, such as Phospho.ELM [4] and PhosphoSitePlus [3]. These databases include information from different experimental approaches (*in vitro* and/or *in vivo*), which is processed homogeneously for training prediction algorithms. This can introduce a significant bias in the quality of predictions: protein kinases exhibit low specificity in *in vitro* experiments (which constitute the largest proportion of the information in databases), generating simpler motifs than those that may be present in cells [5]. Moreover, using information derived only from *in vivo* experiments does not ensure that the observed phosphorylation site was directly phosphorylated by the protein kinase under study. Careful selection of the positive data set for training, including only sites phosphorylated by a protein kinase both *in vivo* (physiological) and *in vitro* (direct), can significantly improve prediction [5].
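As an illustration, the filtering strategy described above amounts to intersecting the *in vivo* and *in vitro* evidence sets. A minimal sketch in Python follows; the kinase, substrate, and site annotations are hypothetical and serve only to show the set operation:

```python
# Illustrative sketch (hypothetical data): keep only sites reported as
# phosphorylated by the same kinase both in vivo and in vitro, yielding a
# higher-confidence positive training set.

def high_confidence_positives(in_vivo, in_vitro):
    """Return sorted (kinase, substrate, site) triples supported by both evidence types."""
    return sorted(set(in_vivo) & set(in_vitro))

# Hypothetical kinase-substrate-site annotations
in_vivo = {("CDK2", "RB1", "S807"), ("CDK2", "RB1", "T821"), ("CK2", "TP53", "S392")}
in_vitro = {("CDK2", "RB1", "S807"), ("CK2", "TP53", "S392"), ("CK2", "TP53", "S15")}

print(high_confidence_positives(in_vivo, in_vitro))
# [('CDK2', 'RB1', 'S807'), ('CK2', 'TP53', 'S392')]
```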

Another problem is the construction of the negative data set used in machine-learning methods. Although experiments can verify that a protein kinase phosphorylates a given residue, it is very difficult to demonstrate that a particular residue in a protein is never phosphorylated under any circumstances. A good approximation was made by Neuberger et al. [48], who consider as part of the negative data set any residue that is present in a protein phosphorylated by a particular protein kinase but that has not been reported as phosphorylated in databases.
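This heuristic can be sketched in a few lines; the substrate sequence and reported phosphosites below are hypothetical:

```python
# Minimal sketch of the negative-set heuristic attributed to Neuberger et al. [48]:
# for a protein known to be phosphorylated by a given kinase, every Ser/Thr/Tyr
# residue NOT reported as phosphorylated in databases is taken as a negative example.

def negative_sites(sequence, reported_positions):
    """Positions (1-based) of S/T/Y residues not reported as phosphorylated."""
    return [i + 1 for i, aa in enumerate(sequence)
            if aa in "STY" and i + 1 not in reported_positions]

seq = "MKSAPTLYQS"           # hypothetical substrate sequence
reported = {3, 8}            # S3 and Y8 reported as phosphosites
print(negative_sites(seq, reported))   # → [6, 10]  (T6 and S10)
```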

Sensitivity (*S*n) and specificity (*S*p) are commonly used to assess the performance of prediction algorithms. Among the sites predicted as positive, those actually positive (previously determined experimentally to be phosphorylated) are called true positives (TP), while the remaining ones are called false positives (FP). Concomitantly, among the sites predicted as non-phosphorylated, those that are really unphosphorylated are called true negatives (TN), whereas actual phosphorylated sites are considered false negatives (FN). The proportion of positive sites correctly classified is named sensitivity (*S*n), while the proportion of negative sites correctly identified is called specificity (*S*p). Both parameters are calculated as follows (Eq. (1)):

$$\begin{aligned} S\_{\rm n} &= \frac{\rm TP}{\rm TP + FN} \\\\ S\_{\rm p} &= \frac{\rm TN}{\rm TN + FP} \end{aligned}$$

Another common parameter used to evaluate predictor performance is the accuracy (Ac), which denotes the percentage of correct predictions over both the negative and positive data sets. The Matthews correlation coefficient (MCC) is also widely used as a general estimator of predictor performance. It considers the four quantities described in Eq. (1), giving a balanced assessment of the predictor's performance even when the positive and negative sets differ greatly in size. Both parameters are calculated as follows:

$$\text{Ac} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{TN} + \text{FN}}$$

$$\text{MCC} = \frac{(\text{TP} \times \text{TN}) - (\text{FN} \times \text{FP})}{\sqrt{(\text{TP} + \text{FN}) \times (\text{TN} + \text{FP}) \times (\text{TP} + \text{FP}) \times (\text{TN} + \text{FN})}}$$
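The four metrics above follow directly from the confusion-matrix counts. A minimal sketch (the counts used are arbitrary, for illustration only):

```python
import math

# Direct implementation of the definitions above: sensitivity, specificity,
# accuracy, and Matthews correlation coefficient from TP, FN, TN, FP.

def performance(tp, fn, tn, fp):
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    ac = (tp + tn) / (tp + fp + tn + fn)
    mcc = ((tp * tn) - (fn * fp)) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return sn, sp, ac, mcc

# Example confusion-matrix counts (arbitrary)
sn, sp, ac, mcc = performance(tp=80, fn=20, tn=90, fp=10)
print(f"Sn={sn:.2f} Sp={sp:.2f} Ac={ac:.2f} MCC={mcc:.2f}")
```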

The receiver operating characteristic (ROC) curve is commonly used for the evaluation and comparison of classifiers. The true positive rate (sensitivity) is plotted against the false positive rate (1 − specificity). A perfect classifier has an area under the curve (AUC) of 1, while a poor classifier achieves values near 0.5 (which corresponds to random guessing).
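The AUC can be computed without explicitly drawing the ROC curve: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (ties counting one half). A sketch with illustrative scores:

```python
# Rank-statistic computation of the ROC AUC: fraction of (positive, negative)
# pairs in which the positive outscores the negative (ties count 0.5).

def auc(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

positives = [0.9, 0.8, 0.6]   # predictor scores for known phosphosites (illustrative)
negatives = [0.7, 0.4, 0.2]   # scores for non-phosphorylated sites (illustrative)
print(auc(positives, negatives))   # → 0.888... (good classifier; 0.5 = random)
```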

It should be noted that the parameters described above can only be compared when the TP and TN data sets are similar, which is relatively common for the former but very uncommon for the latter. It would be necessary to define standard data sets adopted by the research community, or benchmarks that are independent of the training data sets. One approach is to compare predictors based on their ability to assign the lowest (best) rank to known phosphorylation sites in a proteomic search for phosphosites of a given protein kinase [5].

Apart from the difficulties in making quantitative comparisons between predictors, there is a perception that machine-learning algorithms, such as artificial neural networks (ANNs) or support vector machines (SVMs), predict protein kinase substrates better than simpler methods such as position-specific scoring matrices (PSSMs). This idea rests on the assumption that machine-learning algorithms are capable of classifying highly complex sequences in which correlations among positions are important. This assumption was recently questioned by studying the interpositional sequence dependence of ataxia telangiectasia mutated (ATM/ATR) kinase, casein kinase 2 (CK2), and cyclin-dependent kinase 2 (CDK2) substrates. Through statistical analysis, Joughin and colleagues [49] found few pairs of positions in the sequences of the phosphorylated sites that deviate significantly from positionwise independence. Accordingly, predictors that incorporate second-order information were less accurate than those that consider only first-order information, overfitting the training data [49]. This strongly suggests that a good strategy for developing more accurate predictive tools is to integrate simple sequence models with contextual information, such as protein–protein interactions, subcellular localization, and distal recognition sites.
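A first-order model of the kind discussed above can be sketched as a log-odds PSSM built from aligned phosphosite windows, assuming positional independence. The training peptides below are hypothetical 7-mers loosely mimicking an acidophilic (CK2-like) motif, not real database entries:

```python
from collections import Counter
import math

AA = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(peptides, pseudocount=1.0):
    """Column-wise log-odds scores against a uniform 1/20 background."""
    length = len(peptides[0])
    pssm = []
    for col in range(length):
        counts = Counter(p[col] for p in peptides)
        total = len(peptides) + pseudocount * len(AA)
        pssm.append({aa: math.log(((counts[aa] + pseudocount) / total) / (1 / len(AA)))
                     for aa in AA})
    return pssm

def score(pssm, peptide):
    # Positional independence: the peptide score is the sum of column scores.
    return sum(col[aa] for col, aa in zip(pssm, peptide))

# Hypothetical acidophilic training windows (phosphosite at position 3)
training = ["AASDEED", "GDSEEDE", "TESDDEE", "AESEEDD"]
pssm = build_pssm(training)
print(score(pssm, "AESDEED") > score(pssm, "AKSKRKK"))   # → True
```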


### **3. Beyond the sequence: improving substrate prediction with contextual information**

Two factors are important in determining the specific phosphorylation of substrates by protein kinases: recruitment and phosphorylation site recognition. Recruitment relates to a number of determining factors that promote a productive interaction between protein kinases and substrates, while phosphorylation site recognition relates to the preference for individual residues surrounding the modified residue (**Figure 1**).

**Figure 1.** A structural view of protein kinase specificity determinants. (A) Local interactions. Extensive contacts are established between the protein kinase active site and the region surrounding the phosphosite, which partly defines the specificity of protein kinases. For example, a complex between dual-specificity tyrosine-phosphorylation-regulated kinase (DYRK)-1A and a consensus substrate peptide (ARPGT\*PAL) is shown. Active site residues (F170, F196, Y246, D287, K289, E291, Y321, S324, Y327, E353; colored in yellow) establish contacts with the peptide substrate. (B) Distal docking sites. Often the interaction of the substrate with the active site of the protein kinase is not enough to ensure specificity. For example, extracellular signal-regulated kinase (ERK)-2 and its substrate ribosomal S6 kinase (RSK)-1 establish additional contacts distal to the protein kinase active site, through a linear binding motif (colored in yellow). (C) Scaffold proteins. Many kinases utilize scaffold proteins to be placed close to their substrates. For example, mitogen-activated protein kinase (MAPK) kinase (MKK)-5 organizes MAPK kinase kinase (MEKK)-3, MKK5, and ERK5 in a signaling complex. (D) Subcellular localization. Through protein–protein interactions, protein kinases are located in specific subcellular structures, wherein they phosphorylate specific substrates. For example, Polo-like kinase (PLK)-1 interacts through its Polo-box domain with Cdc25C, allowing its centrosomal localization. Molecular graphics were performed with the UCSF Chimera package [56] based on the following structures: DYRK1A-substrate complex (PDB: 2WO6), ERK2-RSK1 complex (PDB: 2Y4I), ERK5-MKK5-MEKK3 ternary complex (made by superimposition of PDB: 4IC7 and PDB: 2O2V), and PLK1 kinase domain (PDB: 2OU7)-Polo-box domain-Cdc25C complex (PDB: 2OJX).

The relative importance of these factors in the functional specificity of protein kinases has rarely been studied experimentally. For example, in yeast, the high specificity of the mitogen-activated protein kinase kinase (MAPKK) is ensured mainly by the use of docking motifs and scaffolding interactions [50].

### **3.1. Distal docking sites**

A transient physical interaction between protein kinases and their substrates can place them in close proximity and in the correct orientation, creating the opportunity for post-translational modification. These interactions are based on short linear motifs, termed docking sites, that reside in disordered regions of the proteins and only adopt a defined structure upon binding. The utilization of docking sites appears to be a widespread strategy for improving the specificity with which protein kinases phosphorylate defined phosphosites, as evidenced by studies of SR protein-specific kinase-1 (SRPK1) [51], Cbk1 [52], Polo-like kinases [53], and Cdks [54, 55]. Owing to the lack of structural models of the interaction between complete substrates and protein kinases, it is not yet possible to measure the importance of interactions distal to the phosphorylation site in specificity determination. However, by combining protein–protein docking and adaptive biasing force molecular dynamics simulations, Mottin et al. obtained a structural model of the interaction between an active protein kinase (the Cdk5/p25 complex) and a complete substrate, peroxisome proliferator-activated receptor γ (PPARγ). This model suggests that the protein kinase establishes two distal docking sites with the substrate, pinpointing the importance of those contact sites for the proper positioning of the phosphosite in the kinase active site [57].

Mitogen-activated protein kinases (MAPKs) are the prototypical example of the use of docking sites to enhance the specificity of promiscuous active sites, which phosphorylate their substrates at, at most, a weak consensus site (Ser/Thr-Pro). Two types of docking sites have been characterized for MAPKs, called D-sites and F-sites. D-sites interact with a D-recruitment site (DRS), consisting of a negatively charged region and a shallow hydrophobic pocket located on the side opposite the kinase active site. F-sites, on the other hand, bind a hydrophobic docking groove (the F-recruitment site, or FRS). Although all MAPKs have a DRS (hence it is also referred to as the Common Docking site, or CD), the FRS seems to be characteristic only of ERK1/2 and p38α.

Structural studies of the interaction between extracellular signal-regulated kinase 2 (ERK2) and a peptide containing the F-site from Elk-1 suggest that the main effect is to increase the local phosphosite concentration, favoring productive encounters that enhance phosphorylation [58].

Although systematic *in silico* exploration of docking sites could be helpful for finding new putative MAPK substrates, it has been hampered by the low stringency of the motif sequences, which generates a high rate of false positives arising by chance.
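A simple motif scan illustrates why such searches are noisy. The pattern below is a commonly cited rough D-site consensus (one or two basic residues, a short spacer, then a hydrophobic phi-X-phi core); it is deliberately loose, and is not the model used by D-finder or by Zeke and colleagues:

```python
import re

# Rough, illustrative D-site consensus: (R/K){1,2} - spacer(1-6) - phi-X-phi.
D_SITE = re.compile(r"[RK]{1,2}.{1,6}[LIV].[LIV]")

def find_d_sites(sequence):
    """Return (1-based start, matched fragment) pairs for candidate D-sites."""
    return [(m.start() + 1, m.group()) for m in D_SITE.finditer(sequence)]

seq = "MAAKRPGSLNLSLDEQ"   # hypothetical substrate fragment
print(find_d_sites(seq))
```

Because the pattern is so permissive, matches of this kind occur frequently by chance in any proteome, which is exactly the false-positive problem noted in the text.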

ence of individual residues surrounding the modified residue (**Figure 1**).

scaffolding interactions [50].

112 Bioinformatics - Updated Features and Applications

**3.1. Distal docking sites**

the kinase active site [57].

and p38α.

[58].

Scaffold proteins are important for ensuring the encounter of protein kinases with their substrates, as suggested by the large number of these signaling proteins associated with phosphorylation [65], especially in the MAPK pathway, where a recent interactome analysis identified 10 associated scaffold proteins [66]. A canonical example of the use of scaffold proteins is found in cAMP-activated protein kinase (PKA) signaling, where more than 50 A-kinase anchoring proteins (AKAPs) are responsible for associating the protein kinase with its substrates [67].

The first attempt to integrate protein interactions with phosphosite sequence patterns to improve KSR prediction was the development of NetworKIN [16], a two-stage algorithm. In the first stage, artificial neural networks (from NetPhosK) and PSSMs (from Scansite) are used to label a given phosphosite sequence with a kinase or kinase family. In the second stage, contextual information is included by calculating the proximity of the substrate to all kinases (the most likely route connecting them) in a network of functional relationships extracted from the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database [68].

For a group of well-studied kinase families (CDK, PKC, PIKK, and INSR), NetworKIN doubled prediction accuracy (to 64%) over sequence-only methods. However, NetworKIN includes a circular logic that can overestimate accuracy values: the performance assessment was carried out with known phosphosites derived from the literature, and the STRING database also includes information from text mining (such as co-occurrence in abstracts), which therefore already associates many kinases with their substrates. NetworKIN has recently been upgraded into the KinomeXplorer platform [14], which improves prediction accuracy through a new scoring scheme based on a naive Bayes method designed to overcome the bias in the network structure caused by highly studied proteins. Another algorithm based on sequence and contextual information is PhosphoPICK. It integrates information from previously known KSRs, protein–protein interactions, and protein abundance profiles through the cell cycle in a Bayesian network model. The average AUC for the 59 protein kinases evaluated was 0.86. The main determinant of this good performance was the inclusion of protein–protein interaction data, while protein expression throughout the cell cycle made a modest contribution [12].
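The two-stage idea behind NetworKIN can be caricatured as re-weighting a sequence-based motif score by the network proximity between kinase and candidate substrate. The sketch below is a toy illustration only: the network, scores, and combination rule are all hypothetical, whereas the real tool uses NetPhosK/Scansite scores and the STRING network:

```python
import heapq

def shortest_path_cost(graph, src, dst):
    """Dijkstra over edge costs (lower cost = stronger functional association)."""
    dist = {src: 0.0}
    queue = [(0.0, src)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == dst:
            return d
        if d > dist.get(node, float("inf")):
            continue
        for nbr, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(queue, (nd, nbr))
    return float("inf")

# Hypothetical STRING-like network: edge cost = 1 - association confidence
network = {
    "CDK2": [("CCNE1", 0.1), ("RB1", 0.2)],
    "CCNE1": [("CDK2", 0.1), ("TP53", 0.6)],
    "RB1": [("CDK2", 0.2)],
}

def combined_score(motif_score, kinase, substrate):
    # Hypothetical combination rule: damp the motif score by network distance.
    proximity = 1.0 / (1.0 + shortest_path_cost(network, kinase, substrate))
    return motif_score * proximity

# Same motif score, but RB1 sits much closer to CDK2 in the network than TP53
print(combined_score(0.8, "CDK2", "RB1") > combined_score(0.8, "CDK2", "TP53"))  # → True
```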

To our knowledge, no predictor has used only physical protein–protein interaction information, which best captures the notion that kinases phosphorylate proteins in close proximity. Recently, studies carried out by affinity purification coupled with mass spectrometry revealed the structure of many protein complexes in human cells [69, 70], including some specific to protein kinases [71, 72] or to signaling pathways associated with them [73, 74], which can be exploited to improve sequence-only predictors.

Contextual information can also be provided by the localization of the protein kinase to a specific subcellular structure, giving it privileged access to certain substrates. Alexander and colleagues elegantly demonstrated that substrate specificity among the mitotic kinases Aurora A/B, Cdk1, Plk1, and Nek2 relies on mutually exclusive motifs when the protein kinases have overlapping localizations, and on similar motifs when the protein kinases show exclusive distributions [75].

### **4. Structural aspects of phosphosites**

It has traditionally been assumed that the phosphorylation site can be described as a linear sequence, and this assumption underlies almost all predictors developed to date. However, it has recently been shown that Thr253 of α-tubulin is located in a nonlinear motif, comprising residues distant in the primary sequence that are folded into a consensus site phosphorylated by PKC [76]. Durek and collaborators [38] had previously addressed this problem by characterizing the structural (3D) motifs present in phosphorylation sites. Using only the radial distance from the phosphorylated residue and disregarding angular information, they achieved a modest increase in performance over linear-sequence-only predictions [38]. This can be explained if only a limited number of residues are recognized by protein kinases via a structural epitope, or by the low complexity of the model used. More sophisticated tools have since been developed to find patterns in protein 3D structures that could be useful for identifying KSRs based on nonlinear motifs. For example, Amino acid pattern Search for Substructures And Motifs (ASSAM) [77] uses the 3D coordinates of a motif comprising up to 12 amino acids for matching in a structure database such as the Protein Data Bank (PDB). ASSAM represents protein structures as a graph in which each node consists of a vector between two pseudoatoms representing the side chain of an individual amino acid, while the edges are the distances between the corresponding vectors, providing a more exhaustive representation than that used by Durek and collaborators [38].
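The core of such distance-based matching can be sketched very simply: a structural motif is summarized by the pairwise distances between side-chain pseudoatoms, and a candidate residue set matches when all corresponding distances agree within a tolerance. This is only the distance-comparison idea, not ASSAM's graph algorithm, and the coordinates below are invented:

```python
import itertools, math

def pairwise_distances(coords):
    """All pairwise Euclidean distances between pseudoatom coordinates."""
    return [math.dist(a, b) for a, b in itertools.combinations(coords, 2)]

def matches(motif_coords, candidate_coords, tol=0.5):
    """True if every corresponding pairwise distance agrees within tol (angstroms)."""
    return all(abs(m - c) <= tol
               for m, c in zip(pairwise_distances(motif_coords),
                               pairwise_distances(candidate_coords)))

# Hypothetical pseudoatom coordinates (angstroms) for a 3-residue motif
motif = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (1.9, 3.3, 0.0)]
hit = [(10.0, 10.0, 10.0), (13.9, 10.0, 10.0), (11.9, 13.2, 10.0)]   # translated, similar geometry
miss = [(0.0, 0.0, 0.0), (8.0, 0.0, 0.0), (4.0, 7.0, 0.0)]          # too spread out

print(matches(motif, hit), matches(motif, miss))   # → True False
```

Because only internal distances are compared, the match is invariant to rotation and translation of the candidate, which is the property that makes this representation useful for structure-database searches.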

To date only one predictor, denominated MODPROPEP [78], is based exclusively on structural features of the interaction between the protein kinase and the phosphorylation site. MODPROPEP models putative substrate peptides into the active site of protein kinases, using as templates 49 kinase–peptide complexes with structures available in the Protein Data Bank (PDB). The scoring function is based on the binding energy of the peptides with the kinase, calculated using a statistics-based residue pair potential [79]. Owing to low accuracy for some kinase families, a new scoring scheme based on molecular mechanics Poisson-Boltzmann Surface Area (MM-PBSA) binding energy values was generated. The performance of this improved scoring system was similar to that of sequence-only predictors such as GPS, PPSP, Scansite, and NetPhosK, but MODPROPEP does not require a training set of known sites, which is a remarkable advantage for the prediction of previously uncharacterized protein kinases.

### **5. The kinase side: determinants of specificity**

network structure caused by highly studied proteins. Another algorithm based on sequence and contextual information is PhosphoPICK. It integrates information from previously known KSRs, protein–protein interactions and protein abundance profiles through the cell cycle in a Bayesian network model. The average AUC for the 59 protein kinases evaluated was 0.86. The main determinant of this good performance was the inclusion of data from protein–protein interactions, while the protein expression throughout the cell cycle had a modest contribution

To our knowledge, no predictor used only physical protein–protein interaction information, which better translates the concept that kinases phosphorylate proteins that are in close proximity. Recently, studies carried out by affinity purification coupled with mass spectrom‐ etry revealed the structure of many protein complexes in human cells [69, 70], including some specific for protein kinases [71, 72] or signaling pathways associated with them [73, 74], which

Contextual information can also be provided by the localization of a protein kinase to a specific subcellular structure, which gives it privileged access to certain substrates. Alexander and colleagues elegantly demonstrated that substrate specificity among the mitotic kinases Aurora A/B, Cdk1, Plk1 and Nek2 is based on mutually exclusive motifs when the protein kinases share an overlapping localization, and on similar motifs when the protein kinases show an exclusive distribution [75].

### **4. Structural aspects of phosphosites**

It has traditionally been assumed that a phosphorylation site can be described as a linear sequence, and this assumption underlies almost all predictors developed to date. However, it was recently shown that Thr253 of α-tubulin is located in a nonlinear motif, comprising residues that are distant in the primary sequence but fold together into a consensus site phosphorylated by PKC [76]. Durek and collaborators [38] had previously addressed this problem by characterizing the structural (3D) motifs present in phosphorylation sites. Using only the radial distance from the phosphorylated residue, and disregarding angular information, they achieved a modest increase in performance over purely linear sequence predictions [38]. This can be explained if only a limited number of residues are recognized by protein kinases through a structural epitope, or by the low complexity of the model used. More sophisticated tools have since been developed to find patterns in protein 3D structures that could be useful to identify KSRs based on nonlinear motifs. For example, Amino acid pattern Search for Substructures And Motifs (ASSAM) [77] uses the 3D coordinates of a motif comprising up to 12 amino acids to search a structure database such as the Protein Data Bank (PDB). ASSAM represents protein structures as a graph in which each node is a vector between two pseudoatoms representing the side chain of an individual amino acid, while the edges are the distances between the corresponding vectors, providing a more exhaustive representation than that used by Durek and collaborators [38]. Such structural information can be exploited to improve sequence-only based predictors.
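The radial-distance representation used by Durek and collaborators can be sketched in a few lines. This is a minimal illustration, assuming a structure is given as a list of Cα coordinates; the function name and distance cutoff are hypothetical:

```python
import math

def radial_neighborhood(coords, phospho_idx, radius=8.0):
    """Indices of residues whose C-alpha atoms lie within `radius`
    angstroms of the phosphorylated residue, ignoring angular
    information (as in a Durek-style 3D motif)."""
    center = coords[phospho_idx]
    out = []
    for i, point in enumerate(coords):
        if i != phospho_idx and math.dist(point, center) <= radius:
            out.append(i)
    return out

# Toy structure: five residues along a line, then one far away
coords = [(0.0, 0, 0), (4.0, 0, 0), (8.0, 0, 0), (12.0, 0, 0), (40.0, 0, 0)]
print(radial_neighborhood(coords, phospho_idx=1))  # [0, 2, 3]
```

A real implementation would parse PDB coordinates and could also include side-chain pseudoatoms, as ASSAM does.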

The combinations of residues in the kinase domain that confer substrate specificity are known as determinants of specificity (DoS). For example, discrimination between Ser and Thr by eukaryotic protein kinases appears to depend on the nature of a single residue immediately adjacent to the DFG motif (DFG+1). Whereas most Thr-specific kinases have a β-branched aliphatic residue at this position (Ile, Val or Thr), Ser-specific kinases have large hydrophobic residues (predominantly Leu, Phe, and Met). All non-selective kinases have Leu or Ser as the DFG+1 residue. Mutation of this single residue is enough to switch the amino acid preference of multiple protein kinases [80].
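The DFG+1 rule above is a trend rather than a strict lookup table, but as an illustration only, it could be encoded like this (the residue groupings follow the paragraph above; real kinases have exceptions):

```python
def dfg_plus_one_preference(residue):
    """Toy prediction of Ser/Thr preference from the DFG+1 residue
    (one-letter amino acid code), following the rule described in the text."""
    r = residue.upper()
    if r in {"I", "V", "T"}:   # beta-branched aliphatic residues
        return "Thr-specific"
    if r in {"F", "M"}:        # large hydrophobic residues
        return "Ser-specific"
    if r == "L":               # Leu occurs in both Ser-specific and non-selective kinases
        return "Ser-specific or non-selective"
    if r == "S":
        return "non-selective"
    return "unknown"

print(dfg_plus_one_preference("V"))  # Thr-specific
```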

The idea of building predictors based on the characteristics of the kinase domain can overcome the problem of building a positive data set of phosphorylated sequences, an aspect especially complicated for poorly characterized protein kinases, and ideally would serve to perform explorations at whole-kinome level. The first DoS-based predictor was Predikin, which allows automatic prediction of peptide substrates using only the amino acid sequence of the protein kinase [33]. This algorithm has been continuously improved [34], achieving the most accurate prediction of experimentally obtained position weight matrices in the category of Domain Recognition Peptide/Kinase protein of Dialogue for Reverse Engineering Assessments and Methods (DREAM4) challenge [32].

Safaei and colleagues [30], on the other hand, used the primary sequences of the kinase catalytic domains to generate PSSMs describing substrate specificity for 492 protein kinases. For this purpose, residues acting as DoS were identified using multiple alignments of the kinase domain and the correlation between these residues and those present in the consensus sequences derived from known KSRs [30]. Although the performance of this strategy was similar to that of NetPhorest, its advantage is that it does not require information about substrates.
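A position-specific scoring matrix of the kind mentioned above can be built from a set of aligned substrate peptides. This minimal sketch (not the authors' implementation) uses log-odds scores against a uniform background:

```python
import math
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(peptides, pseudocount=1.0):
    """Log-odds PSSM from equal-length aligned peptides,
    scored against a uniform 1/20 background frequency."""
    length = len(peptides[0])
    pssm = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        total = len(peptides) + pseudocount * len(AA)
        col = {a: math.log(((counts[a] + pseudocount) / total) / (1 / len(AA)))
               for a in AA}
        pssm.append(col)
    return pssm

def score(pssm, peptide):
    """Sum of per-position log-odds scores for a candidate peptide."""
    return sum(col[a] for col, a in zip(pssm, peptide))

# Toy basophilic motif: Arg upstream of the phospho-Ser (position 3)
peps = ["RKASLAA", "RRASVAG", "RKPSLGA", "RRASLTA"]
m = build_pssm(peps)
print(score(m, "RRASLAA") > score(m, "GGGAGGG"))  # True: first peptide matches the motif
```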

A computational methodology called KINspect (based on learning classifier systems) was recently developed to predict DoS through an iterative process: randomly generated specificity masks are progressively improved in their predictive ability through mutation and crossover [81]. This sophisticated approach, once transferred to a predictor of KSRs, may surpass previous advances.
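The mutation-and-crossover search over specificity masks can be illustrated with a toy genetic algorithm. This is a sketch in the spirit of the procedure, not KINspect's actual algorithm; the fitness function and all parameters are invented for illustration:

```python
import random

random.seed(0)

def evolve_masks(fitness, length=20, pop=30, gens=40, mut=0.05):
    """Toy mutation-and-crossover search over binary specificity masks.
    Keeps the fitter half each generation, fills the rest with
    single-point-crossover children subject to bit-flip mutation."""
    population = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop // 2]
        children = []
        while len(children) < pop - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, length)          # single-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < mut) for bit in child]  # mutation
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

# Toy fitness: agreement with a hidden "true" DoS mask
true_mask = [1 if i % 3 == 0 else 0 for i in range(20)]
best = evolve_masks(lambda m: sum(x == y for x, y in zip(m, true_mask)))
print(sum(x == y for x, y in zip(best, true_mask)))  # close to 20
```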

### **6. Conclusions**

*In silico* prediction of KSRs may become the most important contribution of bioinformatics toward the understanding of cell signaling. To date, there are no high-throughput experimental techniques that allow pairing the thousands of phosphorylation sites identified by mass spectrometry with the protein kinases that catalyze such reactions. Recent studies provide a deeper characterization of the determinants of specificity of protein kinases for their substrates, enabling more realistic modeling of KSRs. These models reduce the rate of false positives and support the construction of feasible regulatory networks.

### **Author details**

Daniel A. Bórquez1 and Christian González-Billault2,3\*

\*Address all correspondence to: chrgonza@uchile.cl

1 Biomedical Research Center, School of Medicine, Diego Portales University, Santiago, Chile

2 Department of Biology, Faculty of Sciences, University of Chile, Santiago, Chile

3 FONDAP Geroscience Center, Brain Health and Metabolism, Santiago, Chile

### **References**

[1] Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S. The protein kinase complement of the human genome. Science. 2002;298:1912–34. DOI:10.1126/science.1075762.

[2] Sharma K, D'Souza RC, Tyanova S, Schaab C, Wisniewski JR, Cox J et al. Ultradeep human phosphoproteome reveals a distinct regulatory nature of Tyr and Ser/Thr-based signaling. Cell Reports. 2014;8:1583–94. DOI:10.1016/j.celrep.2014.07.036.

[3] Hornbeck PV, Zhang B, Murray B, Kornhauser JM, Latham V, Skrzypek E. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Research. 2015;43:D512–20. DOI:10.1093/nar/gku1267.

[4] Dinkel H, Chica C, Via A, Gould CM, Jensen LJ, Gibson TJ et al. Phospho.ELM: a database of phosphorylation sites—update 2011. Nucleic Acids Research. 2011;39:D261–7. DOI:10.1093/nar/gkq1104.


[16] Linding R, Jensen LJ, Ostheimer GJ, van Vugt MA, Jorgensen C, Miron IM et al. Systematic discovery of in vivo phosphorylation networks. Cell. 2007;129:1415–26. DOI:10.1016/j.cell.2007.05.052.

[17] Huang KY, Wu HY, Chen YJ, Lu CT, Su MG, Hsieh YC et al. RegPhos 2.0: an updated resource to explore protein kinase–substrate phosphorylation networks in mammals. Database: the Journal of Biological Databases and Curation. 2014;2014:bau034. DOI:10.1093/database/bau034.

[18] Lee TY, Bo-Kai Hsu J, Chang WC, Huang HD. RegPhos: a system to explore the protein kinase–substrate phosphorylation network in humans. Nucleic Acids Research. 2011;39:D777–87. DOI:10.1093/nar/gkq970.

[19] Suo SB, Qiu JD, Shi SP, Chen X, Liang RP. PSEA: Kinase-specific prediction and analysis of human phosphorylation substrates. Scientific Reports. 2014;4:4524. DOI:10.1038/srep04524.

[20] Damle NP, Mohanty D. Deciphering kinase–substrate relationships by analysis of domain-specific phosphorylation network. Bioinformatics. 2014;30:1730–8. DOI:10.1093/bioinformatics/btu112.

[21] Xu X, Li A, Zou L, Shen Y, Fan W, Wang M. Improving the performance of protein kinase identification via high dimensional protein–protein interactions and substrate structure data. Molecular Biosystems. 2014;10:694–702. DOI:10.1039/c3mb70462a.

[22] Fan W, Xu X, Shen Y, Feng H, Li A, Wang M. Prediction of protein kinase-specific phosphorylation sites in hierarchical structure using functional information and random forest. Amino Acids. 2014;46:1069–78. DOI:10.1007/s00726-014-1669-3.

[23] Zou L, Wang M, Shen Y, Liao J, Li A, Wang M. PKIS: computational identification of protein kinases for experimentally discovered protein phosphorylation sites. BMC Bioinformatics. 2013;14:247. DOI:10.1186/1471-2105-14-247.

[24] Song C, Ye M, Liu Z, Cheng H, Jiang X, Han G et al. Systematic analysis of protein phosphorylation networks from phosphoproteomic data. Molecular & Cellular Proteomics: MCP. 2012;11:1070–83. DOI:10.1074/mcp.M111.012625.

[25] Xue Y, Liu Z, Cao J, Ma Q, Gao X, Wang Q et al. GPS 2.1: enhanced prediction of kinase-specific phosphorylation sites with an algorithm of motif length selection. Protein Engineering, Design & Selection: PEDS. 2011;24:255–60. DOI:10.1093/protein/gzq094.

[26] Xue Y, Ren J, Gao X, Jin C, Wen L, Yao X. GPS 2.0, a tool to predict kinase-specific phosphorylation sites in hierarchy. Molecular & Cellular Proteomics: MCP. 2008;7:1598–608. DOI:10.1074/mcp.M700574-MCP200.

[27] Xue Y, Zhou F, Zhu M, Ahmed K, Chen G, Yao X. GPS: a comprehensive www server for phosphorylation sites prediction. Nucleic Acids Research. 2005;33:W184–7. DOI:10.1093/nar/gki393.

[28] Zhou FF, Xue Y, Chen GL, Yao X. GPS: a novel group-based phosphorylation predicting and scoring method. Biochemical and Biophysical Research Communications. 2004;325:1443–8. DOI:10.1016/j.bbrc.2004.11.001.


[40] Blom N, Sicheritz-Ponten T, Gupta R, Gammeltoft S, Brunak S. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics. 2004;4:1633–49. DOI:10.1002/pmic.200300771.

[41] Dang TH, Van Leemput K, Verschoren A, Laukens K. Prediction of kinase-specific phosphorylation sites using conditional random fields. Bioinformatics. 2008;24:2857–64. DOI:10.1093/bioinformatics/btn546.

[42] Wan J, Kang S, Tang C, Yan J, Ren Y, Liu J et al. Meta-prediction of phosphorylation sites with weighted voting and restricted grid search parameter selection. Nucleic Acids Research. 2008;36:e22. DOI:10.1093/nar/gkm848.

[43] Li T, Li F, Zhang X. Prediction of kinase-specific phosphorylation sites with sequence features by a log-odds ratio approach. Proteins. 2008;70:404–14. DOI:10.1002/prot.21563.

[44] Wong YH, Lee TY, Liang HK, Huang CM, Wang TY, Yang YH et al. KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns. Nucleic Acids Research. 2007;35:W588–94. DOI:10.1093/nar/gkm322.

[45] Huang HD, Lee TY, Tzeng SW, Horng JT. KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Research. 2005;33:W226–9. DOI:10.1093/nar/gki471.

[46] Xue Y, Li A, Wang L, Feng H, Yao X. PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory. BMC Bioinformatics. 2006;7:163. DOI:10.1186/1471-2105-7-163.

[47] Kim JH, Lee J, Oh B, Kimm K, Koh I. Prediction of phosphorylation sites using SVMs. Bioinformatics. 2004;20:3179–84. DOI:10.1093/bioinformatics/bth382.

[48] Neuberger G, Schneider G, Eisenhaber F. pkaPS: prediction of protein kinase A phosphorylation sites with the simplified kinase–substrate binding model. Biology Direct. 2007;2:1. DOI:10.1186/1745-6150-2-1.

[49] Joughin BA, Liu C, Lauffenburger DA, Hogue CW, Yaffe MB. Protein kinases display minimal interpositional dependence on substrate sequence: potential implications for the evolution of signalling networks. Philosophical Transactions of the Royal Society of London Series B, Biological Sciences. 2012;367:2574–83. DOI:10.1098/rstb.2012.0010.

[50] Won AP, Garbarino JE, Lim WA. Recruitment interactions can override catalytic interactions in determining the functional identity of a protein kinase. Proceedings of the National Academy of Sciences of the United States of America. 2011;108:9809–14. DOI:10.1073/pnas.1016337108.

[51] Ngo JC, Chakrabarti S, Ding JH, Velazquez-Dones A, Nolen B, Aubol BE et al. Interplay between SRPK and Clk/Sty kinases in phosphorylation of the splicing factor ASF/SF2 is regulated by a docking motif in ASF/SF2. Molecular Cell. 2005;20:77–89. DOI:10.1016/j.molcel.2005.08.025.

[52] Gogl G, Schneider KD, Yeh BJ, Alam N, Nguyen Ba AN, Moses AM et al. The structure of an NDR/LATS kinase–Mob complex reveals a novel kinase-coactivator system and substrate docking mechanism. PLoS Biology. 2015;13:e1002146. DOI:10.1371/journal.pbio.1002146.


[62] Garai A, Zeke A, Gogl G, Toro I, Fordos F, Blankenburg H et al. Specificity of linear motifs that bind to a common mitogen-activated protein kinase docking groove. Science Signaling. 2012;5:ra74. DOI:10.1126/scisignal.2003004.

[63] Zeke A, Bastys T, Alexa A, Garai A, Meszaros B, Kirsch K et al. Systematic discovery of linear binding motifs targeting an ancient protein interaction surface on MAP kinases. Molecular Systems Biology. 2015;11:837. DOI:10.15252/msb.20156269.

[64] Bibi N, Parveen Z, Rashid S. Identification of potential Plk1 targets in a cell-cycle specific proteome through structural dynamics of kinase and Polo box-mediated interactions. PloS One. 2013;8:e70843. DOI:10.1371/journal.pone.0070843.

[65] Hu J, Neiswinger J, Zhang J, Zhu H, Qian J. Systematic prediction of scaffold proteins reveals new design principles in scaffold-mediated signal transduction. PLoS Computational Biology. 2015;11:e1004508. DOI:10.1371/journal.pcbi.1004508.

[66] Bandyopadhyay S, Chiang CY, Srivastava J, Gersten M, White S, Bell R et al. A human MAP kinase interactome. Nature Methods. 2010;7:801–5. DOI:10.1038/nmeth.1506.

[67] Calejo AI, Tasken K. Targeting protein–protein interactions in complexes organized by A kinase anchoring proteins. Frontiers in Pharmacology. 2015;6:192. DOI:10.3389/fphar.2015.00192.

[68] Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Research. 2015;43:D447–52. DOI:10.1093/nar/gku1003.

[69] Hein MY, Hubner NC, Poser I, Cox J, Nagaraj N, Toyoda Y et al. A human interactome in three quantitative dimensions organized by stoichiometries and abundances. Cell. 2015;163:712–23. DOI:10.1016/j.cell.2015.09.053.

[70] Huttlin EL, Ting L, Bruckner RJ, Gebreab F, Gygi MP, Szpyt J et al. The BioPlex Network: a systematic exploration of the human interactome. Cell. 2015;162:425–40. DOI:10.1016/j.cell.2015.06.043.

[71] Varjosalo M, Keskitalo S, Van Drogen A, Nurkkala H, Vichalkovski A, Aebersold R et al. The protein interaction landscape of the human CMGC kinase group. Cell Reports. 2013;3:1306–20. DOI:10.1016/j.celrep.2013.03.027.

[72] Varjosalo M, Sacco R, Stukalov A, van Drogen A, Planyavsky M, Hauri S et al. Interlaboratory reproducibility of large-scale human protein-complex analysis by standardized AP-MS. Nature Methods. 2013;10:307–14. DOI:10.1038/nmeth.2400.

[73] Couzens AL, Knight JD, Kean MJ, Teo G, Weiss A, Dunham WH et al. Protein interaction network of the mammalian Hippo pathway reveals mechanisms of kinase-phosphatase interactions. Science Signaling. 2013;6:rs15. DOI:10.1126/scisignal.2004712.

[74] Hauri S, Wepf A, van Drogen A, Varjosalo M, Tapon N, Aebersold R et al. Interaction proteome of human Hippo signaling: modular control of the co-activator YAP1. Molecular Systems Biology. 2013;9:713. DOI:10.1002/msb.201304750.


### **Bioinformatics for RNA‐Seq Data Analysis**

Shanrong Zhao, Baohong Zhang, Ying Zhang, William Gordon, Sarah Du, Theresa Paradis, Michael Vincent and David von Schack

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/63267

#### **Abstract**

While RNA sequencing (RNA‐seq) has become increasingly popular for transcriptome profiling, the analysis of the massive amount of data generated by large‐scale RNA‐seq still remains a challenge. RNA‐seq data analyses typically consist of (1) accurate mapping of millions of short sequencing reads to a reference genome, including the identification of splicing events; (2) quantifying expression levels of genes, transcripts, and exons; (3) differential analysis of gene expression among different biological conditions; and (4) biological interpretation of differentially expressed genes. Although multiple algorithms pertinent to these basic analyses have been developed, a variety of questions remain unresolved. In this chapter, we review the main tools and algorithms currently available for RNA‐seq data analyses; our goal is to help RNA‐seq data analysts make an informed choice of tools in practical RNA‐seq data analysis. Because RNA‐seq technology is evolving rapidly, newer applications are also briefly introduced, including stranded RNA‐seq, targeted RNA‐seq, and single‐cell RNA‐seq.

**Keywords:** data analysis, gene quantification, pipeline, RNA‐seq, workflow

### **1. Introduction**

In recent years, RNA sequencing (RNA‐seq) has emerged as a powerful technology for transcriptome profiling [1–4]. Compared with microarrays, it not only avoids some of the technical limitations of that approach, including varying probe performance, nonspecific hybridization, and dynamic range issues, but can also detect alternative splicing isoforms and subtle changes in splicing under different conditions. The overview of current RNA‐seq approaches using shotgun sequencing technologies such as Illumina, and the corresponding data analysis workflow, is summarized in **Figure 1**. Polyadenylated (Poly‐A) RNA transcripts (for so‐called mRNA‐seq) are enriched with oligo (dT) primers and then fragmented. After size selection, millions or even billions of short sequence reads are generated from a randomly fragmented cDNA library. For most RNA‐seq studies, the data analyses consist of the following key steps [5, 6]: (1) quality check and preprocessing of raw sequence reads, (2) mapping reads to a reference genome or transcriptome, (3) counting reads mapped to individual genes or transcripts, (4) identification of differentially expressed (DE) genes between different biological conditions, and (5) biological interpretation of DE genes and functional enrichment analysis. Despite the fact that a large number of algorithms [7] have been developed for RNA‐seq data analysis in recent years, there are still many open questions for accurate read mapping, gene quantification, and data normalization.

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

**Figure 1.** Overview of mRNA‐seq laboratory flowchart and data analysis pipeline.

In complex mammalian genomes, both DNA strands encode genes. As a result, if two genes transcribed from opposite strands overlap, nonstranded RNA‐seq cannot tell the true origin of a read that falls into the overlapping region. Most recently, stranded RNA‐seq protocols have been developed for more accurate gene quantification [8, 9]. RNA‐seq is a powerful tool for profiling the entire transcriptome. However, this approach can be inefficient when only a small subset of genes is of interest, for example, genes involved in particular pathways or associated with specific diseases. To meet this demand, targeted RNA‐seq technology has been developed. Traditionally, gene expression measurements were performed on "bulk" samples containing populations of thousands or millions of cells. Recent advances in genomic technologies have made it possible to measure gene expression in individual cells, so cellular properties that were previously masked in "bulk" measurements can now be observed directly [10–12]. Single‐cell RNA‐seq (scRNA‐seq) introduces additional challenges for data analysis due to technical noise, including coverage nonuniformity, data sparsity, and amplification biases. In this chapter, we cover topics related to RNA‐seq technology and data analyses.

### **2. RNA‐seq versus microarray for gene expression profiling analysis**

Microarrays and RNA‐seq have been the two technologies of choice for large‐scale studies of gene expression, and side‐by‐side comparisons of RNA‐seq and hybridization‐based arrays have been performed [12–15]. Malone et al. [16] compared the ability of RNA‐seq to identify differentially expressed genes with that of existing array technologies and found that RNA‐seq data are highly reproducible, with relatively little technical variation; the DE genes identified from RNA‐seq overlapped substantially with those identified by microarray. Fu et al. [17] designed a study in which they evaluated the accuracy of both microarrays and RNA‐seq for mRNA quantification by using protein expression measurements as ground truth. Assessing the relative accuracy of the two transcriptome quantification approaches against absolute transcript level measurements, they found that RNA‐seq provides better estimates of transcript expression.

Previously, we performed a side‐by‐side comparison of RNA‐seq and microarrays in investigating T‐cell activation [18]. A comparison of data sets derived from the RNA‐seq and Affymetrix platforms using the same set of samples revealed a very high concordance between the expression profiles generated by the two platforms. At the same time, it was also demonstrated that RNA‐seq is superior in differentiating biologically critical isoforms, detecting low‐abundance transcripts, and allowing the identification of genetic variants. Analysis of the two data sets also showed the benefit of avoiding technical issues inherent to microarray probe performance, such as cross‐hybridization, nonspecific hybridization, and the limited detection range of individual probes. In addition, RNA‐seq has a much broader dynamic range than microarray technologies, which allows for the detection of more differentially expressed genes with higher fold‐changes. Thus, RNA‐seq delivers both less biased and previously unknown information about the transcriptome. Because RNA‐seq does not rely on a predesigned complementary sequence detection probe, it is not limited to the interrogation of selected probes on an array and can also be applied to species for which the whole reference genome is not yet assembled.

RNA‐seq allows for the detection of novel transcript species in well‐studied organisms, such as unique transcripts in certain tissues or in rare cell types, and has been instrumental in cataloging the diversity of novel transcript species, including long noncoding RNA, miRNA, siRNA, and other small RNA classes [19]. Additionally, RNA‐seq technology has proven to be an invaluable tool for deciphering the extensive alternative splicing of the transcriptome [20, 21]; alternative splicing creates two to potentially hundreds of variants in more than 90% of human genes. Furthermore, RNA‐seq can identify allele‐specific expression and gene fusion events [22, 23].

### **3. RNA‐seq library preparation and sequencing platforms**

### **3.1. General considerations for RNA‐seq**

The quality and quantity of the starting RNA material are likely the most important aspects to consider when deciding on the methods to generate RNA‐seq libraries. For high‐quality samples with mostly intact RNA, a wide variety of sequencing library preparation methods is available. For lower quality samples with partially or highly degraded RNA, there are considerably fewer methods to choose from. The amount of RNA may limit the choice of library preparation as well, since the majority of standard RNA‐seq kits require a minimum of 10–100 ng of the total RNA. If the RNA amount is below this threshold it will require the use of a more specialized kit and/or the sample may require some form of amplification.

To obtain reliable and reproducible RNA‐seq data, RNA quality is of paramount importance. Because of the inherent instability of the RNA molecule, the quality of RNA from samples collected in clinical settings and field studies is often impacted by tissue necrosis, which quickly degrades RNA if the sample is not frozen or chemically preserved within minutes after surgery. For assessing RNA quality, most labs utilize an electrophoretic‐based system, such as the Agilent BioAnalyzer, the Agilent TapeStation, or the Advanced Analytics Fragment Analyzer. All of these instruments produce an RNA integrity score. On the BioAnalyzer, the score (RIN) ranges from 1 to 10 (10 being perfect), and most labs would consider a RIN > 6 acceptable for standard RNA‐seq methods. For degraded samples where the RIN is below 6, it is beneficial to calculate the DV200 score (the percentage of RNA fragments larger than 200 nucleotides). Illumina has shown this to be an important metric for successful library preparation with kits designed for degraded RNA.
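DV200 can be computed directly from a fragment-size distribution such as an electropherogram trace exported by the instruments above. A minimal sketch (the function name and input format are assumptions):

```python
def dv200(sizes, intensities):
    """Percent of total RNA signal in fragments longer than 200 nt.

    `sizes` holds fragment lengths (nt) and `intensities` the
    corresponding signal from an electropherogram trace."""
    total = sum(intensities)
    above = sum(i for s, i in zip(sizes, intensities) if s > 200)
    return 100.0 * above / total

# Toy trace: half of the signal sits below 200 nt
print(dv200([100, 150, 300, 500], [25, 25, 25, 25]))  # 50.0
```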

After quality and quantity of the RNA samples have been addressed, one must then choose whether to profile the total RNA space or the mRNA space only. This choice determines the main split between most RNA‐seq library kit types, those which target the removal of ribosomal RNA (rRNA) versus those which utilize Poly‐T beads to isolate the Poly‐A tail of intact mRNAs. This choice will affect the downstream sequencing, influencing the depth to be targeted per sample, with rRNA depletion libraries requiring significantly deeper sequencing than Poly‐A‐based libraries in order to generate sufficient reads to capture all of the different RNA species in these samples. Additionally, if you are dealing with degraded RNA samples, you will not be able to utilize the Poly‐A method, but the rRNA depletion methodologies will work for these samples. However, if you still want to focus only on the mRNAs, another option would be the more recent exon capture kits such as Illuminas RNA Access, which depend on hybridization with probes designed against known exons.

When considering the sequencing depth, a few points should be taken into account. mRNA libraries can be sequenced at lower depth than total RNA libraries, as they have less diversity and are focused only on mature transcripts. In general, only 15–30% of a total RNA prep is mRNA; thus, to get the same level of read depth for the mRNAs, total RNA libraries must be sequenced roughly 3–10× as deep as an mRNA‐only library. Additionally, whether the goal of the experiment is DE analysis, isoform discovery, or novel RNA discovery will affect the choice of read depth, with the latter requiring more reads. In general, targeting 30–40 million paired reads for an mRNA‐seq and 70–100 million for a total RNA‐seq is a good starting recommendation.
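The 3–10× rule of thumb above follows directly from the mRNA fraction; the arithmetic can be sketched as follows (the function name is illustrative):

```python
def total_rna_depth(target_mrna_reads, mrna_fraction):
    """Reads needed from a total RNA library so that its mRNA-derived
    reads match a target mRNA-seq depth (rule-of-thumb arithmetic)."""
    return target_mrna_reads / mrna_fraction

# To match 30M mRNA reads when 15-30% of the library is mRNA:
print(round(total_rna_depth(30e6, 0.30) / 1e6))  # 100 (million reads)
print(round(total_rna_depth(30e6, 0.15) / 1e6))  # 200 (million reads)
```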

### **3.2. From RNA sample to raw sequence reads**


The most common workflow for RNA‐seq is stranded mRNA‐seq. To generate a stranded mRNA‐seq library, RNA from different sources (blood, tissue, and cell lines) should be purified using a consistent methodology (kits with columns or manual extraction with TRIzol are both acceptable). The RNA should be treated either in solution or on‐column (depending on the extraction choice) with DNase to remove traces of genomic DNA and to prevent contamination of RNA‐seq libraries by DNA. DNA‐free RNA should then be assessed for both quality and quantity. RNA of sufficient quality can then be passed over Oligo‐dT beads to capture mRNAs, removing the non‐Poly‐A RNA species. The captured mRNA is then fragmented enzymatically or mechanically through ultrasonic shearing (with devices such as a Covaris ultrasonicator), followed by a two‐step conversion to cDNA using first‐strand synthesis and then second‐strand synthesis (during which the cDNA is marked to retain the strand information). After cDNA conversion, the ends are repaired, making them amenable to adapter ligation. Indexing of the libraries can be applied at this point, allowing for sample pooling before sequencing, and a final enrichment of the indexed cDNA fragments by PCR (Illumina recommends 10–15 cycles, depending on the RNA input) is performed to generate the final RNA‐seq library.

After the libraries are complete, their quality is assessed by an electrophoretic assay. If a peak at the size of dimerized adapters (80–100 bp) is observed, the libraries should be repurified before sequencing, because adapter dimers cluster very efficiently and many reads can be lost if they are not fully removed. The sequencing‐ready libraries are then quantified, typically by quantitative PCR, to allow for equimolar pooling; methods such as the KAPA Library Quantification kits are recommended for best results. After quantification, libraries with different indices are pooled. The number of samples in the pool will vary depending on the read depth desired and the sequencer used. Sequencing is then performed on a final diluted sample (following the manufacturer's recommendations). After the run is complete, the raw reads can be converted to FASTQ files (for Illumina sequencing, Illumina provides a tool for this step called bcl2fastq) and passed on for QC, alignment, and analysis.
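As a rough illustration of equimolar pooling, the hypothetical helper below scales pipetting volumes so that each indexed library contributes equal moles; the concentrations are made‑up qPCR readings, and real pooling should follow the quantification kit and sequencer documentation.

```python
def equimolar_volumes(conc_nM, base_uL=10.0):
    """Pipetting volumes (uL) so each indexed library contributes equal
    moles to the pool. The most dilute library is taken at base_uL and
    every other library is scaled down proportionally.

    conc_nM: library name -> qPCR concentration in nM (made-up values).
    """
    target = min(conc_nM.values()) * base_uL   # nM x uL, equal for all
    return {name: round(target / c, 2) for name, c in conc_nM.items()}

libs = {"S1": 4.0, "S2": 8.0, "S3": 2.0}       # hypothetical nM readings
print(equimolar_volumes(libs))                 # S3 is most dilute: 10 uL
```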

### **4. Algorithms for RNA‐seq data analysis**

Millions, or even billions, of short reads are the starting point of RNA‐seq computational data analyses [5, 6]. First, reads are QC checked and then mapped to a reference genome or transcriptome. The mapped reads for each sample are subsequently counted at the gene, transcript, or exon level to assess the abundance of each category, depending on the experimental purpose. The summarized data are then assessed by statistical models to identify differentially expressed genes. Finally, pathway or network level analyses are performed to gain biological insight through systems biology approaches.

### **4.1. Quality check and preprocessing of raw reads**

Poor‐quality read data can arise from problems in the library preparation or from the sequencing itself. Additionally, PCR artifacts, untrimmed adapter sequences, sequence‐specific bias, and other possible contaminants can also lead to poor data quality. The presence of poor‐quality or technical sequences can affect the downstream analysis and data interpretation, and thus give inaccurate results. To assess the quality of raw sequence data, several tools such as PRINSEQ [24] and FastQC [25] have been developed. FastQC aims to provide a simple way to run quality control checks on raw sequence data. It provides a modular set of analyses that give a quick impression of whether the raw sequencing data have any problems that one should be aware of before doing any further analysis. Once the data are checked for quality, they should be processed to remove reads with low‐quality bases, adapter sequences, and other contaminating sequences. Tools such as Cutadapt [26] and Trimmomatic [27], which trim adapter or other contaminating sequences based upon user‐provided parameters, can be used for these operations. After the aforementioned steps, the sequencing data are ready for downstream analysis.
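As a toy illustration of what trimmers such as Cutadapt and Trimmomatic do (in a highly simplified form, ignoring error‑tolerant adapter matching and sliding‑window quality logic; the function and inputs are hypothetical), consider:

```python
def trim_read(seq, quals, adapter, min_q=20):
    """Clip an adapter suffix, then trim low-quality bases from the
    3' end. quals: per-base Phred scores, same length as seq.
    """
    # 1) adapter removal: the longest adapter prefix (>= 3 bases)
    #    found at the 3' end of the read is clipped off
    for k in range(len(adapter), 2, -1):
        if seq.endswith(adapter[:k]):
            seq, quals = seq[:-k], quals[:-k]
            break
    # 2) quality trimming from the 3' end
    while quals and quals[-1] < min_q:
        seq, quals = seq[:-1], quals[:-1]
    return seq

read = "ACGTACGTAGATCGGA"                        # ends in an adapter prefix
scores = [30] * 12 + [12, 11, 10, 9]
print(trim_read(read, scores, "AGATCGGAAGAGC"))  # -> ACGTACGT
```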

### **4.2. Read mapping**

Short sequence reads generated by sequencers must first be mapped or aligned to a reference transcriptome or genome assembly to discover their true locations (origins) with respect to that reference. A large number of read mapping algorithms have been developed in recent years, including TopHat2 [28], STAR [29, 30], GSNAP [31], OSA [32], and MapSplice [33]. To assess the performance of current mapping software, Engström et al. [34] compared 26 mapping protocols based on 11 programs and pipelines and found there were major performance differences among different methods on numerous benchmarks, including basewise accuracy, alignment yield, gap placement, mismatches, and exon junction sites.

Indeed, some features of a reference genome such as repetitive regions, assembly errors, and assembly gaps render this objective impossible for a subset of reads. Furthermore, because RNA‐seq libraries are constructed from transcribed RNA, intronic sequences are not present in exon‐exon spanning reads. Therefore, when aligning the sequences to a reference genome, reads that span exon‐exon junctions have to be split across potentially thousands of bases of intronic sequence. Many RNA‐seq alignment tools use reference transcriptomes to inform the alignment of junction reads. The benefits of using a reference transcriptome to map RNA‐seq reads have been demonstrated clearly in previous reports [35–37] and our own comprehensive evaluation [38] of RefGene (RefSeq Gene) [39], UCSC Known Genes [40], and Ensembl [41] in mapping of RNA‐seq reads and gene quantifications.

The benefits of using a reference transcriptome in mapping of RNA‐seq reads are illustrated in **Figure 2**. In **Figure 2A**, 19 junction reads can be uniquely mapped to gene HSP90AB1 when RefGene annotation [36] is provided in the alignment step. However, four reads indicated by the red arrow are mapped to the same gene HSP90AB1 as nonjunction reads with mismatches at one end without the assistance of a reference transcriptome. In **Figure 2B**, those exon‐exon spanning reads are mapped to gene TCEA3 with the exact same start and end positions regardless of the use of a transcriptome, but spliced differently. Both mappings are equal in terms of alignment scores and gaps between exons. It is therefore difficult, if not impossible, to tell which alignment is correct without the assistance of a reference transcriptome. Collectively, the two examples in **Figure 2** illustrate the importance of appropriate gene annotations in the correct alignment of junction reads.
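One way to see why a reference transcriptome helps junction reads: a read aligned in transcript coordinates can be projected back to the genome exon by exon, so the junction is placed exactly where the annotation dictates rather than searched for in intronic sequence. The sketch below is illustrative (hypothetical function and coordinates), not the algorithm of any particular aligner.

```python
def transcript_to_genome(t_start, length, exons):
    """Project an alignment in transcript coordinates onto the genome.

    exons: (genomic_start, genomic_end) pairs, 0-based half-open, in
    transcript order. A read aligned to the transcript is emitted as one
    genomic block per exon it overlaps, so a junction read is split
    across the intron exactly where the annotation says it should be.
    """
    blocks, offset, remaining = [], 0, length
    for g_start, g_end in exons:
        exon_len = g_end - g_start
        if remaining > 0 and t_start < offset + exon_len:
            s = g_start + max(0, t_start - offset)
            e = min(g_end, s + remaining)
            blocks.append((s, e))
            remaining -= e - s
        offset += exon_len
    return blocks

# Two exons separated by a long intron (hypothetical coordinates);
# a 40 bp read starting 80 bp into the transcript spans the junction.
exons = [(1000, 1100), (5000, 5100)]
print(transcript_to_genome(80, 40, exons))   # [(1080, 1100), (5000, 5020)]
```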

**Figure 2.** The impact of a reference transcriptome on the mapping of junction reads. Some exon‐exon spanning reads are mapped incorrectly without the help of a reference transcriptome. Reads colored blue are mapped to the "+" strand and reads colored green to the "-" strand. Mismatched nucleotide bases are colored red.

Another problem in read mapping arises when sequence reads align to multiple locations of the genome; such reads are called multireads. Multireads are especially common for large and complex transcriptomes, and the fraction of multireads for a mammalian genome is estimated to be between 10 and 40% [1, 3]. Generally speaking, there are three common strategies for dealing with multireads in practice. The first strategy is to ignore or discard them completely, which we have demonstrated is not ideal for accurate gene quantification [36]. This practice not only discards potentially useful information, but also introduces an underestimation bias in quantifying the expression of genes with highly redundant sequences (e.g., young duplicated genes). The second strategy, implemented in most mapping software, is to randomly assign a position from the possible matches. This practice assumes that a multiread can originate from each of these genomic locations with equal probability, but this assumption is often not valid. The third strategy is to report all mapped locations for a multiread as long as the number of possible matches is below a user‐defined cutoff, say 10. The problem with this strategy is that the cutoff is somewhat arbitrary. For accurate detection and quantification of transcripts, it is important to resolve the mapping ambiguity for RNA‐seq reads that can be mapped to multiple loci [42].
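The three strategies can be sketched on toy data as follows; the function and its inputs are hypothetical, and real tools work on alignment records rather than dictionaries.

```python
import random

def assign_multireads(alignments, strategy, cutoff=10, seed=0):
    """The three multiread strategies above, on toy data.

    alignments: read id -> list of candidate loci. Returns locus -> count.
    """
    rng = random.Random(seed)
    counts = {}
    for read, loci in alignments.items():
        if len(loci) == 1:
            chosen = loci                      # unique reads always count
        elif strategy == "discard":
            continue                           # 1) drop multireads entirely
        elif strategy == "random":
            chosen = [rng.choice(loci)]        # 2) pick one locus at random
        else:                                  # 3) "report_all" up to a cutoff
            if len(loci) > cutoff:
                continue
            chosen = loci
        for locus in chosen:
            counts[locus] = counts.get(locus, 0) + 1
    return counts

reads = {"r1": ["geneA"], "r2": ["geneA", "geneB"], "r3": ["geneB"]}
print(assign_multireads(reads, "discard"))     # {'geneA': 1, 'geneB': 1}
print(assign_multireads(reads, "report_all"))  # {'geneA': 2, 'geneB': 2}
```

Note how "discard" undercounts both genes (the bias described above), while "report_all" double-counts the ambiguous read.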

### **4.3. Read counting and gene quantification**

Since RNA‐seq has become a common technology in molecular biology laboratories, a number of methods have been developed for the inference of gene and isoform abundance, including RSEM [43], Cufflinks [44], IsoEM [45], featureCounts [46], and HTSeq [47]. The algorithms featureCounts [46] and HTSeq [47] are comparable in terms of counting results, but featureCounts is faster than HTSeq by an order of magnitude for gene‐level summarization and requires far less memory. However, neither featureCounts nor HTSeq can count reads at the transcript level due to their implementation. Andelini et al. [48] carried out a simulation study to assess the performance of five widely used counting tools and concluded that performance was heavily dependent upon the true abundance of the isoforms; lowly expressed isoforms are poorly detected regardless of the method.

Most recently, Kanitz et al. [49] evaluated the accuracy of 14 methods for estimating isoform abundance and found that these tools vary widely in memory and runtime requirements. The algorithms for gene quantification can be broadly divided into two categories: transcript‐based approaches (such as RSEM [43]) and "union‐exon"‐based approaches (such as featureCounts [46]). Because different isoforms of a gene typically have a high proportion of genomic overlap, it is intrinsically more difficult to estimate the expression of individual isoforms. Union‐exon‐based methods are much simpler: all overlapping exons of the same gene are merged into union exons, and a read is counted toward the gene as long as it has sufficient overlap with any of its union exons. Compared with isoforms, reads can be assigned to genes with much higher confidence. Therefore, the union‐exon‐based counting method is commonly used in RNA‐seq, though gene‐level counts cannot distinguish isoforms [50].
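A minimal sketch of union‑exon counting, assuming simple interval inputs (featureCounts and HTSeq implement this far more carefully, with read pairing, strandedness, and configurable overlap modes):

```python
def union_exons(exons):
    """Merge a gene's (possibly overlapping) isoform exons into union exons."""
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # overlapping or adjacent interval: extend the last union exon
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def hits_gene(read, union, min_overlap=1):
    """Count the read toward the gene if it overlaps any union exon enough."""
    r_start, r_end = read
    overlap = sum(max(0, min(r_end, e) - max(r_start, s)) for s, e in union)
    return overlap >= min_overlap

# Exons from two isoforms collapse into two union exons:
iso_exons = [(100, 200), (150, 250), (400, 500)]
print(union_exons(iso_exons))                        # [(100, 250), (400, 500)]
print(hits_gene((240, 300), union_exons(iso_exons))) # True
```

The read in the example is counted toward the gene, but which isoform produced it is unrecoverable at this level, which is exactly the limitation noted above.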

A gene can be expressed in one or more transcript isoforms; accordingly, its expression level should be represented as the sum of those of its isoforms. We carried out a side‐by‐side comparison between the union‐exon‐based approach and the transcript‐based method in RNA‐seq gene quantification [40], and found that gene expression levels were significantly underestimated when the union‐exon‐based approach was used. We also discovered that gene expression quantification is more accurate when gene expression levels are computed by summing the expression levels of the transcript isoforms than by ignoring the transcript structures.

A gene model, which hypothesizes the structure of the transcripts produced by a gene, also affects the analysis. Among the many genome annotation databases, RefGene, Ensembl, and the UCSC annotation databases are the most popular, and the choice of genome annotation directly affects gene expression estimation. Recently, we systematically characterized the impact of genome annotation on read mapping and transcriptome quantification [38]. Surprisingly, among the 21,958 common genes shared between the RefGene and Ensembl annotations, only 16.3% of genes obtained identical quantification results. Approximately 28.1% of genes' expression levels differed by >5% between the annotations, and of those, the relative expression levels of 9.3% of genes differed by at least 50%. Our study revealed that differences in gene definition frequently result in inconsistency in gene quantification (**Figure 3**). In Ensembl, the annotation of PIK3CA is much longer than its corresponding definition in RefGene. As a result, many more reads are counted toward PIK3CA if the Ensembl annotation is used. According to the mapping profile of the RNA‐seq reads in **Figure 3**, the PIK3CA gene definition in Ensembl should be more accurate than the one in RefGene.

**Figure 3.** Different gene definitions for PIK3CA give rise to differences in gene quantification. PIK3CA in the Ensembl annotation is much longer than its definition in RefGene, which explains why 1094 reads are mapped to PIK3CA in Ensembl while only 492 reads are mapped using RefGene.

### **4.4. Data normalization and differential analysis**

After calculating the read counts, data normalization is one of the most crucial steps of data processing, and it must be carefully considered, as it is essential for accurate inference of gene expression and all subsequent analyses. First, the *sequencing depths* or *library sizes* (the total number of mapped reads) typically vary between samples, which means that the observed counts are not directly comparable between samples. The most straightforward way of normalizing for the difference in library sizes is to rescale by the total read counts. However, such normalization is often not enough, because RNA‐seq counts inherently represent *relative* abundances of genes in a sample. The number of reads mapped to a gene depends not only on the expression level and length of the gene, but also on the composition of the RNA population that is being sampled. A few highly expressed genes may consume a very large portion of the total reads in a sample, and accordingly, the counts for all other genes are suppressed. As a result, in comparison with a sample where the reads are more evenly distributed, those suppressed genes appear to have lower expression, which can give rise to many falsely "differentially expressed" genes.
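A small numeric example of why rescaling by total reads is not enough: two samples with identical expression except for one dominant transcript. The gene names and counts are invented.

```python
def cpm(counts, library_size=None):
    """Counts per million: rescale each gene by the total mapped reads."""
    total = library_size or sum(counts.values())
    return {g: 1e6 * c / total for g, c in counts.items()}

# Identical expression except one transcript that soaks up reads in b:
a = {"dominant": 100_000, "g1": 1_000, "g2": 1_000}
b = {"dominant": 900_000, "g1": 1_000, "g2": 1_000}
print(round(cpm(a)["g1"]), round(cpm(b)["g1"]))
# g1 is biologically unchanged, yet its CPM drops roughly ninefold in b:
# naive library-size scaling mistakes a composition shift for expression change.
```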

A fundamental research aim in many RNA‐seq studies is to identify differentially expressed genes between distinct sample groups. Many algorithms have recently been introduced specifically for the identification of differentially expressed genes (DEGs) from RNA‐seq data, including DESeq [51], edgeR [52, 53], GENE‐Counter [54], NOISeq [55], NBPSeq [56], and Cuffdiff2 [57]. However, there is a lack of consensus on how to approach an optimal study design and choose suitable software for the analysis of RNA‐seq data sets. Recently, numerous groups [58–66] have performed a variety of comprehensive comparisons of different statistical methods for differential RNA‐seq data analysis. Still, no general consensus has been reached after so many head‐to‐head comparisons and evaluations.

Zhang et al. [61] evaluated the performance of three widely used software tools: DESeq, edgeR and Cuffdiff2. They took a number of important metrics into consideration, including sequencing depth and the number of replicates, and the set of identified DEGs was evaluated with ground truths from either quantitative RT‐PCR or microarray. They concluded that no single method is always superior in all DE analyses. It was noted that edgeR performs slightly better than DESeq and Cuffdiff2 in terms of the ability to uncover true positives and that Cuffdiff2 is not recommended for gene‐level DE analysis, particularly if sequencing depth is low. Seyednasrollah et al. [65] also carried out a systematic comparison of the state‐of‐the‐art methods in RNA‐seq differential analysis to guide the selection of a suitable package. In general, there can be large differences between the algorithms. Similar to the evaluation performed by Zhang et al. [63], it was observed that no single method is likely to be optimal under all circumstances. They also demonstrated how the data analysis tool utilized can markedly affect the outcome of a differential analysis and highlighted the importance of the choice of software. Soneson and Delorenzi [61] have conducted extensive comparisons of 11 methods for DE analysis of RNA‐seq data and concluded that very small sample size, which is still common in RNA‐seq experiments, imposes problems for all evaluated methods and any results obtained under such conditions should be interpreted with caution.

We have applied edgeR to several whole blood RNA‐seq data sets (unpublished results), and the calculated normalization factors ranged from 0.4 to 1.6; we were puzzled by such unreasonably low or high scaling factors. The normalization method implemented in edgeR is based on the premise that most genes are not differentially expressed. Instead of normalizing the raw counts directly, edgeR scales the library size by a factor chosen so that the adjusted abundances (i.e., the read counts divided by the scaled library size) show no differential expression for most genes. The TMM (trimmed mean of M‐values) is calculated for each sample in a data set, with one sample considered the reference sample and the others test samples. For each test sample, the TMM is computed between that test sample and the reference after exclusion of the most highly expressed genes (5% by default) and the genes with the largest fold‐changes (30% on each side, up and down, by default). Ideally, the TMM should be close to 1, but in cases where it is not, its value provides an estimate of the correction factor that must be applied to the library sizes (but not the raw counts) for normalization. However, TMM becomes problematic when the library composition differs significantly between test and reference samples. Furthermore, whether the default parameters in edgeR are appropriate for a given RNA‐seq data set is difficult to determine, and different sets of genes are used for calculating the scaling factors across the data set, since the genes excluded can vary from sample to sample.
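The TMM computation described above can be sketched as follows. This is a simplified reading of the published method with made‑up counts; edgeR's actual implementation additionally applies precision weights and other refinements.

```python
import math

def tmm_factor(test, ref, trim_m=0.30, trim_a=0.05):
    """Simplified TMM between one test and one reference sample.

    test, ref: per-gene read counts in the same order. M (log2 fold-
    change of relative abundance) and A (average log2 abundance) are
    computed per gene; trim_m of each M tail and trim_a of each A tail
    are discarded; the factor is 2**mean(remaining M values).
    """
    nt, nr = sum(test), sum(ref)
    m, a = [], []
    for t, r in zip(test, ref):
        if t > 0 and r > 0:        # genes absent in either sample are skipped
            pt, pr = t / nt, r / nr
            m.append(math.log2(pt / pr))
            a.append(0.5 * math.log2(pt * pr))
    keep = set(range(len(m)))
    for values, frac in ((m, trim_m), (a, trim_a)):
        order = sorted(keep, key=lambda i: values[i])
        k = int(len(order) * frac)
        if k:                      # drop frac of the genes from each tail
            keep -= set(order[:k]) | set(order[-k:])
    kept = [m[i] for i in keep]
    return 2 ** (sum(kept) / len(kept)) if kept else 1.0

# Identical compositions at different depths: the factor is ~1,
# as the text says a well-behaved TMM should be.
print(round(tmm_factor([100, 200, 300, 400], [50, 100, 150, 200]), 3))
```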

### **4.5. Pathway enrichment analysis**

Obtaining a list of DE genes is only the starting point for gaining biological insight into experimental systems or developmental stages, or for understanding disease or molecular mechanisms. To understand the biological context of DE genes, pathway enrichment analysis ensues. Functional enrichment analyses rely upon annotation databases such as Gene Ontology (GO) [67], Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [68], DAVID [69], and commercial knowledge systems such as Ingenuity Pathway Analysis (IPA). One traditional analysis starts with a gene list of interest, identified from differential RNA‐seq or microarray analyses, and applies statistical methods, such as Fisher's exact test, to test for enrichment of each annotated gene set, network, and pathway.
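The over-representation test reduces to a hypergeometric tail probability; a self‑contained sketch follows, with illustrative numbers and no multiple‑testing correction.

```python
from math import comb

def enrichment_p(n_hits, n_list, n_set, n_genome):
    """One-sided hypergeometric p-value: probability of observing at
    least n_hits pathway genes in a DE list of size n_list, given a
    pathway of n_set genes in a background of n_genome genes.
    """
    return sum(comb(n_set, k) * comb(n_genome - n_set, n_list - k)
               for k in range(n_hits, min(n_list, n_set) + 1)) / comb(n_genome, n_list)

# 8 of 100 DE genes hit a 40-gene pathway against a 10,000-gene
# background (invented numbers); expected by chance is only 0.4 hits.
print(f"p = {enrichment_p(8, 100, 40, 10_000):.2e}")
```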

Gene set enrichment (GSE) analysis transforms information from gene expression profiling into a pathway summary. DE genes are quite often involved in the same biological pathways, and GSE results offer greater biological interpretability than individual gene analysis. GSVA [70] extends the current GSE methods to RNA‐seq data, provides increased power to detect subtle pathway activity changes, and constitutes a starting point for building pathway‐centric models of systems biology. SeqGSEA [71] is a new open‐source Bioconductor package for gene set enrichment analysis of RNA‐seq data that can detect more biologically meaningful gene sets without biases toward longer or more highly expressed genes.

Previous pathway analysis methods were developed based on algorithms that consider pathways as simple gene lists, ignoring pathway structure. Recently, methods have been developed to incorporate various aspects of pathway topology. For example, SPIA captures pathway topology through its scoring system, in which the positions and interactions of the genes in the pathway are considered [72]. Accordingly, interacting DEG pairs are weighted preferentially over two noninteracting genes. Similarly, TAPPA [73] is a scoring method in which higher weights are automatically assigned to hub genes and interacting gene pairs. DE analysis for pathways (DEAP) [74] makes significant improvements over existing approaches by including information about pathway topological structure; it was demonstrated [74] that DEAP identified 14 additional chronic obstructive pulmonary disease–related pathways that existing approaches omitted.

### **4.6. QuickRNASeq—an integrated pipeline for efficient RNA‐seq data analysis and interactive visualization**

Although the time and cost for generating RNA‐seq data are decreasing, the analysis of massive amounts of RNA‐seq data still remains challenging. Numerous software packages and algorithms have been developed, which has led to the need to apply these tools efficiently to obtain results within a reasonable timeframe, especially for large data sets. Based on our own experience with analyses of multiple in‐house RNA‐seq data sets of varying size using open source tools, the main challenges, gaps, and bottlenecks for large‐scale RNA‐seq data analyses can be summarized as follows:

**1.** It is not trivial to select appropriate software packages and set software‐specific parameters, since doing so requires an in‐depth understanding of the algorithms.

**2.** Additional bridging scripts are often necessary to make different components work seamlessly in a pipeline.

**3.** In general, most algorithms are implemented to process an individual sample, and thus it is necessary to integrate and summarize results from individual samples.

**4.** Due to sample quality and the complicated multistep processes in RNA‐seq, stringent data quality metrics are required to identify outliers that should be excluded from further downstream data analysis.

**5.** Nearly all RNA‐seq data analyses are performed on Linux workstations; however, analysis results in Linux are often inaccessible to most experimental scientists. Thus, sharing results with scientists is a practical challenge.

**Figure 4.** Representative entry webpage for a QuickRNAseq project report. The page layout and printable version of the page can be controlled by the top icons. The QC Metrics section provides QC results in plain text, static plot, and interactive plot formats, accessible by clicking on the corresponding hyperlinked texts, the iconized figures, and pointing hand, respectively. The parallel plot of QC offers an integrated view of linked QC measures for a single sample or a group of samples. The expression tables section provides links to raw read counts, a normalized RPKM table, and interactive displays of gene expression levels.

To address these challenges, we have implemented a pipeline named QuickRNASeq to advance the automation and interactive visualization of RNA‐seq data analysis results [75]. QuickRNASeq significantly reduces data analysts' hands‐on time, which results in a substantial decrease in the time and effort needed for the primary analyses of RNA‐seq data before proceeding to further downstream analysis and interpretation. Additionally, QuickRNASeq provides a dynamic data sharing and interactive visualization environment for end users. All results are accessible from a web browser without the need to set up a web server or database. The rich visualization features implemented in QuickRNASeq enable nonexpert end users to interact easily with the RNA‐seq data analysis results and to drill down into specific aspects of often complex data sets through a simple point‐and‐click approach. A representative entry webpage for a QuickRNASeq project report is shown in **Figure 4**. All result files and figures are directly accessible by "point and click" from the entry webpage, which makes data navigation and visualization convenient and intuitive, especially for experimental scientists.

### **5. New RNA sequencing technologies**

### **5.1. Stranded RNA‐seq**

136 Bioinformatics - Updated Features and Applications


One significant shortcoming of the first‐generation RNA‐seq protocol was that it did not retain the strand information for each transcript. Recently, strand‐specific or stranded RNA‐seq protocols have been developed [76]. Previous reports [8] demonstrated that data from stranded libraries are more reliable than data from nonstranded libraries and can correctly evaluate the expression of both antisense RNA and overlapping genes. The ability to capture the relative abundance of both sense and antisense expression provides insight into regulatory interactions that might otherwise be missed [9]. With the ability to unlock new information on global gene expression, stranded RNA‐seq holds the key to a deeper understanding of the transcriptome. We performed a side‐by‐side comparison of stranded and nonstranded RNA‐seq in our whole blood RNA‐seq data set, and demonstrated that stranded RNA‐seq provides a more accurate estimate of gene expression compared with nonstranded RNA‐seq and is therefore the recommended RNA‐seq approach for all future mRNA‐seq studies [77].

The advantages of stranded RNA‐seq are illustrated in **Figure 5**. ICAM4 (intercellular adhesion molecule 4) shows moderate expression in whole blood. However, nonstranded RNA‐seq reports no expression for this gene. As observed in **Figure 5**, ICAM4 is 100% contained within CTD‐2369P2.8. In nonstranded RNA‐seq, a read mapped to ICAM4 is simultaneously aligned to CTD‐2369P2.8 as well. The ambiguous reads in overlapping regions are thus excluded from counting, which explains the lack of expression for ICAM4 with nonstranded RNA‐seq. The ambiguous reads in overlapping genes in **Figure 5** can be perfectly resolved using stranded RNA‐seq. By considering the read direction, all reads are assigned to ICAM4 (but not CTD‐2369P2.8), because they are all reverse complementary to ICAM4. According to a stranded sequencing protocol, it is impossible for such reads to originate from CTD‐2369P2.8.
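The strand‐aware assignment logic described above can be sketched as a toy model. The gene coordinates and the forward‐stranded convention below are invented for illustration only; real pipelines perform this assignment on BAM alignments with tools such as featureCounts or HTSeq.

```python
# Toy model of strand-aware read counting for two overlapping genes.
# Coordinates are invented; the forward-stranded convention (read strand
# equals transcript strand) is an illustrative simplification.

GENES = {
    "ICAM4": ("+", 100, 200),          # entirely contained within the gene below
    "CTD-2369P2.8": ("-", 50, 400),
}

def overlapping_genes(read_start, read_end):
    return [g for g, (_, s, e) in GENES.items()
            if read_start < e and read_end > s]

def assign(read_start, read_end, read_strand, stranded=True):
    """Return the gene a read is counted to, or None if ambiguous."""
    hits = overlapping_genes(read_start, read_end)
    if stranded:
        # The read strand identifies the transcript strand, so
        # antisense overlaps can be resolved.
        hits = [g for g in hits if GENES[g][0] == read_strand]
    if len(hits) == 1:
        return hits[0]
    return None                         # ambiguous: excluded from counting

# A read falling inside ICAM4, sequenced from the "+" strand:
print(assign(120, 180, "+", stranded=False))  # None  (ambiguous, discarded)
print(assign(120, 180, "+", stranded=True))   # ICAM4 (resolved by strand)
```

With nonstranded data, the read overlaps both genes and is discarded; the strand filter recovers it for ICAM4, mirroring the behavior shown in **Figure 5**.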

**Figure 5.** (A) Gene expression of ICAM4 in stranded and nonstranded RNA‐seq. (B) Mapping profiles for ICAM4 (intercellular adhesion molecule 4). ICAM4 is on the "+" strand, and 100% contained within CTD‐2369P2.8 on the "-" strand. In nonstranded RNA‐seq, the ambiguous reads in overlapping regions are excluded from counting, which explains why there is no expression for ICAM4. However, the ambiguous reads can be perfectly resolved in stranded RNA‐seq. By considering the read direction, all reads can be counted to ICAM4 but not CTD‐2369P2.8. *Note:* (1) RNAs were extracted from pooled whole blood samples, and four replicates were pair‐end sequenced using both stranded and nonstranded protocols. The unit of the *y*‐axis is RPKM in the plot to the left. (2) All genes, transcripts, and sequence reads are colored in blue if they are on the "+" strand and in green if on the "-" strand. According to our stranded sequencing protocol, a sequence read should be reversely complementary to its transcript of origin.

### **5.2. Targeted RNA‐seq**

RNA‐seq can be a powerful tool to measure gene expression, detect novel transcripts, characterize transcript isoforms, and identify sequence polymorphisms. However, this unbiased RNA‐seq method can be costly and yields complex data sets that are time consuming to analyze. Often one is interested in only a small subset of genes, or the goal is to study only one component of the transcriptome, such as long noncoding RNAs (lncRNAs), which constitute only a small fraction of transcripts in a total RNA sample [78–80]. A targeted quantitative RNA‐seq method that is reproducible and reduces the number of sequencing reads required to measure key transcripts is better suited to these purposes. Most recently, Tan et al. [80] described a targeted enrichment method for the analysis of lncRNAs. Targeted RNA‐seq can measure dozens to hundreds of targets simultaneously. It offers an economical way to focus on genes of interest, and provides enhanced coverage for sensitive gene discovery, robust transcript assembly, and accurate gene quantification. Common uses of this method include:

**•** Profiling expression of select target genes, to assess disease‐associated variants and epigenetic alterations.

**•** Analyzing gene fusions and gene expression alterations to provide a focused view of functionally relevant changes occurring in cancer.

**•** Studying genes associated with a variety of key signal transduction pathways, such as NF‐kB, P450, IFN response, apoptosis pathways, and many more.

There are several approaches for target RNA enrichment: hybridization only (Agilent SureSelect and Roche SeqCap systems), hybridization followed by extension (Illumina Targeted RNA‐seq; **Figure 6**), or PCR amplification (Thermo Fisher Ampliseq and BD Cellular Research Precise Assays). Genes of interest are enriched by custom‐designed probes (**Figure 6**). However, these targeted sequencing procedures can potentially introduce biases caused by nonuniform hybridization as well as variation in amplification efficiencies across different genes and transcripts. According to our in‐house pilot study (unpublished data), gene expression results are very sensitive to the choice of probes and targeted regions. Cellular Research Precise Assays [81] allow researchers to introduce unique‐sequence barcodes into the samples to overcome possible amplification bias and thus produce more accurate estimates of transcript abundance. By counting unique barcodes instead of reads, researchers can determine how many copies of each transcript are expressed with very high accuracy. At the same time, targeted RNA‐seq poses new challenges for data analysis. For example, most differential analysis packages assume that the majority of genes are not differentially expressed, but this assumption is most likely not valid in targeted RNA‐seq data sets.
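Counting unique molecular barcodes rather than raw reads, as in the barcoded assays mentioned above, can be illustrated with a minimal sketch. The gene names and barcode strings are invented for illustration; real assays attach the barcode during reverse transcription, before amplification.

```python
from collections import defaultdict

# Each record is (gene, molecular_barcode); PCR duplicates share a barcode.
reads = [
    ("TP53", "AACG"), ("TP53", "AACG"), ("TP53", "AACG"),  # one molecule, amplified 3x
    ("TP53", "GGTA"),                                      # a second molecule
    ("MYC",  "CTTG"), ("MYC",  "CTTG"),                    # one molecule, amplified 2x
]

raw_counts = defaultdict(int)
umis = defaultdict(set)
for gene, umi in reads:
    raw_counts[gene] += 1
    umis[gene].add(umi)        # deduplicate: count molecules, not reads

molecule_counts = {g: len(s) for g, s in umis.items()}
print(dict(raw_counts))       # {'TP53': 4, 'MYC': 2}  -- inflated by amplification bias
print(molecule_counts)        # {'TP53': 2, 'MYC': 1}  -- amplification-independent
```

Because every amplified copy inherits its parent molecule's barcode, the deduplicated count is unaffected by per‐gene differences in amplification efficiency.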

**Figure 6.** TruSeq targeted RNA expression workflow. Two custom‐designed oligonucleotide probes with adapter sequences hybridize upstream and downstream of the region of interest. ULSO stands for upstream locus‐specific oligo and DLSO stands for downstream locus‐specific oligo.

### **5.3. Single‐cell RNA‐seq**

Cell identity and function can be characterized at the molecular level by unique transcriptomic signatures. At the organismal level, different tissues have distinct gene expression profiles, and even cells in consecutive stages of embryonic development have highly divergent transcriptomic landscapes. Until recently, molecular 'fingerprints' were generated by profiling gene expression levels from bulk populations of millions of input cells. These ensemble‐based approaches mean that the resulting expression value for each gene is an average of its expression levels across a large population of input cells. In many contexts, such bulk expression profiles are sufficient. However, there are also important questions for which bulk measures of gene expression are insufficient, for instance, heterogeneity in immune cells. Moreover, ensemble measures do not provide insight into the stochastic nature of gene expression.

Recently, RNA‐seq has achieved single‐cell resolution, and scRNA‐seq enables unbiased, high‐throughput, and high‐resolution transcriptomic analysis of individual cells [82, 83]. This provides an additional dimension to transcriptomic information relative to traditional methods that profile bulk populations of cells. What is currently still missing is an effective way to routinely isolate and process large numbers of individual cells for quantitative in‐depth RNA‐seq. Klein et al. [84] and Macosko et al. [85] have independently developed a high‐throughput droplet‐microfluidic approach for barcoding of RNA from thousands of individual cells for subsequent analysis by next‐generation sequencing. Droplet‐based scRNA‐seq will be an attractive method for many laboratories because of its seemingly unlimited scalability and relatively low cost. By combining sophisticated RNA‐seq technology with a new device that isolates single cells and their progeny, MIT researchers can now trace detailed family histories for several generations of cells descended from one "ancestor" [86].
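In droplet‐based protocols, every read carries a cell barcode, so demultiplexing reads into per‐cell expression profiles is conceptually a grouping operation. The barcodes and genes below are invented for illustration; real pipelines also correct barcode sequencing errors and collapse UMIs.

```python
from collections import Counter, defaultdict

# (cell_barcode, gene) pairs as they might emerge from barcode-tagged reads
tagged_reads = [
    ("CELL01", "GAPDH"), ("CELL01", "ACTB"), ("CELL01", "GAPDH"),
    ("CELL02", "ACTB"),  ("CELL02", "CD19"),
]

# Demultiplex: build one expression profile (gene -> read count) per cell
profiles = defaultdict(Counter)
for barcode, gene in tagged_reads:
    profiles[barcode][gene] += 1

print(profiles["CELL01"])   # Counter({'GAPDH': 2, 'ACTB': 1})
print(profiles["CELL02"])   # Counter({'ACTB': 1, 'CD19': 1})
```

Grouping by barcode is what turns a single pooled sequencing run into thousands of per‐cell transcriptomes.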

Alongside the large‐scale generation of single‐cell transcriptomic data, it is important to consider the specific computational and analytical challenges that still have to be overcome [87]. Although some tools for bulk RNA‐seq analysis can be readily applied to single‐cell data, many new computational strategies are required to fully exploit this new data type. For instance, many genes in single cells are expressed in a stochastic, bursting fashion, and their abundance exhibits a bimodal distribution across cell populations. Another main problem is that each cell can be in a different cell cycle phase, and cells might vary in size and RNA content. Traditional RNA‐seq data analysis does not take transcriptional bimodality into consideration, and implicitly assumes the total RNA amount is the same, or at least comparable. Additionally, compared with bulk RNA‐seq, scRNA‐seq data have much larger technical variation, mainly because the starting amount of RNA in a single cell is much lower and requires many more cycles of amplification. Recently, Korthauer et al. [88] developed scDD, a differential analysis method particularly suited for scRNA‐seq, to characterize differences in expression in the presence of distinct expression states within and among biological conditions. Using simulated and case study data, they demonstrated that the modeling framework is able to detect DE patterns of interest under a wide range of settings. Compared with existing approaches, scDD has higher power to detect subtle differences in gene expression distributions that are more complex than a mean shift, and is able to characterize those differences.
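The limitation of mean‐shift testing can be made concrete with a small numeric example (the expression values are invented): two conditions can share the same mean while one is bimodal, which is exactly the kind of difference distribution‐aware methods such as scDD are designed to detect.

```python
from statistics import mean, pstdev

# Toy expression of one gene across 8 cells in two conditions.
# Condition A: uniform moderate expression. Condition B: bimodal --
# half the cells are silent, half express at a high "burst" level.
cond_a = [5, 5, 5, 5, 5, 5, 5, 5]
cond_b = [0, 0, 0, 0, 10, 10, 10, 10]

print(mean(cond_a), mean(cond_b))      # 5 5    -> identical means
print(pstdev(cond_a), pstdev(cond_b))  # 0.0 5.0 -> very different spread

# A mean-shift test reports no difference here, even though the two
# conditions have entirely different expression states.
```

Any statistic sensitive only to location misses this gene; the variance (or a full distributional comparison) reveals the change in expression state.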

### **6. Concluding remarks**


In much the same way that the advent of NGS technologies transformed our approaches to DNA sequencing, RNA‐seq approaches have dramatically changed our abilities to analyze the transcriptome of cells and tissues with a new level of detail and sensitivity. But we must also recognize that RNA‐seq analysis is vulnerable to the general biases and errors inherent to the next‐generation sequencing (NGS) technology upon which it is based.

While RNA‐seq technology is considered unbiased, it is important to note that the preparation and fragmentation of RNA and the library construction (which includes size selection) can introduce biases. Fragments are not uniformly sampled and sequenced, as there is variability in sequencing depth across the transcriptome due to preferential sites of fragmentation, variable primer effects, and tag nucleotide composition effects [89, 90]. RNA‐seq is a complicated multistep process that involves sample collection and stabilization, RNA extraction, fragmentation, cDNA synthesis, adapter ligation, amplification, purification, and sequencing. Any step in this complex sequence of protocols can result in biased data.

Data normalization is one of the most crucial steps of RNA‐seq data processing and has a profound effect on the results of the analysis. In practice, normalization of high‐throughput data remains an important topic and has received a lot of attention in the literature. The increasing number of normalization methods makes it difficult for scientists to decide which method should be used for which particular data set [58–66]. Even worse, different research groups draw contradictory conclusions. For example, Zyprych‐Walczak et al. [66] concluded that the TMM method performed poorly in most cases, whereas the study by Dillies et al. [64] indicated that the TMM method led to good performance on simulated RNA‐seq data sets. Considering the confusion and the number of methods available, we believe RNA‐seq data normalization is a fundamental question that is far from being solved, and the RNA‐seq community will have to develop more sophisticated and robust algorithms to tackle this problem.
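As a concrete illustration of one widely used approach, the median‐of‐ratios idea behind DESeq‐style size factors can be sketched for two samples. The count table is invented, and real implementations handle zero counts and many samples more carefully; this is a simplified sketch, not the reference algorithm.

```python
import math

# Toy count table: rows are genes, columns are two samples. Sample B was
# sequenced twice as deeply, so every count is roughly doubled.
counts = {
    "gene1": (10, 20),
    "gene2": (50, 100),
    "gene3": (200, 400),
    "gene4": (4, 8),
}

def size_factors(counts):
    """Median-of-ratios size factors: each sample's median ratio to the
    per-gene geometric mean across samples."""
    ratios = ([], [])
    for a, b in counts.values():
        gm = math.sqrt(a * b)          # geometric mean of the two samples
        ratios[0].append(a / gm)
        ratios[1].append(b / gm)

    def median(xs):
        xs = sorted(xs)
        n, mid = len(xs), len(xs) // 2
        return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

    return tuple(median(r) for r in ratios)

sf = size_factors(counts)
print(sf)  # roughly (0.707, 1.414): sample B is scaled down 2x relative to A
```

Dividing each sample's counts by its size factor removes the depth difference while leaving genuine expression differences intact, which is why the median (rather than the total count) is used: it is robust to a handful of highly expressed, truly differential genes.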

Additionally, for RNA‐seq data collected with multiple sequencing platforms or at multiple sites, other normalization methods are required to remove site‐ or technology‐specific effects [91]. Several methods for such normalization have been developed, including sva [92], RUV2 [93], and PEER [94], but they vary in their ability to remove these systematic biases [91]. As the cost of NGS continues to decrease, it is likely that additional studies will be conducted to readdress the same biological question. Therefore, there is an increasing need for analyzing data from multiple studies and different labs [95].
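The simplest form of site‐effect removal, an additive offset per batch, can be illustrated with a toy example (the values and the +3 site offset are invented). Methods such as sva, RUV, and PEER estimate latent factors far more rigorously; this sketch only conveys the basic idea of centering out a batch‐specific shift.

```python
from statistics import mean

# Toy log-expression values for one gene measured at two sites; site 2
# adds a constant technical offset (~ +3) on top of the same biology.
site1 = [5.0, 5.2, 4.8]
site2 = [8.1, 7.9, 8.0]

def center(values):
    """Subtract the batch mean so each batch is centered at zero."""
    m = mean(values)
    return [v - m for v in values]

# After centering, the site offset disappears while the within-batch
# biological variation is preserved.
adjusted = center(site1) + center(site2)
print([round(v, 2) for v in adjusted])  # [0.0, 0.2, -0.2, 0.1, -0.1, 0.0]
```

In practice the batch labels may be unknown or confounded with biology, which is precisely the situation latent‐factor methods are designed for.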

In this chapter, we focused on the data analysis workflow for mRNA‐seq, but have not covered RNA‐seq experimental design. However, a crucial prerequisite for any successful RNA‐seq study is a good experimental design, making sure the data generated have the potential to answer the biological questions of interest. For other aspects of RNA‐seq data analysis, including experimental design, alternative splicing, gene fusion, and eQTL mapping, please refer to the most recent review by Conesa et al. [96]. In practice, RNA‐seq is not used as a stand‐alone technology platform in research. The integration of RNA‐seq data with other types of genome‐wide data allows us to connect the regulation of gene expression with specific aspects of molecular physiology and functional genomics. Integrative analyses that incorporate RNA‐seq data with other genomic or proteomic experiments are becoming increasingly prevalent. For instance, the combination of RNA and DNA sequencing can be used to explore RNA editing or expression quantitative trait loci (eQTL) mapping. Pairwise DNA‐methylation and RNA‐seq integration can reveal the correlation between gene expression and methylation patterns. Additionally, integration of RNA‐seq and miRNA‐seq data has the potential to unravel the regulatory effects of miRNAs on transcript steady‐state levels. All these integrative analyses pose additional challenges that are beyond the scope of this chapter.

### **Author details**

Shanrong Zhao\* , Baohong Zhang, Ying Zhang, William Gordon, Sarah Du, Theresa Paradis, Michael Vincent and David von Schack

\*Address all correspondence to: shanrong.zhao@pfizer.com

PharmaTherapeutics Clinical R&D, Pfizer Worldwide Research and Development, Cambridge, MA, USA

### **References**


[7] Han Y, Gao S, Muegge K, Zhang W, Zhou B. Advanced applications of RNA sequencing and challenges. Bioinform Biol Insights. 2015;9(Suppl 1):29–46. DOI: 10.4137/BBI.S28991


[1] Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA‐seq. Nat Methods. 2008;5:621–628. DOI: 10.1038/nmeth.1226

[2] Wang Z, Gerstein M, Snyder M. RNA‐seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63. DOI: 10.1038/nrg2484

[3] Costa V, Angelini C, De Feis I, Ciccodicola A. Uncovering the complexity of transcriptomes with RNA‐seq. J Biomed Biotechnol. 2010;2010:853916. DOI: 10.1155/2010/853916

[4] Mutz KO, Heilkenbrinker A, Lönne M, Walter JG, Stahl F. Transcriptome analysis using next‐generation sequencing. Curr Opin Biotechnol. 2013;24:22–30. DOI: 10.1016/j.copbio.2012.09.004

[5] Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA‐seq. Nat Methods. 2011;8:469–477. DOI: 10.1038/nmeth.1613

[6] Capobianco E. RNA‐seq data: a complexity journey. Comput Struct Biotechnol J. 2014;11:123–130. DOI: 10.1016/j.csbj.2014.09.004


[20] Wang ET, Sandberg R, Luo S, et al. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008;456:470–476. DOI: 10.1038/nature07509

[21] Griffith M, Griffith OL, Mwenifumbo J, et al. Alternative expression analysis by RNA sequencing. Nat Methods. 2010;7:843–847. DOI: 10.1038/nmeth.1503

[22] Picardi E, Horner DS, Chiara M, Schiavon R, Valle G, Pesole G. Large‐scale detection and analysis of RNA editing in grape mtDNA by RNA deep‐sequencing. Nucleic Acids Res. 2010;38:4755–4767. DOI: 10.1093/nar/gkq202

[23] Maher CA, Kumar‐Sinha C, Cao X, Kalyana‐Sundaram S, Han B, Jing X, et al. Transcriptome sequencing to detect gene fusions in cancer. Nature. 2009;458:97–101. DOI: 10.1038/nature07638

[24] Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011;27:863–864. DOI: 10.1093/bioinformatics/btr026

[25] FastQC [Internet]. 2016. Available from: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [Accessed: 2016‐02‐23].

[26] Martin M. Cutadapt removes adapter sequences from high‐throughput sequencing reads. EMBnet J. 2011;17:10–12. DOI: 10.14806/ej.17.1.200

[27] Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–2120. DOI: 10.1093/bioinformatics/btu170

[28] Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14:R36. DOI: 10.1186/gb‐2013‐14‐4‐r36

[29] Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA‐seq aligner. Bioinformatics. 2013;29:15–21. DOI: 10.1093/bioinformatics/bts635

[30] Dobin A, Gingeras TR. Mapping RNA‐seq reads with STAR. Curr Protoc Bioinformatics. 2015;51:11.14.1–11.14.19. DOI: 10.1002/0471250953.bi1114s51

[31] Wu TD, Nacu S. Fast and SNP‐tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010;26:873–881. DOI: 10.1093/bioinformatics/btq057

[32] Hu J, Ge H, Newman M, Liu K. OSA: a fast and accurate alignment tool for RNA‐seq. Bioinformatics. 2012;28:1933–1934. DOI: 10.1093/bioinformatics/bts294

[33] Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, et al. MapSplice: accurate mapping of RNA‐seq reads for splice junction discovery. Nucleic Acids Res. 2010;38:e178. DOI: 10.1093/nar/gkq622

[34] Engström PG, Steijger T, Sipos B, Grant GR, Kahles A, et al. Systematic evaluation of spliced alignment programs for RNA‐seq data. Nat Methods. 2013;10:1185–1191. DOI: 10.1038/nmeth.2722

[35] Wu P‐Y, Phan JH, Wang MD. Assessing the impact of human genome annotation choice on RNA‐seq expression estimates. BMC Bioinformatics. 2013;14:S8. DOI: 10.1186/1471‐2105‐14‐S11‐S8


[48] Angelini C, De Canditiis D, De Feis I. Computational approaches for isoform detection and estimation: good and bad news. BMC Bioinformatics. 2014;15:135. DOI: 10.1186/1471‐2105‐15‐135

[49] Kanitz A, Gypas F, Gruber AJ, Gruber AR, Martin G, Zavolan M. Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA‐seq data. Genome Biol. 2015;16:150. DOI: 10.1186/s13059‐015‐0702‐5

[50] Zhao S, Xi L, Zhang B. Union exon based approach for RNA‐seq gene quantification: to be or not to be? PLoS One. 2015;10(11):e0141910. DOI: 10.1371/journal.pone.0141910

[51] Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106. DOI: 10.1186/gb‐2010‐11‐10‐r106

[52] Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA‐seq data. Genome Biol. 2010;11:R25. DOI: 10.1186/gb‐2010‐11‐3‐r25

[53] Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140. DOI: 10.1093/bioinformatics/btp616

[54] Cumbie JS, Kimbrel JA, Di Y, Schafer DW, Wilhelm LJ, Fox SE, et al. GENE‐Counter: a computational pipeline for the analysis of RNA‐seq data for gene expression differences. PLoS One. 2011;6:e25279. DOI: 10.1371/journal.pone.0025279

[55] Tarazona S, Garcia‐Alcalde F, Dopazo J, Ferrer A, Conesa A. Differential expression in RNA‐seq: a matter of depth. Genome Res. 2011;21:2213–2223. DOI: 10.1101/gr.124321.111

[56] Di Y, Schafer D, Cumbie J, Chang J. The NBP negative binomial model for assessing differential gene expression from RNA‐seq. Stat Appl Genet Mol Biol. 2011;10:Article 24. DOI: 10.2202/1544‐6115.1637

[57] Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA‐seq. Nat Biotechnol. 2013;31:46–53. DOI: 10.1038/nbt.2450

[58] Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA‐seq experiments. BMC Bioinformatics. 2010;11:94. DOI: 10.1186/1471‐2105‐11‐94

[59] Kvam VM, Liu P, Si Y. A comparison of statistical methods for detecting differentially expressed genes from RNA‐seq data. Am J Bot. 2012;99:248–256. DOI: 10.3732/ajb.1100340

[60] Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ, Taylor JM. Efficient experimental design and analysis strategies for the detection of differential expression using RNA‐sequencing. BMC Genomics. 2012;13:484. DOI: 10.1186/1471‐2164‐13‐484

[61] Zhang ZH, Jhaveri DJ, Marshall VM, Bauer DC, Edson J, Narayanan RK, Robinson GJ, Lundberg AE, Bartlett PF, Wray NR, Zhao QY. A comparative study of techniques for differential expression analysis on RNA‐seq data. PLoS One. 2014;9:e103207. DOI: 10.1371/journal.pone.0103207

[62] Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA‐seq data. BMC Bioinformatics. 2013;14:91. DOI: 10.1186/1471‐2105‐14‐91


[75] Zhao S, Xi L, Quan J, Xi H, Zhang Y, von Schack D, Vincent M, Zhang B. QuickRNASeq lifts large‐scale RNA‐seq data analyses to the next level of automation and interactive visualization. BMC Genomics. 2016;17:39. DOI: 10.1186/s12864‐015‐2356‐9

[76] Levin JZ, Yassour M, Adiconis X, Nusbaum C, Thompson DA, Friedman N, Gnirke A, Regev A. Comprehensive comparative analysis of stranded RNA sequencing methods. Nat Methods. 2010;7:709–715. DOI: 10.1038/nmeth.1491

[77] Zhao S, Zhang Y, Gordon W, Quan J, Xi H, Du S, von Schack D, Zhang B. Comparison of stranded and non‐stranded RNA‐seq transcriptome profiling and investigation of gene overlap. BMC Genomics. 2015;16:487. DOI: 10.1186/s12864‐015‐1876‐7

[78] Mercer TR, Gerhardt DJ, Dinger ME, Crawford J, Trapnell C, Jeddeloh JA, Mattick JS, Rinn JL. Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nat Biotechnol. 2011;30:99–104. DOI: 10.1038/nbt.2024

[79] Mercer TR, Clark MB, Crawford J, Brunck ME, Gerhardt DJ, Taft RJ, Nielsen LK, Dinger ME, Mattick JS. Targeted sequencing for gene discovery and quantification using RNA CaptureSeq. Nat Protoc. 2014;9:989–1009. DOI: 10.1038/nprot.2014.058

[80] Tan JC, Bouriakov VD, Feng L, Richmond TA, Burgess D. Targeted lncRNA sequencing with the SeqCap RNA enrichment system. Methods Mol Biol. 2016;1402:73–100. DOI: 10.1007/978‐1‐4939‐3378‐5_8

[81] Cellular Research Precise Assays [Internet]. 2016. Available from: http://www.cellular‐research.com/products/precise‐assays.html [Accessed: 2016‐02‐23].

[82] Chattopadhyay PK, Gierahn TM, Roederer M, Love JC. Single‐cell technologies for monitoring immune systems. Nat Immunol. 2014;15:128–135. DOI: 10.1038/ni.2796

[83] Avital G, Hashimshony T, Yanai I. Seeing is believing: new methods for in situ single‐cell transcriptomics. Genome Biol. 2014;15:110. DOI: 10.1186/gb4169

[84] Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW. Droplet barcoding for single‐cell transcriptomics applied to embryonic stem cells. Cell. 2015;161:1187–1201. DOI: 10.1016/j.cell.2015.04.044

[85] Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, et al. Highly parallel genome‐wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161:1202–1214. DOI: 10.1016/j.cell.2015.05.002

[86] Kimmerling RJ, Lee Szeto G, Li JW, Genshaft AS, Kazer SW, Payer KR, de Riba Borrajo J, Blainey PC, Irvine DJ, Shalek AK, Manalis SR. A microfluidic platform enabling single‐cell RNA‐seq of multigenerational lineages. Nat Commun. 2016;7:10220. DOI: 10.1038/ncomms10220

[87] Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single‐cell transcriptomics. Nat Rev Genet. 2015;16:133–145. DOI: 10.1038/nrg3833

[88] Korthauer KD, Chu LF, Newton MA, Li Y, Thomson J, Stewart R, Kendziorski C. scDD: a statistical approach for identifying differential distributions in single‐cell RNA‐seq experiments. bioRxiv. 2016 [Epub ahead of print]. DOI: 10.1101/035501


### **Medical Bioinformatics**

### **Application of Bioinformatics Methodologies in the Fields of Skin Biology and Dermatology**

Sidra Younis, Valeriia Shnayder and Miroslav Blumenberg

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/63799

#### **Abstract**

Bioinformatics is a research field that uses computer-based tools to investigate life sciences questions employing "big data" results from large-scale DNA sequencing, whole genomes, transcriptomes, metabolomes, populations, and biological systems, which can only be comprehensively viewed *in silico*. The epidermis was among the earliest targets of bioinformatics studies because it represents one of the most accessible targets for research. An additional advantage of working with the epidermis is that the sample can even be recovered using tape stripping, an easy, noninvasive protocol. Consequently, bioinformatics methods in the fields of skin biology and dermatology generated a fairly large volume of data, which led us to originate the term "skinomics." Skinomics data address epidermal differentiation, malignancies, inflammation, allergens and irritants, the effects of ultraviolet (UV) light, wound healing, the microbiome, stem cells, etc. Cultures of cutaneous cell types (keratinocytes, fibroblasts, melanocytes, etc.), as well as skin from human volunteers and from animal models, have been extensively studied. Here, we review the development of skinomics, its methodology, current achievements, and future potential.

**Keywords:** epidermal differentiation, inflammation, microbiome, noninvasive, psoriasis

### **1. Introduction: both wide angle and focused view of skin biology**

Bioinformatics is an umbrella term for a wide range of methodologies and studies that generate large datasets [1]. The term refers to a methodology rather than a subject matter (akin to "microscopy"). Omics techniques, rather than focusing on a single protein, gene, metabolite, or microorganism, comprehensively deal with the entire collection of all proteins, the whole genome, the complete metabolic array, or the full microbiome of a given biological system. "Omics" methodology has nowadays reached full maturity and recognition by the research community. The methodology produces very large datasets that require sophisticated *in silico* analyses. Accordingly, it relies, on the one hand, on large central databanks, repositories of raw and "pre-processed" data, and complex suites of analysis programs developed by multidisciplinary teams that include statisticians, graphic designers, etc., and on the other hand, on many individual laboratories and groups providing discrete pieces of the large "omics" puzzles and applying the algorithms to their specific objectives [2].

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Bioinformatics approaches received a major impetus with the development of "omics" techniques. Arguably, DNA microarrays are the most widely used omics technology [3]. In microarrays, the DNA probes are immobilized on solid supports, and the samples, such as total bulk DNA or RNA from the specimen, are labeled and then hybridized to the arrays in order to measure the expression of individual genes. Requiring only minute amounts of input DNA or RNA, such microarrays probe simultaneously, in massively parallel experiments, many genes, for example, all the genes in the human genome. These experiments create very large volumes of data. Remarkably, microarrays allow not only a very broad but also a very detailed insight into the biological functions, mechanisms, and diseases of interest to dermatology. Microarrays empower us to see both the "forest and the trees."
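The core computation behind such experiments, deriving per-gene expression changes from hybridization intensities, can be sketched as follows (Python is used here purely for illustration; the probe names and intensities are invented, not data from any cited study):

```python
import math

# Invented hybridization intensities (arbitrary units) for a control and a
# treated sample across a handful of probes on the same array design.
control = {"KRT1": 900.0, "FLG": 150.0, "IL8": 120.0, "GAPDH": 5000.0}
treated = {"KRT1": 850.0, "FLG": 40.0, "IL8": 980.0, "GAPDH": 5100.0}

def log2_fold_changes(treated, control):
    """Per-gene log2 ratio of treated to control intensity."""
    return {g: math.log2(treated[g] / control[g]) for g in control}

fc = log2_fold_changes(treated, control)

# Call a gene "regulated" if it changes at least 2-fold in either direction.
regulated = {g for g, v in fc.items() if abs(v) >= 1.0}
print(sorted(regulated))  # → ['FLG', 'IL8']
```

In a real experiment the same computation runs over tens of thousands of probes at once, which is what makes the datasets so large; replicate arrays and statistical testing replace the simple 2-fold cutoff used here.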

Bioinformatics is a very rapidly developing science that constantly improves its methodology: microarrays, sequencing instruments, data repositories, hardware, and software. To keep up with the field, we have found very useful the special database issue of *Nucleic Acids Research* [4], which is published every January. In these issues, we find descriptions of the functions and roles of various data repositories. Another invaluable resource is *Bioconductor* [5], an ever-expanding collection of bioinformatics algorithms developed by computer scientists and programmers from all over the world. The *Bioconductor* analysis packages are freely available to all. They are well described and annotated, and the program developers are usually helpful in troubleshooting and even extensive hand-holding.

Directly accessible, skin was among the first organs analyzed using omics approaches. As a result, dermatology was one of the first medical disciplines to welcome and support omics results. The name "skinomics" has been proposed to designate specifically the bioinformatics studies in dermatology and skin biology [6]. The objectives of skinomics are to provide, enlarge, and build up our knowledge of skin biology, to improve the function of healthy skin, and to assist in treating pathological skin conditions.

Skinomics studies focused to a significant extent, understandably, on skin cancers [6, 7]. For example, melanoma has been arguably the most studied skin disease. Microarray analyses identified markers of melanoma progression and of its metastatic potential. Similar studies targeted basal and squamous cell carcinomas. Specific to dermatology, a noninvasive method using simple tape stripping can provide adequate material for transcriptional profiling of melanoma, psoriasis, and other skin diseases. The molecular changes in psoriatic plaques, that is, the differences between uninvolved and involved skin and, interestingly, the healed lesions of psoriatics, have also been defined using very large cohorts of patients. The psoriatic patients analyzed using this technology by now number in the hundreds [8]. State-of-the-art international skinomics studies, involving almost 20 different countries, have characterized many of the psoriasis susceptibility loci in the human genome and identified the genes with putative roles in the pathology of this disease. These genes represent potential targets for intervention [8, 9]. DNA microarrays have been used to follow the course of psoriasis treatment and to predict responses or resistance to specific treatment modalities. The characteristic changes in the microbiomes of psoriasis and atopic dermatitis patients have been correlated with the progression of each disease.

Arguably the most frequently and persistently used methodology in skinomics is DNA microarrays, such as those from Affymetrix and Illumina. DNA microarrays are a perfect medium because they simultaneously measure the expression of the entire genome [10]. Printed cDNA arrays, originated by Brown at Stanford [11], are often homemade, inexpensive, and two-color, that is, they compare two samples on the same chip. They are easy to customize in-house for specific applications. The commercial synthetic oligonucleotide microarrays are pricier but tend to be more reliable. The microarray community has established a set of guidelines known as the "MIAME" rules (minimum information about a microarray experiment) to allow comparisons of data obtained using different microarrays, sample handling techniques, quality of data, etc.
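For the two-color arrays mentioned above, each spot yields a pair of dye intensities, conventionally summarized as the log ratio M and the mean log intensity A. A minimal sketch (the gene names and intensities are invented for illustration):

```python
import math

def ma_values(red, green):
    """Compute M (log2 ratio) and A (mean log2 intensity) for one spot of a
    two-color array, given red (sample) and green (reference) intensities."""
    m = math.log2(red / green)        # relative expression between samples
    a = 0.5 * math.log2(red * green)  # overall spot brightness
    return m, a

# Hypothetical spot intensities for three genes on one chip.
spots = {"KRT14": (5200.0, 1300.0), "FLG": (800.0, 800.0), "IVL": (600.0, 2400.0)}
for gene, (r, g) in spots.items():
    m, a = ma_values(r, g)
    print(f"{gene}: M={m:+.2f} A={a:.2f}")
```

Plotting M against A for all spots (an "MA plot") is the standard first look at a two-color hybridization, since dye biases show up as a systematic drift of M away from zero.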

In the field of skin biology, DNA microarrays have been used to identify the genes specific for epidermal stem cells. Moreover, the transcriptional changes occurring during the process of epidermal differentiation have been characterized. The consequences of epidermal barrier disruption have been defined. Importantly, cultures of epidermal keratinocytes *in vitro* have been used in many studies, by our group as well as by many others, because these cells respond to ultraviolet (UV) light, hormones and vitamins, inflammatory and immunomodulating cytokines, chemokines and growth factors, environmental toxins, microbes, physical injury, etc.

As the medical field presses forward in the direction of personalized medicine, we can anticipate that the skinomics approaches will be shortly applied at the bedside, directly to the personalized dermatology practice of the future.

### **2. Historical perspective**


The first microarrays were developed at Stanford University by Dr. Pat Brown and his group [11]. Soon thereafter, they applied this methodology to skin biology [12]. Specifically, using the well‐known model to achieve cell cycle synchronization by serum deprivation and then re‐introduction of serum to the culture medium, Iyer et al. [12] characterized the timing and choreography of cell cycle gene regulation in synchronized cultures. Unexpectedly, Iyer et al. [12] also found that dermal fibroblasts respond to signals from serum by inducing specifically the wound‐healing responses. In retrospect, this result makes perfect sense because dermal fibroblasts are not exposed to serum, except right after wounding, when they are required to mount an appropriate response.

Another Stanford team, the group of Dr. Paul Khavari, was the first to use microarrays in dermatology [13, 14]. They used microarrays to follow the outcomes of gene therapy for junctional epidermolysis bullosa, a lethal genetic skin disorder. The microarray analysis showed that the normal gene expression has not been completely reestablished, although the replacement of the affected gene restored cell growth and adhesion.

Subsequently, melanomas and skin carcinomas have been investigated using microarrays, as have certain inflammatory diseases, such as psoriasis and eczema, as well as responses to allergens and irritants, effects of UV light, skin aging, wound healing, keloid formation, etc. The large body of skinomics data was used in meta-analyses. For example, Dr. Noh and his team were the first to use meta-analysis of microarray data in dermatology [15]. Such meta-analyses include our own work as well [16, 17].

### **3. Noninvasive sample acquisition**

A very important advantage of skin-oriented research is that samples can be acquired from skin completely noninvasively and almost painlessly. Based on the work of Drs. Morhenn, Benson, and others, it was demonstrated that simple tape stripping provides sufficient quantity and quality of RNA for use in microarrays [18]. The methodology has been useful in studies of psoriasis, melanoma, etc. [19, 20]. Because of this noninvasive access to tissue, dermatology can be expected to lead further advances toward "omics" techniques. These will be directly applicable to personalized medicine in the future.

### **4. Epidermal differentiation**

Epidermal keratinocytes are "multifunctional" cells: on the one hand, they must differentiate through a tightly choreographed, multistage process in order to create cornified envelopes, the unique three-dimensional structures of the stratum corneum (**Figure 1**); on the other hand, keratinocytes must respond to very many extracellular environmental stimuli, ranging from UV light and chemical irritants to bacteria and viruses. Keratinocytes also must communicate with nearby cells, including other keratinocytes, melanocytes, dendritic cells, and others, both sending signals to these cells and receiving signals from them. As a result, keratinocytes have a large transcriptome, and they express relatively many genes. To determine which of the expressed genes are inherent to all keratinocytes, which are specific for the layers of epidermal differentiation, and which are induced extracellularly, we compared the transcriptomes of skin harvested from human subjects, of differentiated epidermis reconstructed in three-dimensional culture *in vitro*, of keratinocytes cultured as monolayers, and of nonkeratinocyte cell types.

Under all conditions, the keratinocytes express many proteases and protease inhibitors. Skin and the three-dimensional constructs, but not keratinocyte monolayers, express epidermal differentiation markers, including filaggrin, involucrin, loricrin, and other cornified envelope components. Skin specifically expresses a large number of transcription factors, cell surface receptors, and secreted proteins. Surprisingly, the mitochondrial genes were significantly suppressed in skin, which suggests a low metabolic rate. In culture, keratinocytes amply express the cell cycle and DNA replication machinery, and also integrins and extracellular matrix proteins. These results define the expressed and regulated genes in epidermal keratinocytes [21].
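Comparisons such as these reduce, at their simplest, to set operations over per-condition lists of expressed genes. A toy illustration (the gene lists are invented, not the actual results of [21]):

```python
# Illustrative (not actual data): genes called "expressed" in each sample type.
skin = {"FLG", "IVL", "LOR", "KLK7", "SERPINB3", "TP63"}
constructs_3d = {"FLG", "IVL", "LOR", "KLK7", "MKI67", "TP63"}
monolayers = {"KLK7", "MKI67", "CCNB1", "ITGB4", "TP63"}

# Differentiation-associated pattern: expressed in skin and 3D constructs,
# but absent from proliferating monolayer cultures.
differentiation_markers = (skin & constructs_3d) - monolayers

# Shared by all three conditions (e.g., the proteases/inhibitors in the text).
core = skin & constructs_3d & monolayers

print(sorted(differentiation_markers))  # → ['FLG', 'IVL', 'LOR']
print(sorted(core))                     # → ['KLK7', 'TP63']
```

The same intersections and differences, applied genome-wide, yield the condition-specific gene lists described in the text.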


**Figure 1.** Cross‐section through skin. Only the top portion of the dermis is shown, the rest of the dermis extends far below. The layers of the epidermis are marked on the left.

The regulatory circuits that control epidermal differentiation have been a focus of significant research efforts. Inhibition of Jun N-terminal kinase, JNK, in keratinocytes *in vitro* stimulates virtually all aspects of *in vivo* epidermal differentiation, including withdrawal from the cell cycle, cessation of motility, stratification, and production of cornified envelopes [22]. Inhibiting JNK also induces the expression of genes responsible for lipid and steroid metabolism, mitochondrial proteins, and histones. Simultaneously, the transcripts for basal cell markers are suppressed, including those for integrins, hemidesmosomes, and extracellular matrix (ECM) components. We found that in the promoter sequences of JNK-regulated genes, the forkhead family binding sites and the c-Fos binding sites are overrepresented [22].
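Overrepresentation of a binding site among the promoters of regulated genes is commonly scored with a one-sided hypergeometric (Fisher) enrichment test; whether this is the exact statistic used in [22] is not stated there, so treat the following as a generic sketch with invented counts:

```python
from math import comb

def hypergeom_enrichment_p(total_genes, genes_with_site, regulated, regulated_with_site):
    """P(observing >= regulated_with_site motif-carrying genes among the
    regulated set, drawn without replacement from total_genes promoters of
    which genes_with_site carry the motif)."""
    p = 0.0
    upper = min(regulated, genes_with_site)
    for k in range(regulated_with_site, upper + 1):
        p += (comb(genes_with_site, k)
              * comb(total_genes - genes_with_site, regulated - k)
              / comb(total_genes, regulated))
    return p

# Invented counts: 10,000 promoters, 500 carry a forkhead-like site,
# 200 JNK-regulated genes of which 30 carry the site.
p = hypergeom_enrichment_p(10_000, 500, 200, 30)
print(f"enrichment p = {p:.2e}")  # far below 0.05, so the site is overrepresented
```

With 500 of 10,000 promoters carrying the site, only about 10 of 200 regulated genes would carry it by chance; observing 30 therefore gives a vanishingly small p-value.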

Vitamin D and calcium promote epidermal differentiation [23]. Specifically, kallikreins, serpins, and c-Fos were found to be vitamin D-responsive genes with roles in epidermal differentiation. Seo et al. [24] identified a subset of calcium-regulated genes in human keratinocytes. Conversely, retinoic acid and its analogs inhibit differentiation of epidermal keratinocytes [25]. Microarray analysis identified Rho as another signaling molecule that suppresses differentiation-associated genes [26]. The papillomavirus type 16, HPV-16, E6 oncoprotein inhibits keratinocyte differentiation and suppresses transglutaminase, involucrin, elafin, and keratins [27]. Different classes of HPVs have different effects on cellular transcription patterns [28].

TGFβ promotes the basal, undifferentiated phenotype in keratinocytes via the SMAD4 transcription factor [29]. Interestingly, TGFβ induced the cell cycle arrest and migration genes via SMAD4, but not the epithelial–mesenchymal transition-associated genes, which were not SMAD4-dependent; this suggests that a loss of SMAD4 in human carcinomas may interfere with the tumor-suppressive responses to TGFβ while maintaining the tumor-promoting ones.

To study the epidermal differentiation transcriptome *in vivo*, we have separated the basal and suprabasal layers of skin and compared the transcriptomes of the two cell populations [28]. The human skin samples, otherwise discarded after reduction mammoplasty, are obtained usually within 2–6 h after surgery. The adipocytes and most of the dermis are physically removed, leaving ∼0.2 mm of mostly the epidermis. After enzymatic treatment, the epidermis is gently separated and a single-cell suspension is derived using trypsin [16, 30, 31]. Magnetic beads attached to an integrin β4 antibody are used to collect basal keratinocytes, while the nonadherent, β4-negative cells represent the suprabasal cell populations. We disrupt the epidermal cells and isolate the RNA using Trizol reagent. Next, we homogenize the cell extracts using QIAshredders, remove DNA by on-column DNase digestion, and prepare the RNA using RNeasy kits from Qiagen. As quality control, we visualize the 28S and 18S ribosomal bands and check that the OD260/280 spectrophotometric ratio is at least 1.8. In the next step, ∼5 µg of RNA is labeled according to the Affymetrix-suggested protocols.
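The quality gates in this protocol (an OD260/280 ratio of at least 1.8 and roughly 5 µg of RNA for labeling) are simple numeric checks; a sketch, with invented sample names and spectrophotometer readings:

```python
def rna_passes_qc(od260, od280, yield_ug, min_ratio=1.8, required_ug=5.0):
    """Apply the purity and quantity gates described in the text:
    an OD260/OD280 ratio of at least `min_ratio` (protein-contamination
    check) and enough total RNA (`required_ug`) for array labeling."""
    ratio = od260 / od280
    return ratio >= min_ratio and yield_ug >= required_ug

# Hypothetical readings: (OD260, OD280, total RNA in µg) per preparation.
samples = {
    "basal_beads": (0.95, 0.50, 8.2),  # ratio 1.90, ample RNA -> pass
    "suprabasal":  (0.90, 0.48, 6.1),  # ratio ~1.88 -> pass
    "degraded":    (0.80, 0.50, 4.0),  # ratio 1.60, low yield -> fail
}
for name, (a260, a280, ug) in samples.items():
    print(name, "PASS" if rna_passes_qc(a260, a280, ug) else "FAIL")
```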

### **5. The skin microbiome**

Skin, our outermost layer, represents the first line of defense against pathogenic microbes. The intimate contact between the skin and the infectious microbial world has been known since biblical days, was already discussed by Hippocrates, and has been studied for a very long time [32–34]. Microbes were perceived primarily as pathogens, which, fulfilling Koch's postulates, can cause acne, impetigo, folliculitis, etc. As a result, skin microbes have been treated with disinfectants and antibiotics [35–37]. Knowledge of skin microorganisms was deficient because it was based on *in vitro* culturing of these microbes. While a few cutaneous bacteria and fungi could be grown in laboratories, the vast majority of microorganisms known to reside on human skin was, and still is, uncultivable [38]. Recent advances in large-scale DNA sequencing led to major breakthroughs in defining the cutaneous microbiome. Specifically, the 16S small-subunit ribosomal RNA in prokaryotes and the 18S in eukaryotes are encoded in the genomes of all living organisms; the genes encoding these RNAs contain stretches conserved across taxa flanking variable regions, so that a single set of PCR primers can amplify many taxa at once while the variable regions provide unambiguous identification of individual species [39].
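Universal rRNA primers work because they are written against conserved stretches, with IUPAC ambiguity codes at positions that vary between taxa. The matching logic can be sketched as follows (the primer and target sequences below are invented, not actual 16S primers):

```python
# IUPAC nucleotide ambiguity codes -> the bases each code matches.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "GC", "W": "AT",
         "K": "GT", "M": "AC", "N": "ACGT"}

def primer_matches(primer, site):
    """True if a degenerate primer matches a genomic site of equal length."""
    return len(primer) == len(site) and all(
        base in IUPAC[code] for code, base in zip(primer, site))

# Invented primer and conserved 16S-like sites from two taxa: one degenerate
# position (R = A or G) lets a single primer amplify both.
primer = "AGRGTT"
assert primer_matches(primer, "AGAGTT")      # taxon 1
assert primer_matches(primer, "AGGGTT")      # taxon 2
assert not primer_matches(primer, "AGTGTT")  # mismatch at the R position
print("one degenerate primer matches both conserved sites")
```

The amplified variable region between two such conserved anchors is then sequenced and compared against reference databases to assign each read to a taxon.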

The major breakthrough in defining the cutaneous microbiome occurred in 2007 when Dr. M. Blaser's laboratory at New York University, USA, published the molecular analysis of the superficial skin bacterial biota of the human forearm [40]. In this work, Gao et al. found that the cutaneous microbiome consists predominantly of six bacterial genera: *Propionibacterium*, *Corynebacterium*, *Staphylococcus*, *Streptococcus*, *Acinetobacter*, and *Finegoldia*. These genera were present in all subjects and represented 63% of all DNA clones analyzed. Approximately 8% of clones represented previously unknown organisms. Some 300 different bacteria inhabit our skin, although we humans demonstrate remarkable interpersonal variation in our cutaneous microbiota. Different body sites harbor specific microbial patterns that characteristically change in skin diseases [40]. Four phyla, *Actinobacteria*, *Firmicutes*, *Proteobacteria*, and *Bacteroidetes*, constitute the vast majority of skin bacteria, while *Malassezia* dominates the skin fungi (**Figure 2**). The somewhat more complex fungal 18S RNA genes can be used to identify fungi, yeasts, and other eukaryotes, and they have been used to confirm the abundance of *Malassezia* species on human skin [41, 42]. Current understanding of the cutaneous microbes has shown that they are, for the most part, commensal and beneficial, useful and protective; only rarely do dysbiosis and pathogenic infection occur.
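The percentages reported in such clone-library studies are relative abundances: counts of taxonomy assignments divided by the total number of clones. A minimal sketch with invented clone assignments (not the actual data of [40]):

```python
from collections import Counter

# Invented taxonomy assignments for 12 sequenced clones from one subject.
clone_genera = (["Propionibacterium"] * 4 + ["Staphylococcus"] * 3
                + ["Corynebacterium"] * 2 + ["Streptococcus"] * 2 + ["unknown"])

counts = Counter(clone_genera)
total = len(clone_genera)
abundance = {genus: n / total for genus, n in counts.items()}

# Report genera from most to least abundant.
for genus, frac in sorted(abundance.items(), key=lambda kv: -kv[1]):
    print(f"{genus:20s} {frac:6.1%}")
```

Summing such fractions over the six dominant genera, across all subjects, is how figures like "63% of all DNA clones" are obtained.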


**Figure 2.** Bacterial populations in the cutaneous microbiota. Note the predominance of *Actinobacteria*, *Firmicutes*, and *Proteobacteria.* The numbers refer to different species, or "operational taxonomic units" detected.

Different body sites harbor different bacterial complexes. In a very significant microbiome analysis, Dr. Segre and her collaborators described microbiomes from 20 different body sites of 10 healthy individuals [43]. From such data, the authors were able to draw several important conclusions regarding our cutaneous microbiome. Intrapersonal variation between symmetrical sites, that is, left vs. right forearms, was much less than the interpersonal variation. Perhaps unexpectedly, the protected sites, such as the inguinal and alar creases, were more related than were the freely exposed sites, such as forearms. Re-sampling resulted in higher intrapersonal similarity, suggesting a fairly high consistency of the microbiome during the 4–6 month duration of the study.
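Statements such as "left vs. right forearms differ less than person vs. person" rest on a community dissimilarity measure; Bray-Curtis dissimilarity is a common choice for such comparisons, though [43] may have used other metrics as well. A sketch with invented genus counts:

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles
    (dicts mapping taxon -> count): 0 = identical, 1 = no taxa shared."""
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    return 1 - 2 * shared / total

# Invented genus counts: two forearms of one subject, plus a second subject.
left_arm_s1  = {"Propionibacterium": 40, "Staphylococcus": 30, "Corynebacterium": 30}
right_arm_s1 = {"Propionibacterium": 38, "Staphylococcus": 34, "Corynebacterium": 28}
left_arm_s2  = {"Propionibacterium": 10, "Staphylococcus": 60, "Streptococcus": 30}

intra = bray_curtis(left_arm_s1, right_arm_s1)
inter = bray_curtis(left_arm_s1, left_arm_s2)
print(f"intrapersonal {intra:.3f} < interpersonal {inter:.3f}")
assert intra < inter  # symmetric sites within a person are more similar
```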

The types of the microbiota were classified according to the type of skin as sebaceous, moist, or dry [43]. These sites had different proportions of *Actinobacteria*, *Firmicutes*, *Proteobacteria*, and *Bacteroidetes*. In addition to sebaceous/moist/dry, it is possible to construct additional classifications of the skin microbiota: while overall *Corynebacteria* and *Propionibacteria* are the most common genera, *Actinobacteria* are especially common on the UV-exposed sites, the glabella, nares, and occiput; these genera are not as common on sun-protected skin. Moreover, the *Proteobacteria* are particularly common on human arms, that is, in the axillae, on the forearms and palms, and in the interdigital spaces. Skin sites that are commonly subject to stretching and flexing, including the fingers, toe webs, popliteal and antecubital fossae, inguinal crease, and occiput, are particularly rich in *Staphylococci*. The gluteal crease and the toe webs are particularly rich in *Micrococci*. As suggested by Segre et al. [43], the human skin contains multiple and varied niches that host multiple and varied microbiota.

### **6. UV damage**

UV light is a major environmental carcinogen. Photodamage of skin results in thinning, wrinkling, keratosis, and ultimately malignancy. Several groups, including ours, have analyzed the transcriptional responses to UV light in human epidermal keratinocytes [44–48]. Keratinocytes respond to UV by inducing a cell‐autonomous repair program. However, the keratinocytes must also protect the underlying organism (**Figure 3**). The early transcriptional changes, in the first two hours after UV illumination, involve transcription factors and signal‐transducing and cytoskeletal proteins, resulting in a "paused" phenotype. This allows keratinocytes to assess the damage and commence repair functions. At 4–8 h post‐irradiation, keratinocytes secrete signaling peptides, growth factors, cytokines, and chemokines; these serve to alert the underlying tissue to the UV damage. Subsequently, 16–24 h after treatment, the cornified envelope proteins are produced as keratinocytes terminally differentiate. This has two beneficial effects: it boosts the stratum corneum, the protective inert layer of the epidermis, and it removes cells containing potentially UV‐damaged DNA, a carcinogenic threat [44]. The results from several laboratories were quite congruent, especially considering the differences in experimental approaches, countries of origin, and time frames of the experiments [44–48].

These *in vitro* studies have been followed up by studies of UV‐irradiated skin in human volunteers *in vivo* [49, 50]. *In vivo*, markers of keratinocyte activation, such as keratin K6 and the S100A proteins, were prominently induced by UV, as were the DNA repair proteins. Interestingly, keratinocytes exposed to gamma or X‐ray irradiation show transcriptional changes similar to those in UV‐treated cells [51, 52], specifically induction of the genes involved in cell energy metabolism. An interesting study compared samples of lentigines with adjacent sun‐exposed skin and with matched samples of sun‐protected buttock skin [53]; the genes specifically upregulated in solar lentigo included melanocyte‐related genes, genes related to fatty acid metabolism, and genes related to inflammation.


**Figure 3.** Transcriptional effects of UV light on epidermal keratinocytes. The time course was followed for the first 24 h post‐illumination. The UV‐treated keratinocytes repair the damage they suffered, as well as alert the underlying tissue that injury has occurred.

### **7. Skin aging**


160 Bioinformatics - Updated Features and Applications


Skin is the most obvious yardstick of aging. Molecular comparisons of transcriptional profiles of young vs. aged and of sun‐protected vs. sun‐exposed skin indicate that photoaging and chronological aging, although partly overlapping, have different, characteristic features [54]. Genes associated with skin aging were identified by comparing foreskin keratinocytes from young (3–4 years of age) and old (68–72 years) subjects [55]. A total of 105 genes changed; for example, epidermal differentiation and keratinocyte activation markers were overexpressed in the aged skin, while immune response, cell cycle, and extracellular matrix‐associated genes were overexpressed in keratinocytes from young skin. Proteomic profiling using two‐dimensional gel electrophoresis to compare young and old foreskin samples identified additional markers of intrinsic aging, including aging‐related posttranslational protein modifications [56].

One of the hallmarks of aging skin is impaired wound healing. Using microarrays to compare gene expression in wounds of elderly vs. young humans, it was found that the differences appear to be related to regulation by estrogen [57]. These results suggest that estrogen has a profound influence on skin aging in general, and particularly in the context of wound healing. Another hallmark of aging is the graying of hair; analysis of differential gene expression among pigmented, gray, and white scalp hair follicles identified close to 200 upregulated and as many downregulated genes in human gray hair. As expected, the melanogenesis genes and the structural genes of the melanosome are overrepresented among the regulated genes [58].

### **8. Skinomics genome‐wide association studies**

Genome‐wide association studies, GWAS, comprise examination of many common DNA polymorphisms in a large population cohort to detect associations of polymorphisms with a given disease. Such polymorphisms can point to the genes where disease‐causing mutations may map. GWAS are particularly useful in the analysis of diseases, such as psoriasis, that are common and have a strong genetic component. Psoriasis is a hyperproliferative autoimmune skin disease involving keratinocytes and T cells [9]. A successful GWAS analysis of the psoriasis susceptibility loci in the human genome has been accomplished by an extensive multinational effort (**Figure 4**) and reported in a set of manuscripts [59–66]. A total of 36 loci have been associated with psoriasis in European populations, with additional ones detected in the Chinese population [67, 68]. Other skin diseases, for example, eczema, have also been studied using GWAS; eczema was associated with several genetic loci, including the major histocompatibility complex (MHC) on chromosome 6 and the epidermal differentiation complex, EDC, on chromosome 1. The human filaggrin gene, known to be associated with eczema, is encoded within the EDC.

**Figure 4.** Genome‐wide association studies of psoriasis. While many loci have been identified, and additional ones keep being reported, the most prominent ones on chromosomes 1, 6, and 21 are marked with arrows.

Carcinomas and melanoma have also been analyzed using GWAS. Basal cell and squamous cell carcinomas have both shared and specific susceptibility loci [69]. The carcinoma GWAS loci are not associated with melanoma risk. GWAS have also identified loci important for human skin pigmentation [70].
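The association test at the heart of a GWAS can be illustrated with a toy calculation: for each polymorphism, allele counts in cases are compared with those in controls, for instance with a 2 × 2 chi‐square statistic. The sketch below is written in Python for illustration, with invented counts; it is not the actual methodology of the cited studies.

```python
def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """2x2 chi-square statistic for counts of alleles A and B in cases vs. controls."""
    table = [[case_a, case_b], [ctrl_a, ctrl_b]]
    total = case_a + case_b + ctrl_a + ctrl_b
    row = [case_a + case_b, ctrl_a + ctrl_b]   # allele totals per group
    col = [case_a + ctrl_a, case_b + ctrl_b]   # totals per allele
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total  # expected count under no association
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical SNP with allele A enriched in cases: a large statistic
# (relative to the 1-degree-of-freedom chi-square distribution) flags association.
print(round(allelic_chi_square(300, 700, 200, 800), 2))  # → 26.67
```

In a real GWAS, this test (or a logistic‐regression equivalent) is repeated for hundreds of thousands of polymorphisms, with a genome‐wide significance threshold correcting for the multiple comparisons.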

### **9. Transcriptional studies of melanoma**


Melanoma is one of the most aggressive human cancers, which is why it was among the earliest cancers studied using microarrays, only 1 year after the first report of cDNA microarrays [11, 71]. A great deal of attention has been devoted to melanoma diagnosis and the natural history of its progression. Multiple studies compared healthy melanocytes to melanoma tumor cells, while other studies examined the differences between melanoma lines that differ in their metastatic potential [71–82]. Some of the differentially expressed genes are, as expected, encoded in the chromosomal regions identified by GWAS as commonly altered in melanomas. Osteopontin, for example, was identified as an overexpressed marker of invasive melanoma.

Microarrays have also been used to compare the transcriptomes of benign moles and melanomas. The metastatic samples exhibited two distinct patterns of gene expression, and these were similar to the microdissected nodular vs. flat components of large primary melanomas [74]. The epigenetic characteristics of melanomas, such as differences in the genome, its methylation, and the expression of miRNAs, have also received significant attention. The abundance of primary, raw skinomics data enabled meta‐analysis approaches, which consolidated the findings from many studies and established specific "melanoma signature" gene sets; these could become useful in detection, classification, and outcome prediction for melanomas [83].

### **10. Wound healing studies**

A very active area of skinomics research relates to wound healing, a multi‐step process involving coordinated and interacting regulatory pathways [84–86]. For example, we have demonstrated that microarrays can be utilized at the bedside to guide surgical debridement of non‐healing wounds [87]. Specifically, the healing edges of ulcers express keratinocyte markers, whereas the non‐healing ones express the dermal and inflammatory markers [88]. Diabetic foot ulcers and chronic venous ulcers have distinct transcriptional profiles [88, 89]. The microbiomes of chronic wounds and wound healing have also received attention [90].

### **11. Inflammation, cytokines, and chemokines**

Psoriasis has been linked with multiple cytokines and chemokines, including IL‐1, IL‐12, IL‐17, tumor necrosis factor (TNF)‐α, interferon (IFN)‐γ, and oncostatin M. Therefore, the transcriptional effects of these signaling proteins on epidermal keratinocytes in culture have been studied extensively [91–97]. Transcriptional profiling studies of the effects of corticosteroids, anti‐inflammatory agents, demonstrated their inhibition of the TNF‐α, IFN‐γ, and IL‐1 pathways [98]. However, the anti‐inflammatory effects of corticosteroids have a characteristic choreography and phasing: the earliest are the anti‐TNF‐α effects, evident already in the first hour. These are followed by the anti‐IL‐1 effects, peaking between 24 and 48 h. Finally, the anti‐IFN‐γ effects occur later, at 72 h.

In separate studies, comparisons of eczema and psoriasis showed that the expression of many antimicrobial proteins of the innate immune response is relatively decreased in eczema, which could explain the greater susceptibility to infection in eczema than in psoriasis [99–103].

### **12. Psoriasis transcriptome**

The studies of psoriasis, a paradigmatic inflammatory skin disease, have provided several hundred patient samples from several laboratories [8, 9]. This provided data for a meta‐analysis of the psoriatic transcriptome [16]. Microarray data can be obtained from annotated and curated repositories. The two main data repositories that collect and annotate transcription profiling using microarrays are NIH‐GEO [104] and ArrayExpress [105]. The two largely overlap, but have a few differences, that is, studies present in one but not the other collection. Additional datasets exist in proprietary databases, but we found that searching these is time consuming and usually unproductive.

**Figure 5.** Quality control features of RMAExpress: before removal of a "bad" chip (left) and after (right).

We searched GEO Datasets for the key term "Psoriasis," selecting "Homo sapiens" as the organism; from the results, we selected nine studies that met our criteria, namely, comparing lesional and nonlesional psoriatic skin. These nine studies included seven Affymetrix‐based microarrays and one each of Sentrix and Illumina microarrays, with a combined total of 645 samples. The Affymetrix .CEL or .TXT files from these studies were downloaded, unzipped, and log2 transformed. The resulting datasets were combined and analyzed using RMAExpress for quality control [106]. RMAExpress is a freely available program [107] that computes gene expression values from Affymetrix chips using the Robust Multichip Average (RMA) protocol; it can also perform chip quality assessments.
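The preprocessing just described, log2 transformation of the raw values followed by combining the datasets, can be sketched in Python for illustration (toy matrices; the actual analysis used RMAExpress on the Affymetrix files):

```python
from math import log2

def log2_transform(matrix):
    """Log2-transform a genes x samples matrix of raw intensities."""
    return [[log2(v) for v in row] for row in matrix]

def combine(*matrices):
    """Concatenate the sample columns of datasets that share the same gene order."""
    return [sum(rows, []) for rows in zip(*matrices)]

d1 = log2_transform([[8.0, 16.0], [2.0, 4.0]])  # dataset 1: 2 genes x 2 samples
d2 = log2_transform([[32.0], [1.0]])            # dataset 2: same 2 genes, 1 sample
print(combine(d1, d2))  # → [[3.0, 4.0, 5.0], [1.0, 2.0, 0.0]]
```

The log2 scale puts fold changes on a symmetric, additive scale, which the downstream rank‐based statistics expect.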


We cannot overemphasize the importance of data normalization and selection when conducting meta‐analyses. One of the most important, but often overlooked, tasks is to perform quality control on the microarrays deposited in the databanks. The quality control graphs can be viewed in the RMAExpress program, the most useful routine being the normalized unscaled standard error (NUSE). Gene chips with NUSE medians more than 5% different from the other chips should be excluded (**Figure 5**) [31]. Such microarray chips usually stem from poor quality input RNA.
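The NUSE‐based exclusion rule can be sketched in Python for illustration (hypothetical NUSE medians; RMAExpress itself computes and plots the NUSE values):

```python
from statistics import median

def flag_bad_chips(nuse_medians, tolerance=0.05):
    """Return indices of chips whose NUSE median deviates more than
    `tolerance` (default 5%) from the overall median across chips."""
    center = median(nuse_medians)
    return [i for i, m in enumerate(nuse_medians)
            if abs(m - center) / center > tolerance]

# Hypothetical NUSE medians for six chips; chip 4 stems from degraded RNA
print(flag_bad_chips([1.00, 1.01, 0.99, 1.02, 1.12, 1.00]))  # → [4]
```

On a good chip the NUSE median sits near 1.0; a chip hybridized with poor RNA shows uniformly inflated standard errors, so its NUSE median stands out even against imperfect neighbors.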

In the case of non‐Affymetrix studies, the simplest approach is to download the TXT data files directly from PubMed [108]. Conveniently, downloading all compressed files into the same directory allows batch uncompressing of the TXT files, which can then be opened in Excel for further analysis. We often use the DataLoader add‐in [109] to assign gene annotations to the expression data. A considerable stumbling block for meta‐analysis is the merging of data from different microarray platforms. The difficulties include different numbers of genes on different platforms, different levels of redundancy in probing the genes, and inconsistencies in identifying genes.
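The batch uncompressing step might look like the following Python sketch (hypothetical file naming; it assumes gzip‐compressed .TXT files collected in one directory):

```python
import gzip
import shutil
from pathlib import Path

def uncompress_all(directory):
    """Decompress every .txt.gz file in a directory, keeping the originals.

    Returns the sorted names of the files written."""
    extracted = []
    for gz_path in Path(directory).glob("*.txt.gz"):
        out_path = gz_path.with_suffix("")  # GSMxxx.txt.gz -> GSMxxx.txt
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)   # stream the decompressed bytes out
        extracted.append(out_path.name)
    return sorted(extracted)
```

The decompressed tab‐delimited files can then be opened directly in Excel or read into an analysis script.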

To harmonize gene IDs across platforms, we find BioMart useful [110]; the downloaded files can be opened in Excel. We also find the DataLoader add‐in [109] very useful for combining similar data from different spreadsheets. Where the smaller arrays do not provide a value for a given gene, we simply add 1. This does not affect the subsequent steps of the analysis.
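Combining tables from platforms of different sizes, with the placeholder value of 1 for genes a smaller array lacks, can be sketched in Python (hypothetical gene IDs and values):

```python
def merge_platforms(tables, fill=1):
    """Merge {gene_id: [values]} tables from different platforms.

    Genes absent from a platform get the placeholder `fill` in each of
    that platform's sample columns, so every row ends up the same width."""
    widths = [len(next(iter(t.values()))) for t in tables]  # samples per platform
    genes = sorted(set().union(*tables))
    return {g: sum((t.get(g, [fill] * w) for t, w in zip(tables, widths)), [])
            for g in genes}

affy = {"FLG": [5.1, 5.3], "IL17A": [7.9, 8.2]}  # platform 1: 2 samples
other = {"FLG": [4.8]}                           # platform 2: 1 sample, no IL17A probe
print(merge_platforms([affy, other]))
# → {'FLG': [5.1, 5.3, 4.8], 'IL17A': [7.9, 8.2, 1]}
```

Because the downstream selection is rank‐based, a constant placeholder in the missing columns leaves the within‐array rankings of the measured genes untouched.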

To select differentially expressed genes, we use RankProd, a nonparametric method [30]. The RankProd method combines different datasets, thus increasing the number of differentially expressed genes identified. Here we present a simple simulated set of RankProd commands for an imaginary analysis of three datasets with 5 + 5, 6 + 24, and 2 + 2 microarrays (psoriatic lesional + nonlesional [14]):

```
memory.size(max = FALSE)    # report current memory usage (Windows only)
memory.limit(size = 24000)  # raise the memory ceiling (Windows only)

library(RankProd)
data(Your_txt_file)         # the combined lesional/nonlesional expression table

# Numbers of lesional and nonlesional arrays in each dataset
n1 <- 5;  n2 <- 5   # dataset 1: 5 lesional + 5 nonlesional
n3 <- 6;  n4 <- 24  # dataset 2: 6 lesional + 24 nonlesional
n5 <- 2;  n6 <- 2   # dataset 3: 2 lesional + 2 nonlesional

# Class labels: 0 = lesional, 1 = nonlesional, in dataset order
cl <- rep(c(0, 1, 0, 1, 0, 1), c(n1, n2, n3, n4, n5, n6))
cl

# Move the gene IDs from the first column into the row names
rownames(Your_txt_file) <- Your_txt_file[, 1]
Your_txt_file <- Your_txt_file[, -1]

# Dataset of origin for each of the 10 + 30 + 4 arrays
origin <- c(rep(1, 10), rep(2, 30), rep(3, 4))
origin

# Rank product analysis; rand fixes the random seed for reproducibility
RP.adv.out <- RPadvance(Your_txt_file, cl, origin, rand = 100)
plotRP(RP.adv.out, cutoff = 0.01)
topGene(RP.adv.out, cutoff = 0.01)
write.table(topGene(RP.adv.out, num.gene = 1000),
            row.names = TRUE, col.names = NA, file = "Your_txt_file.txt")
```
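Conceptually, RankProd ranks each gene by fold change within every comparison and combines the ranks as a geometric mean, so genes that are consistently near the top receive small rank products. A minimal Python sketch of that statistic for illustration (not the RPadvance implementation, which adds permutation‐based significance estimates):

```python
def rank_products(fold_changes):
    """fold_changes: {gene: [fold change in comparison 1, 2, ...]}.

    Returns each gene's rank product: the geometric mean of its
    upregulation ranks (1 = most upregulated) across comparisons."""
    genes = list(fold_changes)
    k = len(next(iter(fold_changes.values())))  # number of comparisons
    rp = {g: 1.0 for g in genes}
    for i in range(k):
        # Rank genes by fold change in comparison i, highest first
        ordered = sorted(genes, key=lambda g: -fold_changes[g][i])
        for rank, g in enumerate(ordered, start=1):
            rp[g] *= rank
    return {g: rp[g] ** (1 / k) for g in genes}

fc = {"KRT6A": [4.0, 3.5, 5.0],   # consistently upregulated
      "FLG":   [0.5, 0.4, 0.6],   # consistently downregulated
      "GAPDH": [1.0, 1.1, 0.9]}   # unchanged
print(rank_products(fc))  # KRT6A ranks first in every comparison: rank product 1.0
```

Because only ranks enter the statistic, datasets from different platforms and scales can be combined without delicate cross‐platform normalization, which is what makes the method attractive for meta‐analysis.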
To annotate the differentially expressed genes, we find the Database for Annotation, Visualization and Integrated Discovery (DAVID) extremely useful and convenient [31, 32, 111]. Starting with the uploaded gene list, DAVID makes available "tables" containing details known about the genes; "charts," which contain over‐represented pathways, ontological categories, etc.; as well as "clusters" of such categories, eliminating some of the redundancies and overlaps. The transcription factor binding sites in the listed genes can also be evaluated using DAVID, although the oPOSSUM programs are much more comprehensive, sophisticated, and convenient [39, 40, 112]. Many microarray data clustering programs are available; our favorite is MEV [113]. Generally, however, we did not find clustering informative for the analysis of data from psoriasis patients.

### **13. Conclusions**

One clear advantage of dermatology over other medical specialties is that, in the clinic, noninvasive, painless sampling using tape stripping can provide high‐quality samples for skinomics analysis. Already proven useful in the diagnosis of certain diseases, tape stripping will also be used in microbiome analyses to provide samples of the cutaneous viruses, bacteria, and fungi, alerting us to the presence of pathogens. Similar approaches can also detect cutaneous microbial imbalances. Moreover, disease treatments using microbes may be in our future! We cannot even imagine today the future developments in skinomics. In summary, great strides have already been achieved in skinomics, the omics technology applied to dermatology and skin biology. Skinomics techniques will eventually provide individualized, personalized treatments to dermatology patients. Exciting and wonderful times are ahead.

### **Author details**

Sidra Younis1,2, Valeriia Shnayder1 and Miroslav Blumenberg1\*

\*Address all correspondence to: miroslav.blumenberg@nyumc.org

1 The R.O.Perelman Department of Dermatology and Department of Biochemistry and Molecular Pharmacology, NYU Langone Medical Center, New York, USA

2 Department of Biochemistry, Quaid‐i‐Azam University, Islamabad, Pakistan

### **References**


[12] Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JC, et al. The transcriptional program in the response of human fibroblasts to serum. Science. 1999;283:83–7.

[13] Robbins PB, Sheu SM, Goodnough JB, Khavari PA. Impact of laminin 5 beta3 gene versus protein replacement on gene expression patterns in junctional epidermolysis bullosa. Hum Gene Ther. 2001;12:1443–8.

[14] Hinata K, Gervin AM, Jennifer Zhang Y, Khavari PA. Divergent gene regulation and growth effects by NF‐kappa B in epithelial and mesenchymal cells of human skin. Oncogene. 2003;22:1955–64.

[15] Noh M, Yeo H, Ko J, Kim HK, Lee CH. MAP17 is associated with the T‐helper cell cytokine‐induced down‐regulation of filaggrin transcription in human keratinocytes. Exp Dermatol. 2010;19:355–62. Epub 2009 Jul 8.

[16] Mimoso C, Lee DD, Zavadil J, Tomic‐Canic M, Blumenberg M. Analysis and meta‐analysis of transcriptional profiling in human epidermis. Methods Mol Biol. 2014;1195:61–97. DOI: 10.1007/7651_2013_60.

[17] Blumenberg M. Profiling and metaanalysis of epidermal keratinocytes responses to epidermal growth factor. BMC Genomics. 2013;14:85. DOI: 10.1186/1471‐2164‐14‐85.

[18] Morhenn VB, Chang EY, Rheins LA. A noninvasive method for quantifying and distinguishing inflammatory skin reactions. J Am Acad Dermatol. 1999;41:687–92.

[19] Wong R, Tran V, Morhenn V, Hung SP, Andersen B, Ito E, et al. Use of RT‐PCR and DNA microarrays to characterize RNA recovered by non‐invasive tape harvesting of normal and inflamed skin. J Invest Dermatol. 2004;123:159–67.

[20] Wachsman W, Morhenn V, Palmer T, Walls L, Hata T, Zalla J, et al. Noninvasive genomic detection of melanoma. Br J Dermatol. 2011;164:797–806. DOI: 10.1111/j.1365‐2133.2011.10239.x. Epub 2011 Mar 25.

[21] Gazel A, Ramphal P, Rosdy M, De Wever B, Tornier C, Hosein N, et al. Transcriptional profiling of epidermal keratinocytes: comparison of genes expressed in skin, cultured keratinocytes, and reconstituted epidermis, using large DNA microarrays. J Invest Dermatol. 2003;121:1459–68.

[22] Gazel A, Banno T, Walsh R, Blumenberg M. Inhibition of JNK promotes differentiation of epidermal keratinocytes. J Biol Chem. 2006;281:20530–41.

[23] Lu J, Goldstein KM, Chen P, Huang S, Gelbert LM, Nagpal S. Transcriptional profiling of keratinocytes reveals a vitamin D‐regulated epidermal differentiation network. J Invest Dermatol. 2005;124:778–85.

[24] Seo EY, Namkung JH, Lee KM, Lee WH, Im M, Kee SH, et al. Analysis of calcium‐inducible genes in keratinocytes using suppression subtractive hybridization and cDNA microarray. Genomics. 2005;86:528–38. Epub 2005 Aug 9.

[25] Baron JM, Heise R, Blaner WS, Neis M, Joussen S, Dreuw A, et al. Retinoic acid and its 4‐oxo metabolites are functionally active in human skin cells in vitro. J Invest Dermatol. 2005;125:143–53.


microbiology laboratories. Clin Microbiol Infect. 2008;14:908–34. DOI: 10.1111/j.1469‐0691.2008.02070.x.

[40] Gao Z, Tseng CH, Pei Z, Blaser MJ. Molecular analysis of human forearm superficial skin bacterial biota. Proc Natl Acad Sci U S A. 2007;104:2927–32. Epub 2007 Feb 9.

[41] Paulino LC, Tseng CH, Blaser MJ. Analysis of Malassezia microbiota in healthy superficial human skin and in psoriatic lesions by multiplex real‐time PCR. FEMS Yeast Res. 2008;8:460–71. Epub 2008 Feb 20.

[42] Paulino LC, Tseng CH, Strober BE, Blaser MJ. Molecular analysis of fungal microbiota in samples from healthy human skin and psoriatic lesions. J Clin Microbiol. 2006;44:2933–41.

[43] Grice EA, Kong HH, Conlan S, Deming CB, Davis J, Young AC, et al. Topographical and temporal diversity of the human skin microbiome. Science. 2009;324:1190–2.

[44] Li D, Turi TG, Schuck A, Freedberg IM, Khitrov G, Blumenberg M. Rays and arrays: the transcriptional program in the response of human epidermal keratinocytes to UVB illumination. FASEB J. 2001;15:2533–5.

[45] Sesto A, Navarro M, Burslem F, Jorcano JL. Analysis of the ultraviolet B response in primary human keratinocytes using oligonucleotide microarrays. Proc Natl Acad Sci U S A. 2002;99:2965–70.

[46] Murakami T, Fujimoto M, Ohtsuki M, Nakagawa H. Expression profiling of cancer‐related genes in human keratinocytes following non‐lethal ultraviolet B irradiation. J Dermatol Sci. 2001;27:121–9.

[47] Takao J, Ariizumi K, Dougherty II, Cruz PD Jr. Genomic scale analysis of the human keratinocyte response to broad‐band ultraviolet‐B irradiation. Photodermatol Photomed. 2004;20:129–37.

[48] Howell BG, Wang B, Freed I, Mamelak AJ, Watanabe H, Sauder DN. Microarray analysis of UVB‐regulated genes in keratinocytes: downregulation of angiogenesis inhibitor thrombospondin‐1. J Dermatol Sci. 2004;34:185–94.

[49] Enk CD, Shahar I, Amariglio N, Rechavi G, Kaminski N, Hochberg M. Gene expression profiling of in vivo UVB‐irradiated human epidermis. Photodermatol Photoimmunol Photomed. 2002;18:5–13.

[50] Enk CD, Jacob‐Hirsch J, Gal H, Verbovetski I, Amariglio N, Mevorach D, et al. The UVB‐induced gene expression profile of human epidermis in vivo is different from that of cultured keratinocytes. Oncogene. 2006;25:2601–14.

[51] Koike M, Shiomi T, Koike A. Identification of skin injury‐related genes induced by ionizing radiation in human keratinocytes using cDNA microarray. J Radiat Res (Tokyo). 2005;46:173–84.

[52] Lamartine J, Franco N, Le Minter P, Soularue P, Alibert O, Leplat JJ, et al. Activation of an energy providing response in human keratinocytes after gamma irradiation. J Cell Biochem. 2005;95:620–31.


[64] Ellinghaus E, Ellinghaus D, Stuart PE, Nair RP, Debrus S, Raelson JV, et al. Genome‐wide association study identifies a psoriasis susceptibility locus at TRAF3IP2. Nat Genet. 2010;42:991–5. Epub 2010 Oct 17.

[65] Sun LD, Cheng H, Wang ZX, Zhang AP, Wang PG, Xu JH, et al. Association analyses identify six new psoriasis susceptibility loci in the Chinese population. Nat Genet. 2010;42:1005–9. Epub 2010 Oct 17.

[66] Huffmeier U, Uebe S, Ekici AB, Bowes J, Giardina E, Korendowych E, et al. Common variants at TRAF3IP2 are associated with susceptibility to psoriatic arthritis and psoriasis. Nat Genet. 2010;42:996–9. Epub 2010 Oct 17.

[67] Tsoi LC, Spain SL, Knight J, Ellinghaus E, Stuart PE, Capon F, et al. Identification of 15 new psoriasis susceptibility loci highlights the role of innate immunity. Nat Genet. 2012;44:1341–8. DOI: 10.1038/ng.2467. Epub 2012 Nov 11.

[68] Julia A, Tortosa R, Hernanz JM, Canete JD, Fonseca E, Ferrandiz C, et al. Risk variants for psoriasis vulgaris in a large case–control collection and association with clinical subphenotypes. Hum Mol Genet. 2012;21:4549–57. Epub 2012 Jul 19.

[69] Nan H, Xu M, Kraft P, Qureshi AA, Chen C, Guo Q, et al. Genome‐wide association study identifies novel alleles associated with risk of cutaneous basal cell carcinoma and squamous cell carcinoma. Hum Mol Genet. 2011;20:3718–24. DOI: 10.1093/hmg/ddr287. Epub 2011 Jun 23.

[70] Nan H, Kraft P, Qureshi AA, Guo Q, Chen C, Hankinson SE, et al. Genome‐wide association study of tanning phenotype in a population of European ancestry. J Invest Dermatol. 2009;129:2250–7. DOI: 10.1038/jid.2009.62. Epub Apr 2.

[71] DeRisi J, Penland L, Brown PO, Bittner ML, Meltzer PS, Ray M, et al. Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat Genet. 1996;14:457–60.

[72] Dooley TP, Curto EV, Davis RL, Grammatico P, Robinson ES, Wilborn TW. DNA microarrays and likelihood ratio bioinformatic methods: discovery of human melanocyte biomarkers. Pigment Cell Res. 2003;16:245–53.

[73] Hoek K, Rimm DL, Williams KR, Zhao H, Ariyan S, Lin A, et al. Expression profiling reveals novel pathways in the transformation of melanocytes to melanomas. Cancer Res. 2004;64:5270–82.

[74] Haqq C, Nosrati M, Sudilovsky D, Crothers J, Khodabakhsh D, Pulliam BL, et al. The gene expression signatures of melanoma progression. Proc Natl Acad Sci U S A. 2005;102:6092–7. Epub 2005 Apr 15.

[75] Becker B, Roesch A, Hafner C, Stolz W, Dugas M, Landthaler M, et al. Discrimination of melanocytic tumors by cDNA array hybridization of tissues prepared by laser pressure catapulting. J Invest Dermatol. 2004;122:361–8.

[76] Gallagher WM, Bergin OE, Rafferty M, Kelly ZD, Nolan IM, Fox EJ, et al. Multiple markers for melanoma progression regulated by DNA methylation: insights from transcriptomic studies. Carcinogenesis. 2005;26:1856–67. Epub 2005 Jun 15.


[90] Grice EA, Segre JA. Interaction of the microbiome with the innate immune response in

[91] Finelt N, Gazel A, Gorelick S, Blumenberg M. Transcriptional responses of human

[92] Molenda M, Mukkamala L, Blumenberg M. Interleukin IL‐12 blocks a specific subset of the transcriptional profile responsive to UVB in epidermal keratinocytes. Mol

[93] Gazel A, Rosdy M, Bertino B, Tornier C, Sahuc F, Blumenberg M. A characteristic subset of psoriasis‐associated genes is induced by oncostatin‐M in reconstituted epidermis. J

[94] Banno T, Gazel A, Blumenberg M, Adachi M, Mukkamala L. Effects of tumor necrosis factor‐{alpha} (TNF{alpha}) in epidermal keratinocytes revealed using global tran‐

[95] Banno T, Gazel A, Blumenberg M. The use of DNA microarrays in dermatology

[96] Banno T, Adachi M, Mukkamala L, Blumenberg M. Unique keratinocyte‐specific effects of interferon‐gamma that protect skin from viruses, identified using transcriptional

[97] Haider AS, Peters SB, Kaporis H, Cardinale I, Fei J, Ott J, et al. Genomic analysis defines a cancer‐specific gene expression signature for human squamous cell carcinoma and distinguishes malignant hyperproliferation from benign hyperplasia. J Invest Derma‐

[98] Stojadinovic O, Lee B, Vouthounis C, Vukelic S, Pastar I, Blumenberg M, et al. Novel genomic effects of glucocorticoids in epidermal keratinocytes: inhibition of apoptosis, interferon‐gamma pathway, and wound healing along with promotion of terminal

[99] Nomura I, Goleva E, Howell MD, Hamid QA, Ong PY, Hall CF, et al. Cytokine milieu of atopic dermatitis, as compared to psoriasis, skin prevents induction of innate

[100] Nomura I, Gao B, Boguniewicz M, Darst MA, Travers JB, Leung DY. Distinct patterns of gene expression in the skin lesions of atopic dermatitis and psoriasis: a gene

[101] de Jongh GJ, Zeeuwen PL, Kucharekova M, Pfundt R, van der Valk PG, Blokx W, et al. High expression levels of keratinocyte antimicrobial proteins in psoriasis compared

differentiation. J Biol Chem. 2007;282:4021–34. Epub 2006 Nov 9.

microarray analysis. J Allergy Clin Immunol. 2003;112:1195–202.

with atopic dermatitis. J Invest Dermatol. 2005;125:1163–73.

immune response genes. J Immunol. 2003;171:326–9.

scriptional profiling. J Biol Chem. 2004;279:32633–42. Epub 2004 May 15.

epidermal keratinocytes to Oncostatin‐M. Cytokine. 2005;31:305–13.

chronic wounds. Adv Exp Med Biol. 2012;946:55–68.

Immunol. 2006;43:1933–40. Epub 2006 Feb 8.

Invest Dermatol. 2006;126:2647–57.

174 Bioinformatics - Updated Features and Applications

research. Retinoids. 2004;20:1–4.

tol. 2006;126:869–81.

profiling. Antivir Ther. 2003;8:541–54.


### **The Study of Hepatitis B Virus Using Bioinformatics**

Trevor Graham Bell and Anna Kramvis

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/63076

#### **Abstract**

Hepatitis refers to the inflammation of the liver. A major cause of hepatitis is the hepatotropic virus, hepatitis B virus (HBV). Annually, more than 786,000 people die as a result of the clinical manifestations of HBV infection, which include cirrhosis and hepatocellular carcinoma. Sequence heterogeneity is a feature of HBV, because the viral-encoded polymerase lacks proof-reading ability. HBV has been classified into nine genotypes, A to I, with a putative 10th genotype, "J," isolated from a single individual. Comparative analysis of HBV strains from various geographic regions of the world and from different eras can shed light on the origin, evolution, transmission, and response to anti-HBV preventative and treatment measures. Bioinformatics tools and databases have been used to better understand HBV mutations and how they develop, especially in response to antiviral therapy and vaccination. Despite its small genome size of ~3.2 kb, HBV presents several bioinformatic challenges, which include the circular genome, the overlapping open reading frames, and the different genome lengths of the genotypes. Thus, bioinformatics tools and databases have been developed to facilitate the study of HBV.

**Keywords:** alignments, computation, databases, genotypes, phylogenetics

### **1. Introduction**

Primarily, bioinformatics is the use of computational science to study biological and clinical data using statistics, mathematics, and information theory. This field is developing and evolving; thus, the definition cannot be precise. Moreover, the field is broad, ranging from the study of DNA and proteins, to structural biology, drug design and comparative genomics, transcriptomics, proteomics, and metagenomics. The optimization of computational technology is paramount in order to handle, store, manage, and analyze the large volumes of data generated in the last decade.

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The data include molecular sequencing data of host and pathogen genomes and their associations with demographic and clinical records, laboratory test results, as well as information on treatment. Moreover, bioinformatics can aid in the investigation of virus–host genome and environmental interactions and in the identification of both host and viral biomarkers. Such analysis can lead to a better understanding of the clinical manifestations of disease and the effective design of preventative and treatment measures [1].

In the first section, we describe the unique genomics and molecular biology of hepatitis B virus (HBV). Using illustrative examples, we show how bioinformatics analyses can facilitate the understanding of the origin, evolution, transmission, and response to antiviral agents of HBV. Next, we describe the bioinformatics challenges posed by HBV and present the public databases and tools currently available for the study of HBV.

### **2. Hepatitis B virus**

### **2.1. Hepatitis**

Hepatitis refers to the inflammation of the liver. A major cause of hepatitis is the hepatotropic virus, HBV. HBV infection is a public health problem of worldwide importance. Globally, 2 billion people have been exposed to this virus at some stage of their lives, and 240 million are chronic carriers of the virus [2].

This infection can lead to a spectrum of clinical consequences. In the majority of cases, the infection is subclinical and transient, whereas in 25% of cases it causes self-limited acute hepatitis, and in 1% of these it progresses to acute liver failure. The virus can persist in 90% of neonates and 5–10% of adults, leading to chronic infection that can progress to either chronic hepatitis or an asymptomatic carrier state. Both of these states can ultimately progress to liver cancer or hepatocellular carcinoma (HCC), with or without an intermediate cirrhotic stage. Annually, more than 786,000 people die as a result of these clinical manifestations of HBV infection [3].

### **2.2. Prevalence**

The prevalence of HBV in a community can be estimated by the proportion of the population, who are hepatitis B surface antigen (HBsAg)-positive carriers. HBV prevalence varies widely in the world [3]. The prevalence is low (<1%) in northern Europe, Australia, New Zealand, Canada, and the United States of America. Northern Asia, the Indian subcontinent, parts of Africa, Eastern and south-eastern Europe, and parts of Latin America are areas of intermediate prevalence (1–5%). The high prevalence areas (5–20%) include East and Southeast Asia, the Pacific Islands, and sub-Saharan Africa.
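The prevalence bands above can be encoded as a small classifier. This is a hedged sketch: the function name `endemicity` is hypothetical, and the handling of the exact 1% and 5% boundaries is an assumption, since the text gives only the ranges.

```python
def endemicity(hbsag_prevalence_percent: float) -> str:
    """Classify HBsAg carrier prevalence into the endemicity bands used
    in the text: <1% low, 1-5% intermediate, 5-20% high.
    Boundary handling at exactly 1% and 5% is an assumption."""
    p = hbsag_prevalence_percent
    if p < 1:
        return "low"
    if p <= 5:
        return "intermediate"
    return "high"

# Illustrative values from the regions described in the text
print(endemicity(0.5))   # e.g. northern Europe -> low
print(endemicity(3.0))   # e.g. Indian subcontinent -> intermediate
print(endemicity(10.0))  # e.g. sub-Saharan Africa -> high
```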

### **2.3. Classification and structure**

HBV, the prototype member of the family *Hepadnaviridae*, belongs to the genus *Orthohepadnavirus*. With a diameter of 42 nm and a DNA genome of ~3.2 kilobases (kb), it is the smallest DNA virus infecting man. The genome is circular and partially double stranded. One DNA strand is complete, except for a small nick (the minus strand), and the other is short and incomplete (the plus strand). The minus strand contains four overlapping open reading frames (ORFs; **Figure 1**) [4] that represent: (1) the *preS/S* gene that codes for the envelope proteins, large, middle, and small HBsAgs; (2) the *P* gene for DNA polymerase/reverse transcriptase (POL); (3) the *X* gene for the X protein, a key regulator during the natural infection process, which has transcriptional trans-activation activity and is required to initiate and maintain HBV replication [5]; and (4) the *precore/core* gene that codes for the HBcAg or core protein that forms the capsid and for an additional protein known as HBeAg, which is not incorporated into the virus itself but is expressed on the liver cells and secreted into the serum. **Figure 2** illustrates the structure of the hepatitis B virion.
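Because the genome is circular, ORF coordinates can wrap past the numbering origin, which is one of the bioinformatic challenges noted in the abstract. A minimal sketch of the wrap-around arithmetic follows; the coordinates and helper names (`orf_positions`, `overlap`) are illustrative, not the authoritative HBV annotation.

```python
# Sketch: ORF coordinates on a circular genome. Coordinates below are
# illustrative placeholders; only the wrap-around arithmetic is the point.

GENOME_LEN = 3215  # a genotype B/C-style length; genotype D is 3182

def orf_positions(start, end, genome_len=GENOME_LEN):
    """0-based positions covered by an ORF [start, end), allowing the
    ORF to wrap past the origin of the circular genome."""
    if start <= end:
        return set(range(start, end))
    # ORF crosses the numbering origin: take both segments
    return set(range(start, genome_len)) | set(range(0, end))

def overlap(orf_a, orf_b):
    """Number of nucleotide positions shared by two ORFs."""
    return len(orf_positions(*orf_a) & orf_positions(*orf_b))

# Hypothetical coordinates: one ORF wrapping the origin, one not.
core = (1814, 2458)        # does not wrap
pol = (2307, 1625)         # wraps past position 0
print(overlap(core, pol))  # -> 151 shared positions
```

A naive linear-interval intersection would miss the second segment of the wrapping ORF entirely, which is why tools built for HBV handle circularity explicitly.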


**Figure 1.** The genome of hepatitis B virus (HBV). The partially double-stranded DNA (dsDNA) with the complete minus (−) strand and the incomplete plus (+) strand. The four open reading frames (ORFs) are shown: *precore/core (preC/C)*, which encodes the e antigen (HBeAg) and core protein (HBcAg); *P* for polymerase (reverse transcriptase); *PreS1/PreS2/S* for the surface proteins [three forms of HBsAg: small (S), middle (M), and large (L)]; and *X* for a transcriptional trans-activator protein.

**Figure 2.** Schematic representation of hepatitis B virus (HBV), showing the structure of the virion, composed of a partially double-stranded DNA genome, enclosed in a capsid comprised of HBcAg and surrounded by a lipid envelope containing large (L)-HBsAg, middle (M)-HBsAg, and small (S)-HBsAg. The virus also expresses two non-particulate proteins, the X protein and HBeAg.

### **2.4. Regulatory elements of HBV**

Every single nucleotide of the HBV genome is necessary for the translation of a protein and may also be part of one of the regulatory elements of HBV, which overlap with the protein-coding regions. The regulatory elements include the S1 and S2 promoters, which overlap both the preS region and polymerase ORFs; the preC/pregenomic promoter, which includes the basic core promoter (BCP) and overlaps the X and preC ORFs; and the X promoter. There are two enhancers (enhancer I and enhancer II) as well as *cis*-acting negative regulatory elements (URR: upper regulatory region; CURS: core upstream regulatory sequence; NRE: negative regulatory element). These regulatory elements control transcription (reviewed in [6, 7]).

### **2.5. Replication of HBV**

HBV and other members of the family *Hepadnaviridae* have an unusual replication cycle. These DNA viruses replicate by reverse transcription of an RNA intermediate known as the pregenomic RNA (pgRNA) [8]. Entry into the cell is via the sodium taurocholate cotransporting polypeptide (NTCP), a multiple-transmembrane transporter predominantly expressed in the liver [9]. After entry, the virion is uncoated and the core particle is actively transported to the nucleus [10], where the partially double-stranded relaxed circular DNA molecule is released. The single-stranded gap is closed by the viral polymerase to yield a covalently closed circular molecule of DNA (cccDNA) [11], which is the template for transcription by the host RNA polymerase II [12]. The mRNAs are transported into the cytoplasm, where they are translated into the seven viral proteins. In addition to being translated into the polymerase and the core protein, the pgRNA is packaged into immature core particles by the process known as encapsidation. In order to be encapsidated, the 5′ end of the pgRNA has to be folded into a particular secondary structure known as the encapsidation signal (ε) [13].

The encapsidation signal (ε) is a bipartite stem-loop structure, consisting of an upper and lower stem, a bulge, and an apical loop. Besides encapsidation, ε has a number of other functions (reviewed in [13] and references therein). It acts in template restriction, ensuring that not just any piece of RNA is encapsidated, and it also plays a role in the activation of the viral polymerase, so that there is no indiscriminate reverse transcription. It is also involved in the initiation of reverse transcription: the polymerase (reverse transcriptase) primes RNA-directed DNA synthesis by binding to the bulge of ε. The first three nucleotides of the negative strand of DNA are synthesized at the bulge and are transferred to an acceptor site at the 3′ end of the pgRNA, where DNA synthesis proceeds toward the 5′ end of the pgRNA [14], giving rise to the immature virion. The virus matures by acquiring its glycoprotein envelope, containing HBsAg, in the endoplasmic reticulum and is exported from the cell by vesicular transport [15].
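The stems of a structure such as ε form by antiparallel base pairing between two arms of the RNA. As a minimal sketch, the check below tests whether two arms can pair, allowing G–U wobble pairs; the sequences and the helper name `can_form_stem` are invented for illustration, and real ε structure prediction would use a thermodynamic folding tool rather than this complementarity test.

```python
# Sketch: can two RNA segments pair to form the stem of a stem-loop?
# Watson-Crick pairs plus G-U wobble pairs, as commonly allowed in RNA.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
         ("G", "U"), ("U", "G")}

def can_form_stem(five_prime_arm: str, three_prime_arm: str) -> bool:
    """True if the arms can base-pair in antiparallel orientation."""
    if len(five_prime_arm) != len(three_prime_arm):
        return False
    return all((a, b) in PAIRS
               for a, b in zip(five_prime_arm, reversed(three_prime_arm)))

print(can_form_stem("GGCAU", "AUGCC"))  # True: fully complementary
print(can_form_stem("GGCAU", "AAGCC"))  # False: A cannot pair with A
```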

### **2.6. Genotypes and subgenotypes of HBV**


Sequence heterogeneity is a feature of HBV because, as mentioned above, the viral-encoded polymerase lacks proof-reading ability [16]. On the basis of phylogenetic analysis of the complete genome and an intergroup divergence of greater than 7.5%, HBV has been classified into nine genotypes, A to I [17, 18, 19], with a putative 10th genotype, "J," isolated from a single individual [20]. With between ~4 and ~8% intergroup nucleotide difference across the complete genome and good bootstrap support, genotypes A–D, F, H, and I are classified further into at least 35 subgenotypes [21]. The genotypes differ in genome length, the size of ORFs and the proteins translated [17], as well as in the development of various mutations [22]. Generally, the genotypes, and in some cases the subgenotypes, have a distinct geographic distribution (**Table 1**).




| Genotype | Length (nt) | Differentiating features | Subgenotypes | Geographic distribution | Serological subtype | Transmission route |
|---|---|---|---|---|---|---|
| **B** | 3215 | | B1 | Japan | *adw*2 | Perinatal |
| | | | B2 | China | *adw*2 | |
| | | | Quasi-subgenotype B3 (B3, B5, B7–B9, B6 (China)§) | Indonesia/Philippines/China | *adw*2 | |
| | | | B4 | Vietnam/Cambodia/France | *ayw*1/*adw*2 | |
| | | | B5 (B6)§ | Eskimos/Inuits | *adw*2 | |
| **C** | 3215 | | C1 | Thailand/Myanmar/Vietnam | *adr* | Perinatal |
| | | | Quasi-subgenotype C2 (C2, C14, undefined sequences)§ | Japan/China/Korea | *adr* | |
| | | | C3 | New Caledonia/Polynesia | *adr* | |
| | | | C4 | Australian Aborigines | | |
| | | | C5 | Philippines/Indonesia | | |
| | | | C6–C12 | Indonesia/Philippines | | |
| | | | C13–C15 | Indonesia | *adr* | |
| | | | C16 | Indonesia | *ayr*\* | |
| **D** | 3182 | 33-nucleotide deletion at the amino terminus of the preS1 region | D1 | Middle East, Central Asia | *ayw*2 | Horizontal: parenteral, with intravenous drug use being a risk factor |
| | | | D2 | Europe/Japan/Lebanon | *ayw*3 | |
| | | | D3 | Worldwide | *ayw*2/*ayw*3 | |
| | | | D4 | Australian aborigines, Micronesians, Papua New Guineans, Arctic Denes | | |
| | | | D5 | India | *ayw*3/*ayw*2 | |

**Table 1.** Genome length, differentiating features, subgenotypes, geographic distribution, serological subtypes, and transmission routes of HBV genotypes.¶

¶Summarizes data compiled from Kramvis [21] and references cited therein.

§Earlier subgenotype designation.

\*Rare serological subtype for that genotype.
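The divergence thresholds used for genotyping can be illustrated with a simple p-distance calculation on aligned sequences. This is a sketch under stated assumptions: the toy 20-nt "genomes", the gap handling, and the helper names `p_distance` and `same_genotype` are illustrative, and real genotyping rests on full phylogenetic analysis of complete genomes, not a single pairwise distance.

```python
# Sketch: the >7.5% complete-genome divergence criterion for separating
# HBV genotypes, applied as a pairwise p-distance on aligned sequences.

def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing positions between two aligned sequences,
    ignoring alignment gaps ('-')."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    diffs = sum(1 for a, b in pairs if a != b)
    return diffs / len(pairs)

def same_genotype(seq_a: str, seq_b: str, threshold: float = 0.075) -> bool:
    """Apply the >7.5% intergroup divergence rule from the text."""
    return p_distance(seq_a, seq_b) <= threshold

a = "ATGCATGCATGCATGCATGC"  # 20-nt toy "genomes"
b = "ATGCATGCATGCATGCATGA"  # 1 difference  -> 5% divergence
c = "TTGCTTGCTTGCTTGCATGC"  # 4 differences -> 20% divergence
print(p_distance(a, b), same_genotype(a, b))  # 0.05 True
print(p_distance(a, c), same_genotype(a, c))  # 0.2 False
```

Note that the circular genome and the differing genome lengths of the genotypes (3182 vs. 3215 nt above) mean the sequences must first be aligned consistently before any such distance is meaningful.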
