Introductory Chapter: Homology Modeling

*Rafael Trindade Maia, Magnólia de Araújo Campos and Rômulo Maciel de Moraes Filho*

## **1. Introduction**

Proteins are macromolecules present in all living organisms, where they adopt a wide variety of structures and perform complex and diverse functions. They are polymers of amino acids, also called polypeptides, synthesized in the cells of living organisms. Determining the three-dimensional structure of a protein is crucial for understanding its function. However, experimental techniques for structural elucidation, such as X-ray crystallography and nuclear magnetic resonance (NMR), are complicated and expensive [1]. In this context, computational techniques for building structural models are a useful and viable alternative in many situations. Among these techniques, homology modeling, also known as comparative modeling, is the most widely used *in silico* tool for obtaining structural protein models, and it often achieves excellent results [2].

Proteins are organized at four levels of structural complexity: 1) primary structure; 2) secondary structure; 3) tertiary structure; 4) quaternary structure (**Figure 1**). The primary structure of a protein is the linear sequence of the amino acids that compose it, with one end carrying the free amino group of the first amino acid in the chain (N-terminal) and the other end carrying the free carboxyl group of the last amino acid in the chain (C-terminal). The primary structure can be represented by a string of letters, one for each amino acid. The secondary structure is determined by the primary sequence, which governs how the monomers (amino acids) arrange themselves relative to one another and to the solvent, forming three groups of standard structures: turns, α-helices and β-sheets. The way in which these secondary structures are organized in three-dimensional space is called the tertiary structure, which is associated with the biological function of the molecule. In multimeric protein complexes (dimers, trimers, tetramers, etc.) there is also a quaternary structure, the oligomeric state formed by the association of these folded chains.
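As a small illustration of the letter representation of the primary structure, the sketch below converts a chain written with three-letter residue names into the one-letter string, read from the N-terminal to the C-terminal. The function name `to_one_letter` and the example peptide are hypothetical; only the 20 standard amino acids are covered.

```python
# Standard three-letter to one-letter amino acid codes (20 standard residues).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues):
    """Join three-letter residue names (N- to C-terminal) into a one-letter string.

    Raises KeyError for names outside the 20 standard amino acids.
    """
    return "".join(THREE_TO_ONE[name.capitalize()] for name in residues)


# Hypothetical tetrapeptide, written from the N-terminal to the C-terminal.
peptide = ["Met", "Lys", "Thr", "Ala"]
primary_structure = to_one_letter(peptide)
print(primary_structure)
```

Strings of this kind are the input that alignment and modeling tools generally expect, usually stored in FASTA format.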

#### **Figure 1.**

*Illustrative scheme for the structural complexity levels of proteins. Source: Google images.*

There are three types of computational modeling for predicting protein structures: *ab initio*/*de novo* modeling, *threading* and homology modeling. Homology modeling is based on the premise that the three-dimensional structure of a protein tends to be much more conserved than its primary structure. Changes in the sequence therefore do not always change the structural domains of a protein, which can thus maintain its original function. Proteins from the same functional family are assumed to maintain their structural domains, and this is what allows comparative modeling (by homology). If two proteins are homologous, they belong to the same genetic and functional family and, hypothetically, share the same structural motifs. When a specific protein does not have an elucidated three-dimensional structure but is homologous to a protein with a solved structure, a three-dimensional model for its sequence can be built using the known structure as a template. As a rule, a minimum identity of 25% between the amino acids of the two proteins is sufficient for building models by homology. Sequence identities above 40% generally provide good models, while those above 50% tend to provide excellent theoretical structures [3].
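The identity thresholds above can be turned into a quick check. The following is a minimal sketch that assumes the two sequences are already aligned (equal length, gaps written as `-`) and that identity is computed over the columns where neither sequence has a gap. Alignment programs may use a different denominator, so values can differ slightly from those reported by a given tool. The function names and the example sequences are hypothetical.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent of identical residues over the columns without gaps."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # gap columns are not counted (assumption of this sketch)
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def expected_model_quality(identity: float) -> str:
    """Map an identity percentage onto the rule-of-thumb ranges quoted in the text."""
    if identity > 50:
        return "excellent"
    if identity > 40:
        return "good"
    if identity >= 25:
        return "feasible"
    return "below the usual minimum"


# Hypothetical aligned pair: 8 identical residues out of 10 compared columns.
identity = percent_identity("MKTAYIAKQR", "MKSAYLAKQR")
print(identity, expected_model_quality(identity))
```

The labels returned by `expected_model_quality` are only shorthand for the ranges cited above. They do not replace validation of the final model.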

However, in addition to the identity and similarity between the amino acids, other parameters must be considered when choosing a good template, such as the resolution, in angstroms, of the crystallographic structure and the percentage of alignment coverage (**Figure 2**). The lower the resolution value of a structure, the better its quality. The average resolution of the structures available in the PDB (Protein Data Bank) is around 3.5 Å, while structures below 2.0 Å are considered to have excellent resolution and represent less than 10% of the entries in the PDB. The higher the percentage of coverage of the alignment between the target protein (the protein to be modeled) and the template (mold), the better [4]. Alignments covering more than 90% of the residues tend to have high scores and are considered excellent (**Figure 2**).
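The selection criteria above (identity, coverage and resolution) can be combined into a simple ranking. This is only a sketch of one possible ordering, not a prescribed protocol: candidates below the 25% identity minimum are discarded, and the rest are sorted by identity, then coverage, then resolution (lower values first). The `TemplateHit` class, the `rank_templates` function and all candidate values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateHit:
    name: str          # label of the candidate template (hypothetical)
    identity: float    # sequence identity to the target, in percent
    coverage: float    # alignment coverage of the target, in percent
    resolution: float  # crystallographic resolution, in angstroms


def rank_templates(hits, min_identity=25.0):
    """Drop hits below the identity minimum and sort the rest, best first."""
    usable = [h for h in hits if h.identity >= min_identity]
    # Higher identity and coverage are better; a lower resolution value is better.
    return sorted(usable, key=lambda h: (-h.identity, -h.coverage, h.resolution))


# Invented candidates, for illustration only.
candidates = [
    TemplateHit("hit_b", identity=62.0, coverage=95.0, resolution=3.2),
    TemplateHit("hit_d", identity=21.0, coverage=100.0, resolution=1.2),
    TemplateHit("hit_a", identity=62.0, coverage=95.0, resolution=1.8),
    TemplateHit("hit_c", identity=48.0, coverage=99.0, resolution=1.5),
]

for hit in rank_templates(candidates):
    print(hit.name, hit.identity, hit.coverage, hit.resolution)
```

Sorting by identity first is a design choice of this sketch. A researcher may reasonably weight coverage or resolution more heavily depending on the target.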

Something important to note in alignments is the presence of sequence gaps. A gap means the absence of residues at that position in one of the sequences, that is, amino acids that have been deleted from some part of the sequence (**Figure 3**). The number and size of the gaps in an alignment are crucial to the final quality of the models. The more numerous and longer the gaps, the less reliable the models and the greater the chance of generating structural artifacts. Therefore, when choosing a template, it is essential that the researcher be aware of the presence of gaps in the alignment.

#### **Figure 2.**

*Example of BLASTp alignment between a* Leishmania infantum *ATP-synthase sequence against the PDB database. Values of the coverage percentage (red) and identity (black) of each alignment are highlighted. Source: Authors' data.*

#### **Figure 3.**

*Alignment between two proteins (query/Sbjct) showing the presence of 8 gaps (red) in three different sections (green). Source: Authors' data.*

Once the template has been defined, we proceed to the stage of building the three-dimensional model. The necessary files are submitted to specific programs or servers, and the modeling consists of superimposing the structural carbons of the target protein on those of the template protein, using the alignment information to match the equivalent amino acids. There are currently numerous free tools for building three-dimensional models (**Table 1**).

| Name | Type | Website |
| --- | --- | --- |
| CONFOLD | Software | https://github.com/multicom-toolbox/CONFOLD |
| Modeller | Software | https://salilab.org/modeler/ |
| Swiss-Model | Server | https://swissmodel.expasy.org/ |
| Phyre2 | Server | http://www.sbg.bio.ic.ac.uk/phyre2 |
| Galaxy | Server | http://galaxy.seoklab.org/ |
| RaptorX | Software/Server | http://raptorx.uchicago.edu/ |
| Robetta | Server | http://robetta.bakerlab.org/ |

#### **Table 1.**

*Examples of free tools for building homology models. Source: Google search.*
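To make the role of the alignment in model building more concrete, the sketch below shows how an alignment defines which target residue is equivalent to which template residue, and how the gaps in each sequence can be counted. It assumes a pairwise alignment given as two equal-length strings with gaps written as `-`. The function names and the example alignment are hypothetical and are not taken from any of the tools in Table 1.

```python
def equivalent_positions(aligned_target: str, aligned_template: str):
    """Return (target_index, template_index) pairs for columns without gaps.

    Indices are 0-based positions in the ungapped sequences.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = []
    t_idx = 0  # position in the ungapped target sequence
    p_idx = 0  # position in the ungapped template sequence
    for a, b in zip(aligned_target, aligned_template):
        if a != "-" and b != "-":
            pairs.append((t_idx, p_idx))
        if a != "-":
            t_idx += 1
        if b != "-":
            p_idx += 1
    return pairs


def gap_summary(aligned_seq: str):
    """Return (number of gap characters, number of contiguous gap segments)."""
    total = 0
    segments = 0
    in_gap = False
    for ch in aligned_seq:
        if ch == "-":
            total += 1
            if not in_gap:
                segments += 1  # a new run of gaps starts here
            in_gap = True
        else:
            in_gap = False
    return total, segments


# Hypothetical alignment: one gap segment in each sequence.
target = "MKT--AYIA"
template = "MKTLLAY-A"
print(equivalent_positions(target, template))
print(gap_summary(target), gap_summary(template))
```

Target residues that fall opposite a gap in the template have no equivalent residue to copy coordinates from, which is one reason long or numerous gaps reduce model reliability.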

#### *Introductory Chapter: Homology Modeling DOI: http://dx.doi.org/10.5772/intechopen.95446*



*Homology Molecular Modeling - Perspectives and Applications*

