Role of Force Fields in Protein Function Prediction

*Zaved Hazarika, Sanchaita Rajkhowa and Anupam Nath Jha*

## **Abstract**

Although the world has developed elaborate health systems to guard against known and unknown diseases, it continues to be challenged by new, emerging and re-emerging infectious disease threats of varying severity and unpredictable fluctuation. These threats carry varying costs in morbidity and mortality, as well as a complex set of socio-economic outcomes. Some of these diseases are caused by pathogens that use humans as hosts. In such cases, it becomes paramount to identify what sustains pathogen survival in order to stop their population growth. Genome sequencing has been refined so much in the 21st century that the complete genome of any pathogen can be sequenced in a matter of days, after which potential drug targets need to be identified. Structure modeling of the selected sequences is an initial step in structure-based drug design (SBDD). Dynamical study of the predicted models provides a stable target structure. The results of these *in silico* techniques depend greatly on the force field (FF) parameters used. In this chapter, we therefore discuss the role of FF parameters in protein structure prediction and molecular dynamics simulation, and provide a brief overview of this area.

**Keywords:** homology modeling, force field (FF), molecular dynamics (MD) simulations, molecular docking

## **1. Introduction**

What is a "disease"? A disease is any condition that impairs the normal function of a body organ and/or system, of the psyche, or of the organism as a whole, and that is associated with specific signs and symptoms. The factors that damage the function of organs and/or systems are of two types: intrinsic and extrinsic. Intrinsic factors arise from within the host body, as a result of the genetic features of the organism or of a disorder within the host, and interfere with the normal functioning of a body organ and/or system [1]. Huntington's disease is an example of a genetic disease: a progressive brain disorder that causes uncontrolled movements, emotional problems and loss of thinking ability (cognition), and that results from mutations in the *HTT* gene involving a DNA segment known as a CAG trinucleotide repeat [2]. Extrinsic factors, by contrast, act on the host's system when the host comes into contact with a pathogen from outside [3]. Microorganisms are the main causative agents of infectious diseases. The importance of these diseases is determined by the type and extent of damage that their causative agents inflict on organs and/or systems once they enter a host. Entry into the host is mostly by routes such as the mouth, eyes, genital openings, nose and skin. Damage to tissues mainly results from the growth and metabolic processes of infectious agents, intracellularly or within body fluids, with the production and release of toxins or enzymes that interfere with the normal functioning of organs and/or systems [4]. An example of an extrinsic factor is infection by a novel pathogen such as SARS-CoV-2, whose control represents an extremely challenging and complex endeavor. Currently, several promising therapeutics are under development, and many vaccine candidates that promise to mitigate the catastrophic effects of the COVID-19 pandemic are in clinical trials. Still, an effective and successful countermeasure to control this catastrophe is not available [5].

In December 2019, a pneumonia of unknown etiology was reported from the city of Wuhan in Hubei province, China [6]. Isolation of the virus and genomic characterization of its complete sequence using next-generation sequencing (NGS) identified it as a novel coronavirus (CoV), which was named 2019-nCoV and is now known as SARS-CoV-2 [7]. Although characterization of the complete sequence was completed in January 2020, to date there is no definitive cure or vaccine available for this virus. With the availability of the sequence, the three-dimensional (3D) structures of many SARS-CoV-2 proteins are now available. Such 3D structures can be obtained using various experimental and computational techniques. X-ray crystallography and NMR spectroscopy are currently the two major experimental techniques for protein structure determination [8], and the resulting structures are deposited in both UniProt and the Protein Data Bank (PDB) [9]. For computational modeling of the 3D structure of proteins, homology modeling is used. Homology modeling is a computational technique that uses the amino acid sequence to predict the 3D structure, and it is one of the most widely used computational structure prediction methods.

Proteins are among the most extensively studied and complex macromolecules in living organisms, each with a unique 3D structure. This leads to diversity in their spatial shape and structure, and thus to different biological functions in a living system [8]. Yet very little is known about the folding process by which a protein reaches its specific tertiary structure from its primary structure. To date, approximately 175,000 experimentally determined 3D structures of biological macromolecules are available in the PDB [9]. However, the reference sequence (RefSeq) release of the National Center for Biotechnology Information (NCBI) contains as many as 178,304,046 protein sequences. This signifies a huge difference between the number of sequences in the NCBI and the number of protein 3D structures in the PDB. The real difference is even larger, because the reference sequences in the NCBI are non-redundant, whereas the structures in the PDB contain redundancy. The result is an alarming and increasing gap between the available 3D structures and the protein sequences. Computational structure prediction methods such as homology modeling are therefore much needed to cover this widening gap. This chapter discusses homology modeling in a holistic manner, covering the principles and different types of structure prediction methods and giving a flavor of the different force field (FF) parameters used in protein structure prediction. The chapter also includes a brief overview of the molecular dynamics (MD) simulations used in computational modeling of proteins, along with a discussion of some application examples in this field.
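To put the sequence-structure gap described above into numbers, the short Python sketch below simply divides the two counts quoted in the text. The variable names are illustrative only, the PDB figure is the approximate value given above, and redundancy within the PDB is not corrected for, so the result should be read as a rough lower bound on the gap rather than a precise statistic.

```python
# Counts quoted in the text; the PDB figure is approximate and
# includes redundant entries, so the true gap is larger.
refseq_protein_sequences = 178_304_046
pdb_structures = 175_000

# How many sequences exist for every experimentally determined structure.
sequences_per_structure = refseq_protein_sequences / pdb_structures

# Share of sequences that could have a structure if every PDB entry
# mapped to a distinct RefSeq sequence (an optimistic assumption).
structure_coverage_percent = 100 * pdb_structures / refseq_protein_sequences

print(f"Sequences per structure: {sequences_per_structure:,.0f}")
print(f"Upper bound on structural coverage: {structure_coverage_percent:.3f}%")
```

With these figures there are on the order of a thousand sequences for every deposited structure, which is the gap that computational structure prediction is meant to narrow.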

## **2. Protein structure prediction**

Protein sequences are much easier to obtain than protein structures, owing to advances in protein sequencing technology. As a result, the accumulation of protein sequences has grown exponentially. An amino acid sequence is a very important source of insight into a protein's function, structure and history. First, comparing an unknown sequence with a known sequence helps decide whether significant similarities exist between them, which in turn helps establish the class of the protein and can give valuable information about its structure and function. Secondly, genealogical relationships can be studied by comparing the sequences of the same protein from different species. Thirdly, the presence of internal repeats in a protein sequence reveals the history of the protein. In addition, amino acid sequencing is very important for making DNA probes specific for the gene that encodes the protein, as knowledge of the primary structure also allows the use of reverse genetics [10].

**2.1 Amino acid sequence determination techniques**

Determination of the amino acid sequence of all or part of a protein or peptide is known as protein sequencing. It is used to categorize the protein and may help in characterizing its post-translational modifications. Determining the amino acid sequence of a protein involves the following steps [10]:

i. *Hydrolysis*: The protein is hydrolyzed into its constituent amino acids by heating it in 6 M hydrochloric acid (HCl) at 100–110 °C for 24 hours or longer.

ii. *Separation*: The amino acids released from the peptide can be separated by ion-exchange chromatography. The amino acids are loaded in an acidic solution onto a column of sulfonated polystyrene and eluted by passing a buffer of steadily increasing pH through the column. Each amino acid is separated when it reaches its isoelectric point, so the buffer at which it elutes is correlated with the amino acid type. Thus, the amino acid with the most acidic side chain emerges first, while the amino acid with the most basic side chain emerges last. The absorbance is used to determine the amount of each type of amino acid residue.

iii. *Quantitation*: Once separated, the amino acids are quantified by adding the reagent ninhydrin, which gives an intense blue color with amino acids, except proline, which gives a yellow color because of the secondary amino group in its structure. For very small quantities (nanograms), reagents such as fluorescamine or ortho-phthalaldehyde (OPA) are used to obtain fluorescent products. The concentration of an amino acid is therefore directly proportional to either the absorbance of the resulting solution or the fluorescence emitted by the sample.

For determining the composition and the sequence of the protein, two direct methods can be used:

a. *Edman Degradation Method*: This method uses phenyl isothiocyanate to cleave the amino acids one by one, starting from the amino terminus. On treatment with phenyl isothiocyanate, the terminal residue forms a phenylthiohydantoin (PTH)-amino acid (e.g., PTH-lysine), which is released under mild acidic conditions. The released terminal compound is then identified using chromatographic procedures.

b. *Mass Spectrometry*: Another technique for determining protein sequence is mass spectrometry, which uses the time of flight of ionized proteins to calculate their mass. In this process, the protein is cleaved using specific enzymes. The ionized amino acids are triggered by a laser beam which

*Role of Force Fields in Protein Function Prediction DOI: http://dx.doi.org/10.5772/intechopen.93901*

*Homology Molecular Modeling - Perspectives and Applications*

