strong conservation, lower-case letters conservative changes, and a plus sign a positive local score. The lower line, absent in HMMer 2, is the encoded posterior probability (d = 0...9, \*; \* equals 9.5), where the approximate value of the posterior probability (*pp*) for each site is given by equation (1).

$$pp \approx 0.1\,d + 0.025 \tag{1}$$

The final hit in the search in figure 4 is the M artificial protein. This protein was designed with Rosetta-design using the same Top7 fold. Its sequence is different, but it belongs to the same family of Rd proteins. The score is smaller than for Top7 (ratio of 32/(88-7), or 0.395), but still above 0.3 and with high statistical significance. Although Rd was used to design these proteins, the concordance reveals the robustness of the amino acid assignment made by Rd, and gives further support to the structurally aware nature of the Rd.HMM alignments.

The alignment is very useful for protein modeling, because it reveals the distribution of coincident regions between the 3D-atomic coordinates of the backbone and the amino acid sequence in the database. The following features are to be taken into account:

a. Frame shift. The residue numbering in the 3D-structure has an offset relative to the amino acid numbering in the sequence, either from the beginning or starting at some intermediate site. This is usually a sign of a wrong threading of the model onto the template during the modeling step. In the example, there is a difference in amino acid numbering, but this is not a frame shift: the first residue solved in the PDB is ASP-3, which corresponds to node one in the HMM; the first 3 HMM nodes did not match the Top7 sequence and were discarded by the HMMer search, making the first match residue 6, at HMM node 4.

b. Insertion/deletions. An insertion in the sequence appears as a dot in the HMMer consensus, a deletion as dashes in the sequence found. Such changes are expected if the sequence is a homologue, and not the natural sequence that corresponds to the 3D-structure analyzed with Rd.HMM. They may also occur when the PDB file has some missing amino acids (this happens frequently, due to experimental limitations of X-ray crystallography). If so, these insertions are expected to match the missing amino acids. For *in silico* modeled structures, this means a local threading error or a local defect in the model.

c. Distribution of conserved sites. The higher the number of conserved sites, the better the model. However, some strained conformations have lower energy for glycine, proline and asparagine than for every other amino acid, and these residues tend to appear as strongly conserved (uppercase letters in the mask line and in the Rd.HMM consensus). If the observed sequence conservation is dominated by these residues, your model may be wrong, even if your score has statistical significance.

**9. Guiding the 3D-modeling of proteins with Rd-HMMer**

There are many publications describing different approaches to solving the protein folding problem (Roy et al., 2010; Kim et al., 2004; Karplus, 2009; Melo & Feytmans, 1998), but most of them focus on the theory or present a technical treatment. Fiser and Sali published a practical guide to the use of the popular modeling software MODELLER (Fiser & Sali, 2003), where many useful hints are given. Recently, Chavelas-Adame and coworkers published a guide with emphasis on the use of open software [45]. The present account will not attempt to repeat that work, and only the most important conclusions are given here:


scores, statistical significance and the alignment may guide your template selection. However, if the Rd.HMM of a template candidate gives a negative score and you still decide to use it, do not trust the Rd.HMM alignment without further improvement using other tools, as it may be seriously flawed.
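When an alignment has to be inspected site by site before it is trusted, the posterior-probability line of the HMMer 3 output is a natural place to look. The following is a minimal sketch of how that line could be decoded with equation (1), pp ≈ 0.1 d + 0.025, with \* taken as d = 9.5. The function names, the 0.5 cutoff and the example string are hypothetical, and treating a dot as a gap column without a probability is an assumption about the output format rather than something stated in this chapter.

```python
from typing import List, Optional


def decode_pp_char(ch: str) -> Optional[float]:
    """Approximate posterior probability for one character of the pp line."""
    if ch == ".":
        # Assumption: a dot marks a gap column, so no probability is reported.
        return None
    if ch == "*":
        d = 9.5  # the text states that '*' equals 9.5
    elif len(ch) == 1 and ch in "0123456789":
        d = float(ch)
    else:
        raise ValueError(f"unexpected pp character: {ch!r}")
    return 0.1 * d + 0.025  # equation (1)


def decode_pp_line(pp_line: str) -> List[Optional[float]]:
    """Decode a whole pp line, one value (or None) per alignment column."""
    return [decode_pp_char(ch) for ch in pp_line]


def low_confidence_sites(pp_line: str, cutoff: float = 0.5) -> List[int]:
    """0-based columns whose decoded probability falls below the cutoff.

    The cutoff is a hypothetical default, not a value from the chapter.
    """
    return [
        i
        for i, p in enumerate(decode_pp_line(pp_line))
        if p is not None and p < cutoff
    ]


if __name__ == "__main__":
    example = "689**9.42*"  # made-up pp line, for illustration only
    print([None if p is None else round(p, 3) for p in decode_pp_line(example)])
    print(low_confidence_sites(example))
```

Columns flagged this way are candidates for a closer look with other tools; the decoded values are only as approximate as equation (1) itself.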

As an example of the advantages of Rd.HMM, we refer to two cases of recent success. Brindis and coworkers (Brindis et al., 2011) analyzed the effects of a natural product on α-glucosidase. This work reports a model for the budding yeast α-glucosidase used to analyze the molecular basis of the inhibitory action of (Z)-3-butylidenephthalide. In the preparation of the model, Rd.HMM allowed the detection of a threading problem (an insertion/deletion pair) in one β-strand in the core of the model. Although the sheet had slid only a few Å from its position, the contact with neighboring strands completely distorted the chemical interaction network affecting the model stability. The correction of this problem and the use of molecular dynamics simulations led to a well refined and reliable model with a good Rd.HMM score. A few months later (when the paper was in press) the 3D-structure of a close homologue (isomaltase) was released (Yamamoto et al., 2010). The X-ray data corroborated the model quality, as the model core backbone has an rmsd of 1.81 Å from the experimental data. Figure 5 shows a superposition of both structures colored by backbone rmsd, from blue (low) to white (medium) to red (high).

**Figure 5.** Comparison of the yeast α-glucosidase model included in the publication by Brindis et al. (Brindis et al., 2011) with the X-ray structure of its homologue, the yeast isomaltase (Yamamoto et al., 2010). The isomaltase is shown as blue cartoons and the α-glucosidase cartoons are colored according to the amino acid rmsd from isomaltase, ranging from very low (blue) to intermediate (white) to high (red).

In a second example, the 3D-structures of two isoforms of plant inorganic pyrophosphatases were obtained using a combined strategy of web servers, MODELLER and molecular dynamics simulations. The resulting models provided grounds for the lack of quaternary structure in plant pyrophosphatases (Rosales-León et al., 2012). Although the sequences of several related isoforms were initially sent to the servers, only one isoform was correctly modeled according to Rd.HMM, but the Rd.HMM of the good model gave an alignment for the sequence of a second isozyme. This alignment and the correct model were then used to produce the second model. Though this last model was not directly based on experimental data, its quality was high according to Rd.HMM (Rosales-León et al., 2012).

**10. Rd-HMMer limitations**

Since most Rd.HMM limitations have already been mentioned, we only summarize them here:

a. Rd.HMM sensitivity makes it useful for medium to good quality models. Low quality models may still be of use as starting points, but the Rd.HMM data will only indicate the low quality and will not allow a wrong model to be discriminated from an unrefined one.

b. The structurally aware nature of the Rd.HMM alignments is to be trusted only for good quality models. As the Rd.HMM score drops, the sequence to structure correlation becomes weak.

c. Rd.HMM does not offer much information on how to modify the model to improve its appropriateness, other than the presence of insertion/deletions or sequence to structure frame-shifts.

d. A model may be badly refined and still get a good Rd.HMM score, as long as Rosetta-design is able to process the backbone coordinates and repack the residues. Therefore, the Rd.HMM score alone is insufficient. Information from other software, such as the ANOLEA energy (Melo & Feytmans, 1998) or the molecular mechanics energy (Hu & Jiang, 2010), is always required to test the quality of a model.

e. Finally, there is no formal proof of a perfect correspondence between a high Rd.HMM score and a native-like 3D-structure prediction. Therefore, of two predictions of which only one represents the native fold, it might be possible for the non-native one to produce a high Rd.HMM score for the target sequence (a false positive). However, despite our best efforts we have only found the false negative case, *i.e.* a good prediction (or even a 3D-structure from experimental data) may give a low Rd.HMM score. To the best of our knowledge, among the quality assessment methods, this feature is unique to the Rd.HMM protocol.

g. Finally, if you use the ROSETTA suite or the ROBETTA server to produce your models, these structures are expected to have a ROSETTA-like bias, *i.e.* their Rd.HMM scores will increase, and a good model with this bias is expected to have a ratio of HMMer score to sequence length close to one. While in models produced with other software an Rd.HMM score ratio of 0.3 is acceptable, in a ROSETTA-produced model this score is low and may reflect important flaws. Look at the alignment carefully, as recommended in the previous section.
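The score-to-length ratios quoted in this chapter (32/(88-7) = 0.395 for the second hit, 0.3 as an acceptable value, and a ratio close to one for ROSETTA-built models) can be turned into a small check. The sketch below is illustrative only: the function names and verdict strings are hypothetical, the text does not spell out what the subtracted 7 represents (it is treated here simply as positions excluded from the length), and the `rosetta_floor` default of 0.8 is an assumed reading of "close to one", not a value from the chapter.

```python
def score_ratio(score: float, length: int, excluded: int = 0) -> float:
    """Rd.HMM (HMMer) score divided by the number of positions counted."""
    counted = length - excluded
    if counted <= 0:
        raise ValueError("length minus excluded positions must be positive")
    return score / counted


def triage(
    score: float,
    length: int,
    excluded: int = 0,
    rosetta_built: bool = False,
    acceptable: float = 0.3,      # value quoted in the text
    rosetta_floor: float = 0.8,   # assumption: what "close to one" means
) -> str:
    """Return a short verdict for one model; a hint, not a quality proof."""
    if score < 0:
        # The text advises against trusting the alignment in this case.
        return "negative score: do not trust the alignment"
    ratio = score_ratio(score, length, excluded)
    if rosetta_built:
        if ratio >= rosetta_floor:
            return "consistent with ROSETTA bias"
        return "low for a ROSETTA-built model"
    return "acceptable" if ratio >= acceptable else "low"


if __name__ == "__main__":
    # Numbers taken from the figure 4 discussion: 32 / (88 - 7).
    print(round(score_ratio(32, 88, 7), 3))
    print(triage(32, 88, 7))
    print(triage(32, 88, 7, rosetta_built=True))
```

Whatever the verdict, it only summarizes the score; the limitations listed above still apply, so the alignment and an independent energy-based assessment remain necessary.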
