**8. The Rd-HMMer protocol: A practical guide**

This section describes how to generate a Rd.HMM and interpret the results.

1. Software requirements:

226 Bioinformatics

**Figure 3.** (A) TOPOFIT (Ilyin et al., 2004) sequence alignment based on the structural alignment in figure 2. Amino acids are aligned if their backbones are less than 3 Å apart. (B) Alignment guided by the Rd.HMM derived from 1ATI:A. (C) Alignment guided by the Rd.HMM derived from 1J5W:A. For clarity, only the section of the alignment including aminoacids 166 to 424 of 1ATI:A is shown.

Figure 3 (A to C) shows the lack of coincidence between the TOPOFIT structural alignment (Ilyin et al., 2004), and the two Rd.HMM based alignments for the core regions. A careful analysis of the alignments in figure 3 suggests a possible explanation for the notable specificity of 1ATI:A and 1J5W:A Rd.HMMs. While repacking the rotamers into the theoretical 3D-structures, Rosetta-design identifies sites of low or no variation, with higher informational content. Clearly these sites are distributed in a rather different way on the


VMD and SwissPDB viewer are not essential but are very useful for PDB file manipulation.

2. Prepare your PDB file. Rd v. 2.3 requires your PDB file to have non-zero beta factors. Residues may be absent, as long as none of the corresponding backbone heavy atoms are present. Therefore atoms with types C, CA, N, C and O for a particular residue should be all present for each residue, or all absent. For an incomplete residue you may open your file with Swiss PDB viewer. This program will rebuild the missing atoms, which is recommended for models; but you can use the software to completely remove the residue, which is preferable for experimental data. A special case is the oxygen atom of the C-terminus (OXT), which is required by Rd. This atom can be rebuilt with SwissPDB viewer, but this is not done automatically. An alternative to Swiss PDB viewer is VMD using the PFSgen plugin. Although PDB manipulation in VMD requires more experience, its scripting language is more powerful.

If your structural PDB file comes from a modeling exercise, review the geometrical and sterical quality of your model. If required, refine it with molecular mechanics software. A

more detailed description of this kind of problems in protein modeling and how to fix them can be found elsewhere (Chavelas Adame et al., 2011; Rosales-León et al., 2012).

On the Assessment of Structural Protein Models with ROSETTA-Design and HMMer: Value, Potential and Limitations 229

Here a local copy of the nr is assumed to be in your system in fasta format. The -Z flag will scale the E-values to 10 million sequences. This is recommended to make E-values comparable, because E-values are linearly dependent on the size of the sequence database searched. The default E value is 10, but here it was set to 100 to lower the search threshold.

**Figure 4.** An HMMer search output result. The search was done using an Rd.HMM from Top7 (PDB id. 1QYS) and the NCBI-nr as database. (A) heading, (B) scores, (C) domain-parsed scores and alignments. The statistics at the end and some information was removed for brevity. The format is as in HMMer 3.0.

An extract of the results from a typical search is presented in figure 4. The HMMer search output will report three sections: (a) Heading, (b) scores for complete sequences, (c) domain parsing, alignments and statistics. As it can be seen, according to the information in the Rd.HMM from Top7, the Top7 amino acid sequence fits into the Top7 3D-atomic coordinates (1QYS). The most relevant sections are the scores and the alignment sections. Notice how this X-ray solved 3D-stucture reports an HMM score of 51.2, matching the sequence from amino acid 3 to 94, that gives a ratio of 0.56, close to the 0.6 average value for X-ray solved structures. The reason behind the relationship is not simple, but it holds for most X-ray solved structures (with a few exceptions) (Martínez-Castilla & Rodríguez-Sotres 2010). The second hit is the C-terminal fragment of Top7 solved by NMR, the score is 39.8 for a fragment of length 50 (ratio = 0.79). This last score is higher than the score for the complete sequence, because as shown in figure 4, the C-terminus has higher proportion of local coincidences to the HMM. In the alignment to the full sequence, the contribution of the N-terminus lowers the overall score. The alignment for the C-terminal fragment was omitted because it is identical to the 1QYS alignment from position 44 to 91 (Fig. 4C). The alignment shows a consensus for the hidden Markov model, as a reference, then the sequence found aligned separated by an intermediate mask. Uppercase letters indicate


*rosetta.gcc -s 1QYS.pdb-design -fixbb -chain A -resfile 1QYSaa.res-ndruns 1 -pdbout 1QYSAaa* 

Here, we assume 1QYS.pdb to be the starting PDB file, *A* to be the subunit of interest, *1QYSaa.res* is the *resfile*, and your result will be named *1QYSAaa\_0001.pdb* (depending on the version, you will also need a paths.txt in your folder, or the path to the rosetta database should be indicated in the command line). In Rd v. 3.1, the *resfile* format and some command line options have changed (check Rosetta documentation for details).

This step can be repeated at will, to create many sequence-randomized PDB files, but in our experience, at least 10 are needed for a reliable HMM.

4. Rebuild each sequence-randomized PDB file with Rd. First you will need a *resfile* with the tag *ALLAA* (1QYSall.res), and a text file (pdb4rbld.lst) containing the names of all the sequence-randomized PDB file created in the previous step, one per line. Then, rebuild each input file 29 times using the command:

*rosetta.gcc -design -fixbb -chain A-l pdb4rbld.lst -resfile 1QYSall.res -ndruns 29 -pdbout* 

You can generate many rebuilt PDBs per input file, but you need at least 100 sequences in the end to produce a representative HMM. In our experience, a better exploration of the sequence space results from many sequence-randomized input files and between 10 to 30 rebuilt PDBs for each input PDB file.

5. Extract the amino acid sequence for each rebuilt PDB file and save it in a text file (*1QYSA\_a-O.fas*), in fasta format. This file represents an alignment, though a sequence alignment software is not necessary, due to the reasons commented at the end of section 5. Then use HMMer to prepare a hidden Markov model of your sequence alignment:

*hmmbuild --informat afa 1QYSA\_a-O.hmm 1QYSA\_a-O.fas* 

If you are using HMMer v. 2.0, you need to calibrate your model with:

*hmmcalibrate 1QYSA\_a-O.hmm* 

6. Search a sequence database (*i.e.* NCBI-nr) with:

*hmmsearch -E 100 -Z 10000000 1QYSA\_a-O.hmm path2db/nr* 

Here a local copy of the nr is assumed to be in your system in fasta format. The -Z flag will scale the E-values to 10 million sequences. This is recommended to make E-values comparable, because E-values are linearly dependent on the size of the sequence database searched. The default E value is 10, but here it was set to 100 to lower the search threshold.

228 Bioinformatics

more detailed description of this kind of problems in protein modeling and how to fix them

3. Build many replicates of your PDB file with a random assignment of amino acid

a. *With VMD*. Use the *atomselect* command to select the backbone atoms of all residues, one at a time, and change the residue name to any amino acid selected at random. In the C-terminal residue make sure to include the OXT atom in your selection. Then select all backbone atoms including the OXT and save the file. A script to do this with VMD can

b. *With Rd*. Prepare several Rosetta input *resfile* with a tag *PICKAA X* (replace X for a random 1-letter amino acid code) for every position in the PDB file of interest. Then run

Here, we assume 1QYS.pdb to be the starting PDB file, *A* to be the subunit of interest, *1QYSaa.res* is the *resfile*, and your result will be named *1QYSAaa\_0001.pdb* (depending on the version, you will also need a paths.txt in your folder, or the path to the rosetta database should be indicated in the command line). In Rd v. 3.1, the *resfile* format and some command

This step can be repeated at will, to create many sequence-randomized PDB files, but in our

4. Rebuild each sequence-randomized PDB file with Rd. First you will need a *resfile* with the tag *ALLAA* (1QYSall.res), and a text file (pdb4rbld.lst) containing the names of all the sequence-randomized PDB file created in the previous step, one per line. Then,

You can generate many rebuilt PDBs per input file, but you need at least 100 sequences in the end to produce a representative HMM. In our experience, a better exploration of the sequence space results from many sequence-randomized input files and between 10 to 30

5. Extract the amino acid sequence for each rebuilt PDB file and save it in a text file (*1QYSA\_a-O.fas*), in fasta format. This file represents an alignment, though a sequence alignment software is not necessary, due to the reasons commented at the end of section 5. Then use HMMer to prepare a hidden Markov model of your sequence alignment:

*rosetta.gcc -design -fixbb -chain A-l pdb4rbld.lst -resfile 1QYSall.res -ndruns 29 -pdbout* 

*rosetta.gcc -s 1QYS.pdb-design -fixbb -chain A -resfile 1QYSaa.res-ndruns 1 -pdbout 1QYSAaa* 

can be found elsewhere (Chavelas Adame et al., 2011; Rosales-León et al., 2012).

Rd, once for each *resfile* you made, with the following command:

line options have changed (check Rosetta documentation for details).

sequence. This can be done in two ways:

be requested to the corresponding author.

experience, at least 10 are needed for a reliable HMM.

rebuilt PDBs for each input PDB file.

*hmmcalibrate 1QYSA\_a-O.hmm* 

rebuild each input file 29 times using the command:

*hmmbuild --informat afa 1QYSA\_a-O.hmm 1QYSA\_a-O.fas* 

6. Search a sequence database (*i.e.* NCBI-nr) with:

*hmmsearch -E 100 -Z 10000000 1QYSA\_a-O.hmm path2db/nr* 

If you are using HMMer v. 2.0, you need to calibrate your model with:


**Figure 4.** An HMMer search output result. The search was done using an Rd.HMM from Top7 (PDB id. 1QYS) and the NCBI-nr as database. (A) heading, (B) scores, (C) domain-parsed scores and alignments. The statistics at the end and some information was removed for brevity. The format is as in HMMer 3.0.

An extract of the results from a typical search is presented in figure 4. The HMMer search output will report three sections: (a) Heading, (b) scores for complete sequences, (c) domain parsing, alignments and statistics. As it can be seen, according to the information in the Rd.HMM from Top7, the Top7 amino acid sequence fits into the Top7 3D-atomic coordinates (1QYS). The most relevant sections are the scores and the alignment sections. Notice how this X-ray solved 3D-stucture reports an HMM score of 51.2, matching the sequence from amino acid 3 to 94, that gives a ratio of 0.56, close to the 0.6 average value for X-ray solved structures. The reason behind the relationship is not simple, but it holds for most X-ray solved structures (with a few exceptions) (Martínez-Castilla & Rodríguez-Sotres 2010). The second hit is the C-terminal fragment of Top7 solved by NMR, the score is 39.8 for a fragment of length 50 (ratio = 0.79). This last score is higher than the score for the complete sequence, because as shown in figure 4, the C-terminus has higher proportion of local coincidences to the HMM. In the alignment to the full sequence, the contribution of the N-terminus lowers the overall score. The alignment for the C-terminal fragment was omitted because it is identical to the 1QYS alignment from position 44 to 91 (Fig. 4C). The alignment shows a consensus for the hidden Markov model, as a reference, then the sequence found aligned separated by an intermediate mask. Uppercase letters indicate strong conservation, lower case letter conservative changes and plus sign a positive local score. The lower line, absent in HMMer 2 is the encoded posterior probability (d=0...9,\*; \* equals 9.5), where the approximate value of posterior probability for each site is given by equation (1).

$$pp = d \times 0.1 + \ 0.025 \tag{1}$$

On the Assessment of Structural Protein Models with ROSETTA-Design and HMMer: Value, Potential and Limitations 231

published a practical guide to the use of the popular modeling software MODELLER (Fiser & Sali 2003), where many useful hints are given. Recently, Chavelas-Adame and coworkers published a guide with emphasis on the use of open software [45]. The present account will not attempt to repeat the work, and only the most important conclusions are given here:

a. Many servers and software programs are available to aid the comparative modeling of proteins (Roy et al., 2010; Kim et al., 2004; Karplus, 2009; Melo & Feytmans, 1998; Rosales-León et al., 2012; Fiser & Sali 2003), some options are available for *ab initio* modeling (Kryshtafovych et al., 2009; Kim et al., 2004; Srinivasan et al., 2004; Xu & Zhang, 2012), and this list is far form complete. None of them achieves 100% success, and even the most successful can fail where other, usually less reliable, may succeed

b. A model is fundamentally wrong when the folding pattern in the model bears little or no relationship to the true native fold. Some models may offer a good approximation, but have wrong geometrical, sterical and/or chemical features at some locations, *i.e.* the bond lengths, angles, sidechain-sidechain contact distances and orientation may have important deviations from the expected values found in known chemical structures.

c. Unrefined models can be recognized with various energy scoring strategies (Luthy et al., 1992; Shi et al., 2009; Hu & Jiang, 2010; Melo & Feytmans, 1998); and can be corrected through the use of molecular mechanics software (Rosales-León et al., 2012; Fiser & Sali 2003), though this approach has limitations, as mentioned before (Faver et

d. Wrong models instead may frequently be deceitful, because, due to their systematic error [5], a molecular mechanics force-field may report a low energy value, as long as the chemical and geometrical details are well refined. Rd.HMM offers a solution to this problem, because these models will produce an HMM search report with no hits, or will score sequences, other than the modeling target (Chavelas Adame et al., 2011; Rosales-

e. The analysis of the Rd.HMM search report may help in the identification of errors in the alignment between the target amino acid sequence and the template selected for comparative modeling. If you find a frame-shift or an unexpected insertion/deletion pair, you can use the HMM search alignment and realign the target sequence and the template. MODELLER is a very good choice for that aim (Fiser & Sali 2003). In addition, a wrongly threaded model can be recycled by replacing the consensus sequence with the PDB sequence in the model (which is the target sequence), and producing a target to target alignment, with the insertions and deletions suggested by HMMer. MODELLER can then be used to generate new models. This last procedure is only recommended if your HMMer score is positive and has good statistical significance, for otherwise, the

f. Comparative modeling has been extended thanks to methods able to find templates with low sequence homology to the target (Wallace et al., 2005; Karplus, 2009). But sometimes the selected template is too distant. If the Rd.HMM of the candidate structures are obtained, these can be use to score the target sequence. The resulting

(Kryshtafovych et al., 2009; Melo & Feytmans, 1998).

This last kind are usually designated as unrefined models.

al., 2011; Hu & Jiang, 2010; Melo & Feytmans, 19985).

structural inaccuracy of the Rd.HMMs becomes a serious issue.

León et al., 2012).

The final hit in the search in figure 4 is the M artificial protein. This protein was designed with Rosetta-design using the same Top7 folding. Its sequence is different, but it belongs to the same family of Rd proteins. The score is smaller than for Top7 (ratio of 32/(88-7), or 0.395), but still above 0.3 and with high statistical significance. Although Rd was used to design these proteins, the concordance reveals the robustness of the amino acid assignment made by Rd, and gives further support to the structurally aware nature of the Rd.HMM alignments.

The alignment is very useful to protein modeling, because it reveals the distribution of coincident regions between the 3D-atomic coordinates of the backbone and the amino acid sequence in the database. The following features are to be taken into account:

