**7. The unexpected sensitivity of Rd-HMMer**

In theory, when an Rd.HMM is used to scan a general sequence database, such as the NCBI-nr (Jiang et al., 2008), a sequence is selected if it is considerably more likely to have been emitted by the Rd.HMM than to have been generated at random. Because the Rd-step retains only information related to the 3D-fold, which is then fed into the HMM, any selected sequence should be able to fold into a 3D-structure very similar to the starting one. Sequences selected by HMMer should therefore belong to the same folding family.
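The selection criterion described above is a log-odds comparison between the model and a random background. The following is a minimal sketch of that idea only, assuming a toy three-position profile without insert or delete states and a uniform background; neither is taken from a real Rd.HMM, and the names `TOY_PROFILE`, `fill_profile_column` and `log_odds_bits` are illustrative.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed uniform background; real searches use empirical frequencies.
BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def fill_profile_column(favoured):
    """Spread the probability mass not assigned in `favoured` over the other residues."""
    rest = 1.0 - sum(favoured.values())
    others = [aa for aa in AMINO_ACIDS if aa not in favoured]
    column = {aa: rest / len(others) for aa in others}
    column.update(favoured)
    return column


# Hypothetical three-position profile (match states only).
TOY_PROFILE = [
    fill_profile_column({"G": 0.80}),
    fill_profile_column({"L": 0.50, "I": 0.30}),
    fill_profile_column({"K": 0.60, "R": 0.20}),
]


def log_odds_bits(sequence, profile=TOY_PROFILE, background=BACKGROUND):
    """Sum of log2(P(residue | profile) / P(residue | background)) over positions."""
    if len(sequence) != len(profile):
        raise ValueError("this toy model has no insert or delete states")
    return sum(
        math.log2(column[aa] / background[aa])
        for aa, column in zip(sequence, profile)
    )


print(log_odds_bits("GLK"))  # positive: better explained by the profile
print(log_odds_bits("WWW"))  # negative: better explained by the background
```

A positive score means the sequence is better explained by the profile than by chance. A real profile HMM additionally accounts for transitions, insertions and deletions, which this sketch omits.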

On the Assessment of Structural Protein Models with ROSETTA-Design and HMMer: Value, Potential and Limitations 225


One of the unexpected results of Rd.HMM is the sensitivity of the protocol. For instance, it is able to separate the sequences of the TIM-barrel fold that belong to the triose phosphate isomerases from those that belong to other TIM-barrels, such as the phosphoribosylanthranilate isomerase (PRAI) (Martínez-Castilla & Rodríguez-Sotres 2010).

Apparently, the Rd-step can imprint its artificial sequences with some details related to loop and turn shapes, as well as to the contacts between secondary structure elements within the tertiary structure adopted by the original polypeptide chain. Consequently, a single Rd.HMM gives significant scores to the sequences of two proteins with completely different activities only when both retain an almost identical structure. Such is the case of the novel engineered retroaldolases (RA-61, RA-22) and the corresponding templates used to host the newly designed amino acid catalyst, a β-1,4-*endo*-xylanase from *Nonomuraea flexuosa* and an indole-3-glycerol phosphate synthase from *Sulfolobus solfataricus* (Martínez-Castilla & Rodríguez-Sotres 2010; Jiang et al., 2008). This was also the case with the imidazoleglycerolphosphate synthase from *Thermotoga maritima* and the engineered imidazoleglycerol\_evolvedcerolphosphate synthase (Martínez-Castilla & Rodríguez-Sotres 2010; Röthlisberger et al., 2008).

The remarkable sensitivity of the Rd.HMM protocol is also reflected in the change of the score obtained when the 3D-structure of a protein has been resolved by NMR, as compared to its X-ray 3D-structure. An Rd.HMM produced from an X-ray 3D-structure will score its corresponding natural sequence with a value close to 0.6 times the length of its amino acid sequence. In contrast, an Rd.HMM from an NMR-derived structure will report about half of that score for its corresponding natural sequence (Martínez-Castilla & Rodríguez-Sotres 2010).
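As a rough worked example of this rule of thumb, the sketch below applies the factors quoted in the text: about 0.6 per residue for an X-ray structure and about half of that for an NMR structure. The function name and the 250-residue length are hypothetical, and the result is only an order-of-magnitude expectation, not a prediction for any particular protein.

```python
# Approximate score-per-residue factors taken from the text above.
SCORE_PER_RESIDUE = {"xray": 0.6, "nmr": 0.3}


def expected_self_score(sequence_length, method):
    """Rough expected score of the natural sequence against its own Rd.HMM."""
    return SCORE_PER_RESIDUE[method] * sequence_length


for method in ("xray", "nmr"):
    print(method, round(expected_self_score(250, method), 1))
```

For a hypothetical 250-residue protein this gives about 150 for the X-ray-derived Rd.HMM and about 75 for the NMR-derived one.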

As an additional test of the Rd.HMM sensitivity, we compared the Rd.HMMs corresponding to subunit A from two prokaryotic glycyl-tRNA synthetases, one from *Thermus thermophilus* and another from *Thermotoga maritima*. These two X-ray resolved structures have a very similar core (Fig. 2A), but the sequence similarity is below 15%. Despite the structural similarity, both proteins have extensive regions where the structure differs completely. Accordingly, the 1ATI:A Rd.HMM scored the *T. thermophilus* sequence (its corresponding natural one) with a value of 161.8 (score over sequence length 0.37) and a highly significant *E*-value (3.8 × 10⁻⁴⁹), but the *T. maritima* sequence received a negative score of -271.1 (score over sequence length -0.94) lacking statistical significance (*E*-value 2). In contrast, the 1J5W:A Rd.HMM scored the *T. thermophilus* sequence with a value of -200.0 (-0.88) lacking significance (*E*-value 0.065), and the *T. maritima* sequence (its corresponding natural one) received a positive score of 30.3 (0.11) and high statistical significance (*E*-value 8.4 × 10⁻¹⁷). In these cases, the scores were obtained by lowering the software threshold, because in a standard search of the NCBI-nr the 1ATI:A Rd.HMM only identified the *T. thermophilus* glycyl-tRNA synthetase amino acid sequence and its homologues.
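The reciprocal comparison above can be summarized with a small sketch that tabulates the four reported scores and flags the significant ones. The scores and *E*-values are those quoted in the text; the cutoff `E_VALUE_CUTOFF` and the names `Hit` and `is_significant` are assumptions made for illustration, not settings taken from the original work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    model: str      # structure used to build the Rd.HMM
    sequence: str   # organism of the scored natural sequence
    score: float    # score reported in the text
    e_value: float  # E-value reported in the text


REPORTED_HITS = [
    Hit("1ATI:A", "T. thermophilus", 161.8, 3.8e-49),
    Hit("1ATI:A", "T. maritima", -271.1, 2.0),
    Hit("1J5W:A", "T. thermophilus", -200.0, 0.065),
    Hit("1J5W:A", "T. maritima", 30.3, 8.4e-17),
]

E_VALUE_CUTOFF = 1e-3  # assumed cutoff, chosen only for this example


def is_significant(hit, cutoff=E_VALUE_CUTOFF):
    """A hit counts as significant if its score is positive and its E-value is below the cutoff."""
    return hit.score > 0 and hit.e_value < cutoff


significant = [(h.model, h.sequence) for h in REPORTED_HITS if is_significant(h)]
for model, sequence in significant:
    print(f"{model} Rd.HMM recognizes the {sequence} sequence")
```

Under this assumed cutoff, each Rd.HMM recognizes only the sequence of the structure it was derived from.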

224 Bioinformatics


**Figure 2.** (A) Comparison of glycyl-tRNA synthetases from *Thermus thermophilus* (PDB 1ATI:A, yellow tube) and from *Thermotoga maritima* (PDB 1J5W:A, green trace). The core α/β region (shown as cartoons) was superimposed using TOPOFIT (Ilyin et al., 2004) and colored according to its sequence similarity from blue (identical) to white (dissimilar). The figure was prepared using VMD (Humphrey et al., 1996). (B) HMM logo of the profile-to-profile alignment (Schuster-Böckler & Bateman, 2005) of the Rd.HMMs from the glycyl-tRNA synthetases in (A). (C) The segments in (A) corresponding to the nodes in the alignment in (B).

In the previous example, the dissimilar regions carry enough information to allow discrimination between the two structures. In addition, since the scores for the non-related sequence were negative in each case, the alignment of both sequences produced by either Rd.HMM is unreliable. Figure 2B shows the profile-to-profile comparison of HMM logos (Schuster-Böckler & Bateman, 2005) for the Rd.HMMs derived from both glycyl-tRNA synthetases, which paired a significant subset of both Rd.HMMs. The corresponding segments were indeed structurally related (Fig. 2C).
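HMM logos such as the one in Figure 2B convey how strongly each position of a profile is constrained. As a minimal sketch of that notion, assuming a uniform 20-letter background and a plain Shannon-entropy measure (which may differ from the exact measure used to draw HMM logos), the information content of a single alignment column can be estimated as follows. The function `column_information_bits` and the example columns are hypothetical.

```python
import math
from collections import Counter


def column_information_bits(column, alphabet_size=20):
    """Information content of one alignment column: log2(alphabet size) minus Shannon entropy."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


# An invariant column carries the maximum information for this alphabet,
# whereas a column in which every residue type occurs equally often carries none.
print(column_information_bits("GGGGGGGG"))
print(column_information_bits("ACDEFGHIKLMNPQRSTVWY"))
```

Positions with little or no variation therefore stand out with high values, while freely varying positions approach zero.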

**Figure 3.** (A) TOPOFIT (Ilyin et al., 2004) sequence alignment based on the structural alignment in Figure 2. Amino acids are aligned if their backbones are less than 3 Å apart. (B) Alignment guided by the Rd.HMM derived from 1ATI:A. (C) Alignment guided by the Rd.HMM derived from 1J5W:A. For clarity, only the section of the alignment including amino acids 166 to 424 of 1ATI:A is shown.

Figure 3 (A to C) shows the lack of coincidence between the TOPOFIT structural alignment (Ilyin et al., 2004) and the two Rd.HMM-based alignments for the core regions. A careful analysis of the alignments in Figure 3 suggests a possible explanation for the notable specificity of the 1ATI:A and 1J5W:A Rd.HMMs. While repacking the rotamers into the theoretical 3D-structures, Rosetta-design identifies sites of low or no variation, which have a higher informational content. These sites are clearly distributed in a rather different way in the 1ATI:A and 1J5W:A Rd.HMMs (Fig. 3B and C), and only a fraction of these low-variance sites coincide with structurally equivalent sites (Fig. 3A), making each Rd.HMM different enough.

A similar analysis of the lysozymes from lambda phage (or *E. coli*), T4 phage, chicken (*Gallus gallus*) and goose (*Anser anser anser*) led to very similar results.

A somewhat artificial example comes from the Rosetta-designed non-natural proteins Top7 and M. This example is used to illustrate the interpretation of the Rd.HMM information in the next section.

**8. The Rd-HMMer protocol: A practical guide**

This section describes how to generate an Rd.HMM and interpret the results.

1. Software requirements:

   - Rosetta suite v. 2.3 or above. The examples given here apply to v. 2.3, but the porting to v. 3.1 is straightforward. Rd v. 2.1 is considerably faster, but exploration of the sequence space is better in Rd v. 3.1.
   - HMMer v. 2 or above. The examples given here were done with v. 3, which is considerably faster and therefore recommended.
   - VMD v. 1.8.7 or above (Humphrey et al., 1996).
   - SwissPDB viewer v. 4 or above (Kaplan & Littlejohn, 2001).
   - Sequence databases. You may download the protein nr, RefSeq or UniProt-Sprot databases from the NCBI site (Sayers et al., 2010), or any other database that fulfils your needs. As an alternative, you may prepare a small database using psi-blast at any server. It is recommended to include not only the sequences of proteins related to the structure of interest, but also other unrelated sequences, preferably selected at random.

   VMD and SwissPDB viewer are not essential, but they are very useful for PDB file manipulation.

2. Prepare your PDB file. Rd v. 2.3 requires your PDB file to have non-zero beta factors. Residues may be absent, as long as none of the corresponding backbone heavy atoms are present. Therefore, the backbone atoms N, CA, C and O of each residue should be either all present or all absent. For an incomplete residue, you may open your file with SwissPDB viewer. This program will rebuild the missing atoms, which is recommended for models; alternatively, you can use the software to remove the residue completely, which is preferable for experimental data. A special case is the oxygen atom of the C-terminus (OXT), which is required by Rd. This atom can be rebuilt with SwissPDB viewer, but this is not done automatically. An alternative to SwissPDB viewer is VMD with the psfgen plugin. Although PDB manipulation in VMD requires more experience, its scripting language is more powerful.

   If your structural PDB file comes from a modeling exercise, review the geometrical and steric quality of your model. If required, refine it with molecular mechanics software.
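The requirements for the input PDB file (non-zero beta factors, backbone atoms either all present or all absent for each residue, and a C-terminal OXT atom) can be checked before running Rd. The sketch below is one possible pre-check, assuming a single-chain file with standard fixed-column `ATOM` records; the function `check_pdb_for_rd` is hypothetical and is not part of Rosetta, VMD or SwissPDB viewer.

```python
BACKBONE = {"N", "CA", "C", "O"}


def check_pdb_for_rd(pdb_text):
    """Return (incomplete_residues, zero_b_factor_atoms, has_oxt) for a single-chain PDB text.

    incomplete_residues lists (chain, residue number, insertion code) for residues
    that have some, but not all, backbone heavy atoms.
    """
    residues = {}  # keeps file order: (chain, resSeq, iCode) -> set of atom names
    zero_b = 0
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()                  # columns 13-16
        key = (line[21], int(line[22:26]), line[26])     # chain, resSeq, iCode
        residues.setdefault(key, set()).add(atom_name)
        if float(line[60:66]) == 0.0:                    # columns 61-66, B-factor
            zero_b += 1
    incomplete = [
        key for key, atoms in residues.items()
        if 0 < len(atoms & BACKBONE) < len(BACKBONE)
    ]
    # Assumption: the last residue in the file is the C-terminus of the only chain.
    has_oxt = bool(residues) and "OXT" in residues[list(residues)[-1]]
    return incomplete, zero_b, has_oxt
```

Residues reported as incomplete can then be rebuilt or removed as described in the text, and a missing OXT atom can be added before the file is passed to Rd. Files with several chains would need the OXT check repeated for each chain.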
