**6. Hidden Markov models to deal with the reverse folding problem**

A Markov model (MM) is a model of a stochastic process with the Markov property. A process has the Markov property if, along the random succession of states, the next state depends on the present state only, with no influence of the earlier states (Eddy, 2004). The change from one state to another is called a transition, and each transition has an associated probability. The states are finite or countable in number, but the succession itself may be infinite.

A MM can be used to describe a number of natural phenomena. For example, in a chemical kinetic mechanism, the states are chemical intermediates and the transition probabilities derive from the rate equations (Shapiro & Zeilberger, 1982). When the states are symbol emitters, each with a defined emission probability for every possible symbol, a succession of states will emit a sequence of symbols, for instance, an amino acid sequence (Eddy et al., 1995).
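The chemical-kinetics analogy can be sketched numerically. In the toy mechanism below, a hypothetical scheme A ⇌ B → C is encoded as a matrix of per-step transition probabilities (the numbers are invented for illustration, not real rate constants); repeatedly applying the memoryless update drives the state distribution into the absorbing product C.

```python
import numpy as np

# Hypothetical mechanism A <-> B -> C, discretized as per-step
# transition probabilities (rows: current state, columns: next state).
# The values are illustrative, not derived from real rate equations.
T = np.array([
    [0.90, 0.10, 0.00],   # A stays, or converts to B
    [0.20, 0.70, 0.10],   # B reverts to A, stays, or yields C
    [0.00, 0.00, 1.00],   # C is an absorbing product state
])

p = np.array([1.0, 0.0, 0.0])   # start with all molecules as A
for _ in range(500):            # memoryless update: p depends only on p
    p = p @ T

print(p)   # nearly all probability mass ends up in state C
```

Because each update uses only the current distribution, never the trajectory that produced it, the code is a direct expression of the Markov property described above.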

A very simple sequence-generating MM may consist of two states (Fig. 1). Let state *S*1 be an emitter of any of the 20 amino acid one-letter abbreviations, and let the amino acid composition of an infinite stream of symbols produced by *S*1 equal the composition of natural proteins. With a probability of 0.1, state *S*1 may undergo a transition to a second emitter, *S*2. In turn, *S*2 either transits back to *S*1, with probability 0.9, or emits a stop. These two states will shuttle back and forth to give sequences of variable length; with these probabilities, emission terminates after any given residue with probability 0.1 × 0.1 = 0.01, so the mean length is about 100 residues and much longer sequences are very infrequent.
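The two-state model of figure 1 is easy to simulate. The sketch below makes one simplifying assumption, flagged in the code: *S*1 emits residues with a uniform composition rather than the natural frequencies. Under the transition probabilities as stated (0.1 from *S*1 to *S*2; 0.9 from *S*2 back to *S*1, 0.1 to stop), termination follows a geometric law with per-residue probability 0.01, so the empirical mean length should come out near 100.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 one-letter codes

def emit_sequence(rng):
    """Emit one sequence from the two-state model of figure 1.

    Simplifying assumption: S1 uses a uniform amino acid
    composition instead of the natural-protein frequencies.
    """
    seq = []
    state = "S1"
    while True:
        if state == "S1":
            seq.append(rng.choice(AMINO_ACIDS))  # S1 emits one residue
            if rng.random() < 0.1:               # 0.1: move to S2
                state = "S2"
        elif rng.random() < 0.9:                 # S2: back to S1 (0.9)...
            state = "S1"
        else:                                    # ...or emit the stop (0.1)
            return "".join(seq)

rng = random.Random(42)
lengths = [len(emit_sequence(rng)) for _ in range(10_000)]
print(sum(lengths) / len(lengths))   # close to the expected mean of ~100
```

Swapping the uniform `rng.choice` for sampling from natural amino acid frequencies would reproduce the model in the text exactly, without changing the length distribution.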

Since a MM is a stochastic device, it is unsuited to represent only one particular sequence; instead, it can be a powerful tool to represent a subset of the sequence space, notably, a sequence alignment. Such a MM represents the observed aligned sequences, usually a subset of all the possible sequences compatible with the alignment, but the states of the model (each one encoding the probabilities of one or more alignment positions) cannot be observed directly. When such is the case, the MM is said to be hidden (HMM). However, the Viterbi algorithm, the forward algorithm, and the Baum–Welch algorithm make it possible to compute the most likely parameters of the model's states from the observations available (Eddy, 2004; Eddy et al., 1995).

**Figure 1.** A simple Markov model to emit random amino acid sequences of variable length with an amino acid composition similar to that in natural proteins.

Because the HMM represents more sequences than those observed, it can be used to produce new sequences, but, most importantly, for any available sequence in a database its emission probability by the HMM can be calculated and compared to the corresponding emission probability by a very general model, such as the one in figure 1. The ratio of these two probabilities can be used as an index or score: the higher the score, the higher the likelihood of the sequence being a member of the alignment. From this information, the expectancy of such an index value being due to chance can be estimated. Expectancies of one or above may indicate a meaningless score.

222 Bioinformatics

On the Assessment of Structural Protein Models with ROSETTA-Design and HMMer: Value, Potential and Limitations 223

HMMer is a suite of programs developed by Sean Eddy (Eddy, 2004; Eddy et al., 1995) to create and use HMMs of amino acid and nucleic acid sequence alignments. HMMer has executables to estimate the parameters of a HMM from a sequence alignment and to calibrate the model, allowing a good estimation of scores and expectancies. Other executables will scan a sequence database to extract those sequences with a high score and low expectancy, aligning the new sequences to the model. Additional executables can create the starting HMM, use it to emit sequences with a high probability of being members of the model, or update the model parameters using the newly discovered additional sequences.

One critical step in the HMM preparation is the starting alignment fed to HMMer (Eddy, 2004) because, as mentioned in section 1, the optimal alignment of sequence sets is not a trivial problem (Lathrop, 1994). When a HMM gives poor results, it is frequently as a consequence of a defective alignment.

Another limitation of a HMM lies in its very definition, because a MM must be memoryless (the Markov property). In the 3D structure of proteins, amino acids brought into proximity during folding must be of compatible nature from the steric and chemical points of view. This property is stored in the sequence as sites with correlated variability, also known as mutual information. Its relevance has been recognized and exploited (Socolich et al., 2005), but HMMs are unable to encode such information.

After the above discussion of HMMer features, we can consider its value in dealing with the large set of amino acid sequences redesigned by the Rd protocol described in the preceding section:

1. Rd.HMM produces many sequences that can be trivially aligned, because every amino acid has a biunivocal correspondence with a 3D-backbone site.

2. Using a HMM to represent the redesigned sequences will result in the statistical extension of the sample to sequences with a similar frequency profile. This extension is, however, inaccurate, because not all of the sequences possibly emitted by the HMM will actually be low-energy solutions to the 3D-backbone redesign (Hamelryck et al., 2010).

3. The HMM can be used to search for those natural sequences having amino acid combinations suitable to the 3D-structure under analysis. The value of HMMs in the analysis of relationships between biological sequences has been extensively documented (Eddy, 2004).

4. Due to (1), the search made in a database of natural sequences by means of the HMM will align each selected sequence in a structurally aware manner (Martínez-Castilla & Rodríguez-Sotres, 2010). But given (2), such a structurally aware alignment is somewhat inaccurate: the lower the HMMer score, the less reliable this alignment becomes.
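The score-as-probability-ratio idea discussed above can be sketched with a toy, position-specific profile. All the numbers below are invented for illustration, and a real HMMer profile additionally models insert and delete states and reports calibrated expectancies; the sketch only shows the core log-odds arithmetic against a uniform null model.

```python
import math

# Toy 4-column profile: per-position emission probabilities
# (invented values; residues absent from a column get a floor
# probability, standing in for the pseudocounts a real model uses).
profile = [
    {"A": 0.70, "G": 0.20, "S": 0.10},
    {"C": 0.90, "S": 0.10},
    {"D": 0.50, "E": 0.50},
    {"K": 0.60, "R": 0.40},
]
BACKGROUND = 1.0 / 20.0   # null model: uniform amino acid usage

def log_odds_score(seq, profile, floor=1e-4):
    """Sum over positions of log2(P_profile / P_background)."""
    score = 0.0
    for residue, column in zip(seq, profile):
        p = column.get(residue, floor)   # emission prob. at this column
        score += math.log2(p / BACKGROUND)
    return score

print(log_odds_score("ACDK", profile))   # high: fits the profile well
print(log_odds_score("WWWW", profile))   # negative: worse than chance
```

A sequence scoring well is more likely under the profile than under the general background model, which is exactly the criterion HMMer uses to rank database hits before estimating how often such a score would arise by chance.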

