**3. Secondary structure prediction by BRNN**

For each protein sequence, the secondary structure is predicted with an ensemble of BRNNs. The input for the prediction is a single protein sequence taken from a multiple sequence alignment. The BRNN then draws on sequence information derived from PSI-BLAST (Altschul et al., 1997), so that homology information is included in the secondary structure prediction. The predicted secondary structure is divided into three classes: alpha-helices, beta-sheets, and coils.

The BRNN is illustrated in Fig. 2–3. At each position $i$ of the protein sequence, the BRNN involves an input variable $X_i$, a chain of hidden variables in the forward direction $F_i$ and in the backward direction $B_i$, and an output variable $O_i$. The relationships between these variables are implemented using feedforward NNs. Three NNs, $N_o$, $N_f$, and $N_b$, are used to implement the BRNN. The output $O_i$ (Chen and Chaudhari, 2007) is as follows:

$$
O_i = N_o(X_i, F_i, B_i). \tag{1}
$$

The output $O_i$ depends on the input $X_i$ at position $i$, on the forward hidden context $F_i \in \mathbb{R}^n$ (Chen and Chaudhari, 2007), and on the backward hidden context $B_i \in \mathbb{R}^m$ (Chen and Chaudhari, 2007), where $n$ and $m$ are the dimensions of the two context vectors. $F_i$ and $B_i$ are obtained by applying the BRNN recurrences given in Eqs. (2) and (3).

**4. Feature extraction**

Feature extraction in BRNN-SVM is important for obtaining protein domain information from the predicted secondary structure. The secondary structure information is used to compute the changes along the protein sequence at positions that form part of a protein domain boundary. This information is believed to reflect structural properties of the protein that are informative about its domain structure, and it is used to detect the protein domain boundaries. The information, shown in Fig. 4–9, comprises entropy, protein sequence termination, correlation, contact profile, physicochemical properties, and intron-exon information.

$$F_i = N_f(X_i, F_{i-1}), \tag{2}$$

$$B_i = N_b(X_i, B_{i+1}), \tag{3}$$

where $N_f(X_i, F_{i-1})$ and $N_b(X_i, B_{i+1})$ are learnable non-linear state transition functions. The boundary conditions for $F_i$ and $B_i$ can be set to 0, for example $F_0 = B_{n+1} = 0$, where $n$ here denotes the length of the protein sequence being processed.

$N_f$ and $N_b$ can each be viewed as a "tool" that is shifted along the protein sequence. To predict the class at position $i$, the two tools are shifted in opposite directions, starting from the N-terminus and the C-terminus respectively, up to position $i$. The tool outputs at position $i$ are then combined with the input $X_i$ to compute the output $O_i$ using $N_o$. From the output $O_i$, the membership probability of the residue at position $i$ is computed to predict the domain boundary.
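The shifting of the two "tools" can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the one-hot input encoding, the single-layer tanh networks standing in for $N_f$ and $N_b$, the softmax used for $N_o$, the hidden size of 8, and the random, untrained weights are all chosen for the example, and the helper names (`one_hot`, `make_layer`, `layer`, `brnn_forward`) are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Encode one residue as a 20-dimensional indicator vector X_i (assumed encoding)."""
    return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def layer(params, inputs, activation):
    """Apply one fully connected layer followed by an element-wise activation."""
    weights, biases = params
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def brnn_forward(sequence, n_f, n_b, n_o, hidden):
    """Return O_i = N_o(X_i, F_i, B_i) for every position of the sequence."""
    n = len(sequence)
    X = [one_hot(res) for res in sequence]
    # Boundary states: F[0] plays the role of F_0 = 0, B[n + 1] of B_{n+1} = 0.
    F = [[0.0] * hidden for _ in range(n + 2)]
    B = [[0.0] * hidden for _ in range(n + 2)]
    # Forward tool, shifted from the N-terminus: F_i = N_f(X_i, F_{i-1}).
    for i in range(1, n + 1):
        F[i] = layer(n_f, X[i - 1] + F[i - 1], math.tanh)
    # Backward tool, shifted from the C-terminus: B_i = N_b(X_i, B_{i+1}).
    for i in range(n, 0, -1):
        B[i] = layer(n_b, X[i - 1] + B[i + 1], math.tanh)
    # Output network: combine X_i with both contexts at each position.
    outputs = []
    for i in range(1, n + 1):
        scores = layer(n_o, X[i - 1] + F[i] + B[i], lambda v: v)
        outputs.append(softmax(scores))
    return outputs


rng = random.Random(0)
hidden = 8
n_f = make_layer(20 + hidden, hidden, rng)
n_b = make_layer(20 + hidden, hidden, rng)
n_o = make_layer(20 + 2 * hidden, 3, rng)
probs = brnn_forward("MKTAYIAKQR", n_f, n_b, n_o, hidden)
print(len(probs), len(probs[0]))  # one 3-class probability vector per residue
```

Because the weights are untrained, the probabilities themselves carry no meaning; the sketch only shows how each output sees context from both ends of the chain.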

The BRNN is used to classify protein secondary structure as alpha-helix, beta-sheet, or coil. It consists of an input layer, a hidden layer, and an output layer. The protein sequences are fed into the input layer, and the protein secondary structure is encoded in the output layer as follows:

(1, 0) = alpha-helix; (0, 1) = beta-sheet; (0, 0) = coil
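As a small illustration, the two-unit encoding above can be written as a lookup table. The one-letter labels `H`, `E`, and `C`, the function names, and the 0.5 threshold used to turn real-valued activations back into a class are assumptions made for this sketch; the text does not state how the outputs are decoded.

```python
# Two-unit target encoding from the text: (helix unit, sheet unit).
ENCODING = {"H": (1, 0), "E": (0, 1), "C": (0, 0)}


def encode(labels):
    """Map a string of H/E/C labels to the list of target vectors."""
    return [ENCODING[label] for label in labels]


def decode(outputs, threshold=0.5):
    """Map pairs of activations back to H/E/C (hypothetical decoding rule)."""
    labels = []
    for helix, sheet in outputs:
        if max(helix, sheet) < threshold:
            labels.append("C")  # neither unit is active: coil
        else:
            labels.append("H" if helix >= sheet else "E")
    return "".join(labels)


print(encode("HHEC"))
print(decode([(0.9, 0.1), (0.2, 0.7), (0.3, 0.2)]))  # HEC
```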

The input layer (John et al., 2006) is defined as follows:

$$I_k = \sum_i W_{ik} Y_i + b_k, \tag{4}$$

where $I_k$ is the sum of all the inputs to unit $k$, $W_{ik}$ is the connection strength between units $i$ and $k$, $Y_i$ is the activity of unit $i$, and $b_k$ is the bias of unit $k$. The index $i$ runs over the inputs derived from the protein sequence and $k$ runs over the outputs. The output layer (John et al., 2006) is defined as follows:

$$O_k = \frac{1}{1 + e^{-X_k}}, \tag{5}$$

where $X_k$ is a real number between -8 and 8, a range that was determined experimentally to be the best, and $k$ indexes the outputs obtained from the protein sequence.
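Eqs. (4) and (5) can be checked numerically with a few lines of Python. The weights, activities, and bias below are made-up numbers, the function names are hypothetical, and treating the argument of the logistic function as the unit input from Eq. (4) is an assumption of this sketch.

```python
import math


def unit_input(weights_k, activities, bias_k):
    """Eq. (4): I_k = sum_i W_ik * Y_i + b_k for a single unit k."""
    return sum(w * y for w, y in zip(weights_k, activities)) + bias_k


def logistic(x):
    """Eq. (5): O_k = 1 / (1 + exp(-X_k))."""
    return 1.0 / (1.0 + math.exp(-x))


# Made-up weights, activities, and bias for one unit.
i_k = unit_input([0.5, -1.0, 2.0], [1.0, 0.0, 1.0], bias_k=-0.5)
print(i_k)            # 2.0
print(logistic(0.0))  # 0.5

# At the ends of the [-8, 8] range the output is already close to 0 or 1.
for x in (-8.0, 8.0):
    print(f"{x:+.0f} -> {logistic(x):.4f}")  # roughly 0.0003 and 0.9997
```

One reading of the [-8, 8] range is that the logistic function is essentially saturated beyond it, so larger magnitudes change the output very little.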

The alpha-helix measure is divided into two types: amphipathic helices and hydrophobic helices. To predict an amphipathic helix region, the Helical Wheel Representation (HWR; Renaund and McConkey, 2005) is applied. The HWR predicts the residues involved, based on the interaction between the solvent and the side chains of the protein sequence in amphipathic helices. The scores for amphipathic and hydrophobic helices are then merged to predict the alpha-helix regions of the protein sequence. The beta-sheets are assigned using Kabsch and Sander's program (Kabsch and Sander, 1983). Extended beta-sheet segments are positioned and connected through inter-backbone H-bonds according to the Pauling pairing rules (Pauling and Corey, 1951). A residue that forms two H-bonds, or that is surrounded by two H-bonds in the sheet, is defined as beta-sheet (E). If only one amino acid fulfils the criteria, it is called a beta-bridge (B). Residues that are neither alpha-helices nor beta-sheets are classified as coils.
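The idea behind a helical wheel can be illustrated with a short Python sketch. The text does not specify how the HWR score is computed, so this example swaps in a hydrophobic-moment calculation as a stand-in: residues are placed 100° apart around the helix axis (3.6 residues per turn) and weighted by Kyte-Doolittle hydropathy values. The scale, the function names, and the two test segments are assumptions for illustration only.

```python
import math

# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def wheel_angles(n, step_deg=100.0):
    """Angular position of each residue on the helical wheel, in degrees."""
    return [(i * step_deg) % 360.0 for i in range(n)]


def hydrophobic_moment(segment, step_deg=100.0):
    """Mean hydrophobic moment of a segment projected onto the wheel."""
    sum_x = sum_y = 0.0
    for i, residue in enumerate(segment):
        angle = math.radians(i * step_deg)
        sum_x += KD[residue] * math.cos(angle)
        sum_y += KD[residue] * math.sin(angle)
    return math.hypot(sum_x, sum_y) / len(segment)


print(wheel_angles(4))  # [0.0, 100.0, 200.0, 300.0]

# Leucine on one face and lysine on the other: strongly amphipathic.
amphipathic = hydrophobic_moment("LKKLLKKLLKLLKKLLKK")
# Uniformly hydrophobic segment: no preferred face, moment close to zero.
uniform = hydrophobic_moment("L" * 18)
print(amphipathic > uniform)  # True
```

A high moment suggests an amphipathic helix with distinct polar and non-polar faces, while a uniformly hydrophobic segment gives a moment near zero even though its average hydropathy is high.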


Fig. 2. BRNN architecture with left (forward) and right (backward) contexts associated with two recurrent networks ("tools"). The left and right contexts are produced by two similar recurrent networks, which can intuitively be thought of as two "tools" that are shifted along the protein chain.

Fig. 3. An example of secondary structure prediction using BRNN.
