**4. Features extraction**

Feature extraction in BRNN-SVM is important for obtaining protein domain information from the predicted secondary structure. The secondary structure information is used to compute the change at each protein sequence position that constitutes part of a protein domain boundary. This information is believed to reflect the structural properties of the protein that are informative about its domain structure, and it is used to detect the protein domain boundaries. The features, illustrated in Figs. 4–9, are entropy, protein sequence termination, correlation, contact profile, physio-chemical properties and intron-exon information.

BRNN-SVM: Increasing the Strength of


Domain Signal to Improve Protein Domain Prediction Accuracy 139


Fig. 4. An example of entropy calculation.

The effective entropy measure takes into account the similarity of amino acids. An evolutionary pressure is used to calculate the evolutionary span (Nagaranjan and Yona, 2004) defined as:

$$Span(\mathbf{x}) = \frac{2}{t(t-1)} \sum_{p=1}^{t} \sum_{q>p} s(j, k), \tag{6}$$

where $s(j, k)$ is $s(a_{px}, a_{qx})$. $Span(\mathbf{x})$ is used to compare the sum of pairwise similarity of amino acids. The *x* is an alignment from the multiple sequence alignment and *t* is the number of protein sequences that have participated in *x*. $a_{px}$ and $a_{qx}$ represent the amino acids in position *x*. $s(j, k)$ is the similarity score of amino acids, where *j* and *k* refer to the scoring matrix BLOSUM50 (Henikoff and Henikoff, 1992).
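As a rough illustration of Eq. (6), the sketch below averages the pairwise similarity over all residue pairs in one alignment column. The function names and the identity-based `toy_similarity` are assumptions made for this example; a real implementation would look up BLOSUM50 scores instead.

```python
from itertools import combinations


def toy_similarity(a: str, b: str) -> float:
    """Placeholder for a BLOSUM50 lookup: 1.0 for identical residues, else 0.0."""
    return 1.0 if a == b else 0.0


def evolutionary_span(column, similarity=toy_similarity):
    """Average pairwise similarity over all residue pairs (p < q) in one column.

    `column` holds one residue per sequence, so t = len(column).
    """
    t = len(column)
    if t < 2:
        raise ValueError("need at least two sequences in the column")
    # combinations() visits each unordered pair once, matching the sum over q > p.
    total = sum(similarity(a, b) for a, b in combinations(column, 2))
    return 2.0 * total / (t * (t - 1))


if __name__ == "__main__":
    # Hypothetical column taken from four aligned sequences.
    print(evolutionary_span(["A", "A", "G", "A"]))
```

The factor 2 / (t(t-1)) is the reciprocal of the number of unordered pairs, so with a similarity bounded in [0, 1] the result is also bounded in [0, 1]. With BLOSUM50 scores the value would instead lie within the range of the matrix entries.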

Fig. 5. An example of protein sequence termination calculation.

In a multiple sequence alignment, the protein sequence termination is not necessarily displayed. The left and right protein sequence termination scores are calculated for each protein sequence with an e-value larger than 0. These scores are then used to identify a strong signal of the protein domain boundary. The left and right protein sequence termination scores (Menachem and Chen, 2008) are defined as:

$$T_{\text{seq\_termination}} = \log(\mathcal{T}_1 \cdot \mathcal{T}_2 \cdot \dots \cdot \mathcal{T}_n), \tag{7}$$

where $\mathcal{T}_n$ is the e-value of the *n*-th protein sequence.
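A minimal sketch of Eq. (7) follows, under two assumptions not stated in the text: the logarithm is taken as the natural logarithm, and the product is evaluated as a sum of logarithms so that very small e-values do not underflow. The function name and the sample e-values are hypothetical.

```python
import math


def termination_score(e_values):
    """Return log(e_1 * e_2 * ... * e_n), computed as a sum of logs.

    Only e-values larger than 0 are accepted, as required in the text.
    """
    if not e_values:
        raise ValueError("at least one e-value is required")
    if any(e <= 0 for e in e_values):
        raise ValueError("e-values must be larger than 0")
    return sum(math.log(e) for e in e_values)


if __name__ == "__main__":
    # Hypothetical e-values of sequences terminating at one side of a position.
    print(termination_score([1e-30, 1e-12, 1e-5]))
```

Summing logarithms is mathematically identical to taking the logarithm of the product, but it stays finite in cases where the plain product would round to zero in floating point.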


Fig. 6. An example of correlation calculation.

Two random protein sequences are positively correlated if high values of one are likely to be associated with high values of the other. Possible correlation values range from 0 to 1. A zero correlation indicates that there is no relationship between the sequences. A correlation of 1 indicates a perfect positive correlation, meaning that both sequences move in the same direction together. The correlation of amino acids with protein secondary structure information is used to predict the protein structure. It is also important to understand the force that causes the flexibility of a protein structure. Every protein sequence in a multiple sequence alignment contains information on structural flexibility. To find a position that is more flexible in a protein sequence, indel entropy (Zou et al., 2008) based on the distribution of protein sequence lengths is used:

$$E_{\mathcal{S}}(\mathbf{B}) = -\sum_{p} \beta_{p} \log \beta_{p}, \tag{8}$$

where $\beta_p$ corresponds to the various indel lengths *p* seen at a position.
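The entropy in Eq. (8) can be sketched as below, assuming that the term under the sum is the relative frequency of each indel length observed at a position (the text does not spell this out) and that the natural logarithm is used. The function name and the sample inputs are hypothetical.

```python
import math
from collections import Counter


def indel_entropy(indel_lengths):
    """Shannon entropy of the indel lengths observed at one alignment position.

    `indel_lengths` holds one indel length per sequence; the relative
    frequency of each distinct length plays the role of the summed term.
    """
    total = len(indel_lengths)
    if total == 0:
        raise ValueError("at least one indel length is required")
    counts = Counter(indel_lengths)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


if __name__ == "__main__":
    conserved = [0, 0, 0, 0]   # every sequence shows the same indel length
    variable = [0, 1, 2, 3]    # every sequence shows a different indel length
    print(indel_entropy(conserved), indel_entropy(variable))
```

Under these assumptions a position where all sequences share one indel length has zero entropy, while a position with many different indel lengths scores higher, which matches the idea of using the measure to flag more flexible positions.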

Fig. 7. An example of contact profile calculation.

The predicted contact profile of a protein sequence is obtained from the structural flexibility information. Then, the number of pairwise contacts is counted for each protein sequence. The contact profile between residues in a protein sequence is predicted based on correlated mutations. Correlated mutations (Pazos et al., 1997) between two columns *x* and *y* are defined as:

$$Corr_{m}(x, y) = \frac{1}{t^2} \sum_{p=1}^{t} \sum_{q=1}^{t} \frac{\left(s(a_{px}, a_{qx}) - \langle s_x \rangle\right)\left(s(a_{py}, a_{qy}) - \langle s_y \rangle\right)}{\sigma_x \cdot \sigma_y}, \tag{9}$$

where $a_{px}$ and $a_{qx}$ represent the amino acids in position *x*, and $a_{py}$ and $a_{qy}$ represent the amino acids in position *y*. $s(a_{px}, a_{qx})$ and $s(a_{py}, a_{qy})$ are the similarity scores of amino acids, and $a_{px}$, $a_{qx}$, $a_{py}$ and $a_{qy}$ refer to the scoring matrix BLOSUM50. $\langle s_x \rangle$ and $\langle s_y \rangle$ are the average similarities of amino acids in positions *x* and *y*. $\sigma_x$ and $\sigma_y$ are standard deviations and *t* is the number of protein sequences in the columns.

Fig. 8. An example of physio-chemical properties calculation.

Physio-chemical properties are information that is used to predict protein domain boundaries. Hydrophobicity is used to display the distribution of protein sequence residues, which in turn is used for the detection of physio-chemical properties. In BRNN-SVM, the scores of hydrophobicity and molecular weight (Black and Mould, 1991) are used to predict physio-chemical properties in a protein sequence. The average hydrophobicity and molecular weight for each measure of protein sequence are calculated to determine the physio-chemical properties that affect protein domain boundary detection.

Fig. 9. An example of intron-exon calculation.

The intron-exon data contain the intron-exon structure at the deoxyribonucleic acid (DNA) level, which is related to protein domain boundaries in which folded protein domain boundaries exist independently. Each protein domain defines the intron-exon position. The intron-exon data are taken from the EID database (Saxonov et al., 2000). Then, each protein sequence is compared with the database and the gapless matching protein sequence is kept. The similarity of the protein sequence is calculated in order to define the exon boundary, using an equation of the same form as the sequence termination score. Finally, the exon termination score (Saxonov et al., 2000) is calculated as follows:

$$E_{\text{Exon\_termination}} = \log(\mathcal{E}_1 \cdot \mathcal{E}_2 \cdot \dots \cdot \mathcal{E}_n), \tag{10}$$

where $\mathcal{E}_n$ is the e-value of the *n*-th protein sequence. After that, the average of the measure scores from the feature extraction phase is calculated in order to generate the features vector, which is used as input to the SVM as follows:

$$\sum \frac{(\text{Score\_of\_measures})}{n}, \tag{11}$$

where *Score\_of\_measures* is obtained from the feature extraction scores (entropy, protein sequence termination, correlation, contact profile, physio-chemical properties and intron-exon information) and *n* refers to the number of feature extraction measurements, which could be seven. Fig. 10 shows an example of the features vector calculation.

Fig. 10. An example of features vector calculation.

**5. Domain detection by SVM**

SVM is a machine learning technique based on statistical learning theory that trains multiple functions, such as polynomial functions, radial basis functions and splines, to form a single classifier. The SVM is applied to identify the positions of protein domain boundaries. The SVM works by: (1) mapping the input vector into a feature space that is relevant to the kernel function; and (2) seeking an optimized linear division from multiple *n*-separated hyperplanes, where *n* is the number of classes of protein sequences in the dataset. The input vector (Dong et al., 2003) is defined as follows:
