**5. Domain detection by SVM**

SVM is a machine learning technique based on statistical learning theory that trains multiple functions, such as polynomial functions, radial basis functions and splines, to form a single classifier. Here, the SVM is applied to identify the positions of protein domain boundaries. The SVM works by (1) mapping the input vector into a feature space determined by the kernel function, and (2) seeking an optimal linear separation among the *n* separating hyperplanes, where *n* is the number of classes of protein sequence in the dataset. The input vector (Dong et al., 2003) is defined as follows:

BRNN-SVM: Increasing the Strength of


Domain Signal to Improve Protein Domain Prediction Accuracy 143


$$l_s \in \{+1, -1\}, \tag{12}$$

where $I_s$ is the input space with corresponding predefined labels (Dong et al., 2003):

$$\{y_i \in I_s \mid i = 1, \dots, n\}, \tag{13}$$

where +1 and -1 stand for the two classes. The SVM is trained with the Radial Basis Function (RBF) kernel, a function that is often used in pattern recognition. The parameters of SVM training are $\sigma^2$, the RBF kernel smoothing parameter, and *C*, the learning variable that trades off between under- and over-generalization. The RBF (Zou et al., 2008) is defined as follows:

$$K(\vec{y}_i, \vec{y}_j) = \exp\left(\frac{-r \lVert \vec{y}_i - \vec{y}_j \rVert^2}{2\sigma^2}\right), \tag{14}$$

where $\vec{y}_i$ is the label and $\vec{y}_j$ is the input vector. The input vector is the centre of the RBF, and $\sigma^2$ determines the area of influence this input vector has over the feature space. A larger value of $\sigma^2$ gives a smoother decision surface and a more regular decision boundary, since an RBF with large $\sigma^2$ allows an input vector to have a strong influence over a larger area.
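The effect of the smoothing parameter can be checked numerically. The following is a minimal sketch of Eq. (14) in plain Python, not the implementation used in the study. The text does not define the factor $r$, so it is assumed here to default to 1, and the two feature vectors are hypothetical.

```python
import math

def rbf_kernel(y_i, y_j, sigma2, r=1.0):
    """Evaluate Eq. (14): exp(-r * ||y_i - y_j||^2 / (2 * sigma2)).

    The factor r is not defined in the text; r = 1.0 is an assumption.
    """
    if sigma2 <= 0:
        raise ValueError("sigma2 must be positive")
    sq_dist = sum((a - b) ** 2 for a, b in zip(y_i, y_j))
    return math.exp(-r * sq_dist / (2.0 * sigma2))

# Two hypothetical feature vectors at squared distance 4.0.
a = [0.0, 0.0]
b = [2.0, 0.0]
k_narrow = rbf_kernel(a, b, sigma2=0.5)  # small sigma^2: influence decays quickly
k_wide = rbf_kernel(a, b, sigma2=8.0)    # large sigma^2: influence reaches further
```

At the same distance, the kernel value under the larger $\sigma^2$ is closer to 1, which is the "larger area of influence" described above.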

The best pair of parameters *C* and $\sigma^2$ is searched for via a *k*-fold cross-validation scheme to guard against biased tuning. In this study, *k* = 10 is applied, where the protein sequences are split into *k* subsets of approximately equal size. The best combination of *C* and $\sigma^2$ obtained from the optimization process is used to train the final SVM classifier on the entire training set. The SVM classifier is subsequently used to predict the testing dataset. The SVM detects the protein domain boundaries based on scores that correspond to domain information or different-domain information. The SVM classifies each protein as single-domain, two-domain, or multiple-domain. Various quantitative metrics were obtained to measure the effectiveness of the BRNN-SVM: true positives (TP), the number of correctly classified protein domains; false positives (FP), the number of incorrectly classified protein domains; true negatives (TN), the number of correctly classified non-domains; and false negatives (FN), the number of incorrectly classified non-domains.
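The 10-fold partitioning described above can be sketched as follows. This is an illustrative sketch only: the function names, the fixed seed, and the use of the training-set size of 3,121 as the number of items are assumptions, and the step that scores each candidate pair of *C* and $\sigma^2$ on the held-out fold is left out because it depends on the SVM implementation.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k folds of near-equal size."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold f takes every k-th index starting at f, so sizes differ by at most 1.
    return [indices[f::k] for f in range(k)]

def cross_validation_splits(n_items, k=10, seed=0):
    """Yield (train, validation) index lists, one pair per held-out fold."""
    folds = k_fold_indices(n_items, k, seed)
    for held_out in range(k):
        validation = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, validation

# Assumed item count: the 3,121 training sequences mentioned in the text.
splits = list(cross_validation_splits(n_items=3121, k=10))
fold_sizes = [len(validation) for _, validation in splits]
```

Each candidate parameter pair would be trained on `train` and scored on `validation` for all ten splits, and the pair with the best average score would then be used to train on the whole training set.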

**6. Dataset and evaluation measure**

To test the BRNN-SVM, seed protein sequences obtained from the PDB database (Berman et al., 2000) are selected together with their corresponding domain structures in the SCOP database (Andreeva et al., 2008), version 1.73. The SCOP 1.73 subset with less than 40% sequence identity contains 9,536 protein sequences. Short protein sequences of fewer than 40 amino acids are removed. The remaining protein sequences are then searched against the NR database (Henikoff et al., 1999) using BLAST, and protein sequences that have more than 20 hits are kept. This leaves 6,242 protein sequences. The dataset is divided into training and testing sets. The training set is used for optimizing the SVM parameters and for training the SVM classifier to predict unseen protein domain boundaries. The testing set is used for evaluating the performance of the SVM. The dataset is randomly split into training and testing sets of equal size, 3,121 protein sequences each. The process of generating the dataset is shown in Fig. 11.
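The filtering and splitting steps can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the pipeline used in the study: the record layout (`id`, `sequence`, `blast_hits`), the function names and the toy records are all hypothetical, and the BLAST search itself is assumed to have been run beforehand.

```python
import random

MIN_LENGTH = 40  # sequences shorter than 40 residues are removed
MIN_HITS = 20    # sequences need more than 20 BLAST hits to be kept

def filter_sequences(records):
    """Keep records that pass the length and BLAST-hit filters."""
    return [
        rec for rec in records
        if len(rec["sequence"]) >= MIN_LENGTH and rec["blast_hits"] > MIN_HITS
    ]

def split_in_half(records, seed=0):
    """Randomly split records into training and testing sets of equal size."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical toy records; real input would come from SCOP 1.73 and a BLAST run.
records = [
    {"id": "seq_a", "sequence": "M" * 120, "blast_hits": 35},
    {"id": "seq_b", "sequence": "M" * 30, "blast_hits": 50},  # too short
    {"id": "seq_c", "sequence": "M" * 80, "blast_hits": 12},  # too few hits
    {"id": "seq_d", "sequence": "M" * 40, "blast_hits": 21},
]
kept = filter_sequences(records)
train, test = split_in_half(kept)
```

Applied to the 6,242 retained sequences, an even split of this kind gives the 3,121 sequences per set stated above.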


Fig. 11. Dataset generation process.

Based on the classification output of the SVM, a series of statistical metrics was computed to measure the effectiveness of the BRNN-SVM. Sensitivity (SN; Zaki et al., 2006) and specificity (SP; Zaki et al., 2006) indicate the ability of the prediction system to correctly classify protein domains and non-domains, respectively. SN and SP are defined as follows:

$$SN = \frac{TP}{TP + FN} \times 100, \tag{15}$$

$$SP = \frac{TP}{TP + FP} \times 100. \tag{16}$$

To provide an indication of the overall performance of the system, we computed accuracy (AC; Zaki et al., 2006), the percentage of correctly predicted protein domains. The AC is defined as follows:

$$AC = \frac{TP + TN}{TP + FN + TN + FP} \times 100. \tag{17}$$
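As a worked example, the three measures can be computed directly from the confusion counts. This is a minimal sketch with hypothetical counts chosen to make the arithmetic easy to follow. SP follows Eq. (16) as printed, a ratio that is often called precision elsewhere, and AC is taken as the share of all predictions that are correct, (TP + TN) divided by the total.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Return SN, SP and AC as percentages from confusion counts."""
    total = tp + fn + tn + fp
    if tp + fn == 0 or tp + fp == 0 or total == 0:
        raise ValueError("counts leave a metric undefined")
    return {
        "SN": 100.0 * tp / (tp + fn),     # Eq. (15)
        "SP": 100.0 * tp / (tp + fp),     # Eq. (16), as printed in the text
        "AC": 100.0 * (tp + tn) / total,  # share of all correct predictions
    }

# Hypothetical confusion counts, not results from the study.
metrics = evaluation_metrics(tp=80, fp=20, tn=90, fn=10)
```

With these counts, SN is 80/90 of 100, SP is 80/100 of 100, and AC is 170/200 of 100.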

**7. Computational results**

The performance of the BRNN-SVM is tested and compared with that of other methods, grouped by approach: similarity and multiple sequence alignment (Biozon and KemaDom), known protein structure (AutoSCOP and DOMpro), dimensional structure (GlobPlot, Mateo, and Dompred-DPS), comparative model (HMMPfam and HMMSmart), and sequence alone (Armadillo and SBASE). The properties of the protein sequence are derived from the protein secondary structure using several measures: entropy, correlation, protein sequence termination, contact profile, physico-chemical properties and intron-exon boundaries. The protein secondary structure generates a strong signal of protein domain boundaries and is used to locate the protein domain regions through the following procedure. First, the BRNN-SVM searches large protein sequences against the NR database to generate multiple sequence alignments. Second, the secondary structure is predicted for each protein sequence using the BRNN. Third, scores from the measures are calculated as input for SVM training. Finally, the results generated by the SVM are evaluated. This evaluation provides a clear understanding of the strengths and weaknesses of the algorithm.

Table 1. Performance comparison between BRNN-SVM and other protein domain prediction methods.

| Category | Method | Single-Domain SN | Single-Domain SP | Two-Domain SN | Two-Domain SP | Multiple-Domain SN | Multiple-Domain SP | AC |
|---|---|---|---|---|---|---|---|---|
| | BRNN-SVM | 0.87 | 0.79 | 0.73 | 0.76 | 0.81 | 0.79 | 0.83 |
| Similarity and multiple sequence alignment | Biozon | 0.27 | 0.93 | 0.33 | 0.23 | 0.21 | 0.35 | 0.38 |
| | KemaDom | 0.82 | 0.76 | 0.70 | 0.73 | 0.78 | 0.76 | 0.79 |
| Known protein structure | AutoSCOP | 0.80 | 0.65 | 0.62 | 0.57 | 0.73 | 0.72 | 0.69 |
| | DOMpro | 0.85 | 0.80 | 0.43 | 0.55 | 0.79 | 0.73 | 0.71 |
| Dimensional structure | GlobPlot | 0.78 | 0.74 | 0.32 | 0.58 | 0.59 | 0.67 | 0.69 |
| | Mateo | 0.57 | 0.74 | 0.21 | 0.25 | 0.47 | 0.53 | 0.45 |
| | Dompred-DPS | 0.55 | 0.73 | 0.52 | 0.43 | 0.67 | 0.66 | 0.62 |
| Comparative model | HMMPfam | 0.65 | 0.60 | 0.53 | 0.59 | 0.35 | 0.33 | 0.62 |
| | HMMSmart | 0.77 | 0.69 | 0.66 | 0.63 | 0.23 | 0.20 | 0.71 |
| Sequence alone | SBASE | 0.86 | 0.77 | 0.69 | 0.74 | 0.76 | 0.76 | 0.80 |
| | Armadillo | 0.31 | 0.89 | 0.29 | 0.21 | 0.17 | 0.35 | 0.27 |

The datasets obtained from SCOP 1.73, defined in the previous section, are used to test and evaluate the BRNN-SVM and the other protein domain prediction methods. The prediction accuracy, sensitivity and specificity for single-domain, two-domain and multiple-domain proteins are presented in Table 1 and Fig. 12-15. Predicting two-domain or multiple-domain proteins is more difficult than predicting single-domain proteins. Higher sensitivity and specificity represent better performance, and priority is given to sensitivity, since sensitivity measures the proportion of actual positives that are correctly identified. The BRNN-SVM achieved the highest sensitivity: 87% for single-domain, 73% for two-domain and 81% for multiple-domain. It also achieved the highest specificity for two-domain (76%) and multiple-domain (79%). The BRNN-SVM reached an accuracy of 83%, compared with 79% for KemaDom and 80% for SBASE.
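The ranking claims above can be cross-checked against the values in Table 1. The sketch below copies the table into a dictionary and looks up the best-scoring method in each column. The column names and the helper function are illustrative choices, not part of the original evaluation.

```python
# Table 1 values: (SN, SP) for single-, two- and multiple-domain, then AC.
TABLE_1 = {
    "BRNN-SVM":    (0.87, 0.79, 0.73, 0.76, 0.81, 0.79, 0.83),
    "Biozon":      (0.27, 0.93, 0.33, 0.23, 0.21, 0.35, 0.38),
    "KemaDom":     (0.82, 0.76, 0.70, 0.73, 0.78, 0.76, 0.79),
    "AutoSCOP":    (0.80, 0.65, 0.62, 0.57, 0.73, 0.72, 0.69),
    "DOMpro":      (0.85, 0.80, 0.43, 0.55, 0.79, 0.73, 0.71),
    "GlobPlot":    (0.78, 0.74, 0.32, 0.58, 0.59, 0.67, 0.69),
    "Mateo":       (0.57, 0.74, 0.21, 0.25, 0.47, 0.53, 0.45),
    "Dompred-DPS": (0.55, 0.73, 0.52, 0.43, 0.67, 0.66, 0.62),
    "HMMPfam":     (0.65, 0.60, 0.53, 0.59, 0.35, 0.33, 0.62),
    "HMMSmart":    (0.77, 0.69, 0.66, 0.63, 0.23, 0.20, 0.71),
    "SBASE":       (0.86, 0.77, 0.69, 0.74, 0.76, 0.76, 0.80),
    "Armadillo":   (0.31, 0.89, 0.29, 0.21, 0.17, 0.35, 0.27),
}
COLUMNS = (
    "SN single", "SP single", "SN two", "SP two",
    "SN multiple", "SP multiple", "AC",
)

def best_per_column(table):
    """Map each column name to the method with the highest value."""
    best = {}
    for col, name in enumerate(COLUMNS):
        best[name] = max(table, key=lambda method: table[method][col])
    return best

best = best_per_column(TABLE_1)
```

In this lookup the BRNN-SVM leads every column except single-domain specificity, where Biozon has the highest value.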

The properties of the protein sequence give a strong signal for assigning protein domain boundaries because the predicted protein secondary structure reflects long-range interactions between amino acids. Protein secondary structure prediction based on the BRNN involves an exchange of information between an input and an output sequence of variable length. The BRNN is based on forward, backward and hidden Markov chains that transmit information in both directions along the sequence between the input and output. This shows that interaction exists in protein folding and plays an important role in the formation of protein secondary structure. This information affects the prediction of protein domain boundaries. The BRNN-SVM relies on the scores of the measures to detect the protein domain region in order to classify a domain for the protein sequence.

However, the specificity for single-domain prediction is 79%, which is 14 percentage points lower than Biozon and 10 percentage points lower than Armadillo. The reason is that the BRNN-SVM classifies a protein sequence with no predicted domain boundaries as single-domain, so the single domain is taken to span the sequence from start to end. The situation is aggravated when the protein sequence is too long. To solve this problem, the protein sequence can be split into protein sub-sequences before predicting the protein domain (Kalsum et al., 2009).
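The text does not specify how a long sequence would be split, so the following is only one possible scheme, shown for illustration: fixed-length windows with overlap. The function name, the window length of 200 and the overlap of 50 are hypothetical choices and are not taken from Kalsum et al. (2009).

```python
def split_into_windows(sequence, window=200, overlap=50):
    """Cut a sequence into overlapping fixed-length (start, subsequence) pieces.

    window and overlap are hypothetical values chosen for illustration.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    pieces = []
    start = 0
    while True:
        pieces.append((start, sequence[start:start + window]))
        if start + window >= len(sequence):
            break  # this piece already reaches the end of the sequence
        start += step
    return pieces

# A hypothetical 500-residue sequence.
pieces = split_into_windows("A" * 500, window=200, overlap=50)
short = split_into_windows("A" * 50, window=200, overlap=50)
```

Keeping the start offset with each piece lets any boundary predicted inside a sub-sequence be mapped back to a position in the full sequence.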



Fig. 12. Performance comparison between BRNN-SVM and other protein domain prediction methods on single-domain. The best sensitivity is BRNN-SVM with 87% and the best specificity is Armadillo with 89% since the BRNN-SVM classifies the protein sequence with no predicted protein domain boundaries as a single-domain.

144 Recurrent Neural Networks and Soft Computing


Fig. 13. Performance comparison between BRNN-SVM and other protein domain prediction methods on two-domain. The best performance for two-domain prediction is BRNN-SVM with 73% sensitivity and 76% specificity, since the secondary structure information gives a strong signal for assigning protein boundaries: the predicted protein secondary structure reflects long-range interactions between amino acids.

Fig. 14. Performance comparison between BRNN-SVM and other protein domain prediction methods on multiple-domain. The best performance for multiple-domain prediction is BRNN-SVM with 81% sensitivity and 79% specificity, since the BRNN passes information between an input and an output sequence of variable length. This shows that interaction exists in protein folding and plays an important role in the formation of protein secondary structure.

Fig. 15. Performance comparison between BRNN-SVM and other protein domain prediction methods on accuracy. The best accuracy of protein domain prediction is BRNN-SVM with 83%, since the protein secondary structure is predicted using the BRNN and the secondary structure information obtained by feature extraction increases the protein domain signal.

**8. Conclusion**

An algorithm named BRNN-SVM has been developed to solve the problem of weak domain signal. The algorithm begins by searching for seed protein sequences from SCOP 1.73 as the dataset. The dataset is split into training and testing sets. Then, multiple sequence alignment is performed prior to the prediction of protein secondary structure using the BRNN. Several measures, namely entropy, protein sequence termination, correlation, contact profile, physico-chemical properties and intron-exon data, are used to increase the strength of the domain signal from the protein secondary structure. The SVM classifies the prediction into single-domain, two-domain and multiple-domain. Lastly, the results from the SVM are evaluated in terms of sensitivity and specificity. The BRNN is based on forward, backward and hidden Markov chains that transmit information in both directions along the sequence between the input and output. Therefore, it increases the accuracy of protein secondary structure prediction as well as providing a strong domain signal from this protein secondary structure based on the generated measures. This is believed to be the reason why BRNN-SVM is a good method for protein domain prediction, especially for two-domain and multiple-domain proteins.

**9. Acknowledgements**

We would like to express our appreciation to the reviewers of this paper for their valuable suggestions. This work is supported by the Malaysian Ministry of Science, Technology and Innovation (MOSTI) under Grant No. 02-01-06-SF0230.
