**3.1. Performance measures for the evaluation of promoter classification programs**

A classification model (or classifier) is a mapping from instances to predicted classes (Fawcett, 2006). Promoter prediction is a binary classification problem, since each input sequence is assigned to exactly one of two non-overlapping classes (Sokolova & Lapalme, 2009). The performance of a classifier during testing is assessed by counting the correct and incorrect classifications for each class (Bradley, 1997). Each classification therefore has one of four possible outcomes (Bradley, 1997; Fawcett, 2006; Sokolova & Lapalme, 2009):

i. TP: promoter sequences classified as promoters (true positive);

ii. TN: non-promoter sequences classified as non-promoters (true negative);

iii. FP: non-promoter sequences classified as promoters (false positive);

iv. FN: promoter sequences classified as non-promoters (false negative).

These counts are normally displayed in a two-by-two confusion matrix (Table 2). A confusion matrix is a form of contingency table showing the differences between the true and predicted classes for a set of labeled examples (Bradley, 1997).

Bacterial Promoter Features Description and Their Application on *E. coli in silico* Prediction and Recognition Approaches 247

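**Table 2.** Confusion matrix for classification results

| Data class | Classified as promoter | Classified as non-promoter |
|---|---|---|
| Promoter | True positive (TP) | False negative (FN) |
| Non-promoter | False positive (FP) | True negative (TN) |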

The confusion matrix contains all the information about a classifier's performance, and it is the basis for many common metrics (Bradley, 1997; Fawcett, 2006). The most often used performance measures are accuracy, sensitivity, specificity, precision and receiver operating characteristic (ROC) graphs. The formulas for the first four are presented in equations 1 to 4. Accuracy gives the overall effectiveness of a classifier. Sensitivity is the proportion of observed promoter sequences that are predicted as such, and specificity is the proportion of non-promoter sequences that the classifier identifies as non-promoters. Precision is the class agreement of the sequences labeled as promoters by the classifier, that is, the proportion of predicted promoters that are true promoters (Sokolova & Lapalme, 2009). A reliable summary of the performance of a promoter prediction program is the harmonic mean of sensitivity and specificity. A ROC graph is a technique for visualizing, organizing and selecting classifiers based on their performance (Figure 2). It is a two-dimensional graph in which the TP rate is plotted on the Y axis and the FP rate is plotted on the X axis. A common method associated with the ROC graph is to calculate the area under the ROC curve, abbreviated AUC. The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Further information about the ROC curve can be found in Fawcett (2006).

**Figure 2.** An example of ROC curve obtained from NN simulations results of *E. coli* promoter prediction and recognition (de Avila e Silva et al., 2011b)
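The probabilistic interpretation of the AUC can be checked directly on a small set of classifier scores. The following Python sketch counts, over all positive/negative pairs, how often the positive instance receives the higher score. The function name, the toy scores and the convention of counting ties as one half are assumptions made for illustration; they are not taken from the cited works.

```python
def auc_by_ranking(positive_scores, negative_scores):
    """Estimate the AUC as the fraction of (positive, negative) pairs
    in which the positive instance is ranked higher.
    Ties are counted as 0.5 (an assumed, commonly used convention)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy classifier outputs (hypothetical): higher means "more promoter-like".
promoter_scores = [0.9, 0.8, 0.4]
non_promoter_scores = [0.7, 0.3, 0.2]

auc = auc_by_ranking(promoter_scores, non_promoter_scores)
print(f"AUC = {auc:.3f}")
```

In this toy case eight of the nine pairs are ordered correctly, so the estimate is 8/9. The pairwise loop is quadratic in the number of instances, which is acceptable for a demonstration but not for large data sets.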

$$Accuracy = \frac{TP + TN}{TN + TP + FN + FP} \tag{1}$$

$$Specificity = \frac{TN}{TN + FP} \tag{2}$$

$$\text{Sensitivity or recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{3}$$

$$Precision = \frac{TP}{TP + FP} \tag{4}$$
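As an illustration, equations 1 to 4 and the harmonic mean of sensitivity and specificity can be computed directly from the four confusion-matrix counts. The sketch below is a minimal Python example; the function name and the counts are hypothetical, and no guard against empty classes (zero denominators) is included.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the measures of equations 1 to 4 from confusion-matrix counts.
    Assumes every denominator is non-zero."""
    sensitivity = tp / (tp + fn)   # equation 3 (recall)
    specificity = tn / (tn + fp)   # equation 2
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),   # equation 1
        "specificity": specificity,
        "sensitivity": sensitivity,
        "precision": tp / (tp + fp),                   # equation 4
        # Harmonic mean of sensitivity and specificity.
        "harmonic_mean": 2 * sensitivity * specificity
        / (sensitivity + specificity),
    }


# Hypothetical test set: 100 promoters and 100 non-promoters.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.85, the specificity 0.90 and the sensitivity 0.80. The harmonic mean is dominated by the lower of the two rates, so it penalizes classifiers that do well on only one class.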

#### **3.2. Position-weight matrices**

246 Bioinformatics



Consensus sequences have been used to predict promoters by simple pattern matching. These strategies for promoter identification are usually based on prior knowledge of some characterized sequences (Jacques et al., 2006). The first alignments of *E. coli* promoters were carried out by Hawley & McClure (1983), Galas et al. (1985) and Lisser & Margalit (1993). From those compilations, the promoter consensus motifs were established.

A more sophisticated approach based on alignment is the Position-Weight Matrix (PWM). In this two-dimensional array, each row represents one of the nucleotides A, T, C or G and each column represents a position of the analyzed motif. This widely accepted method is built by aligning examples of reference sequences, which allows the base preference at each position of the matrix to be estimated (Song et al., 2007). A weight is assigned to each base at each position in the promoter sequence, and the final score of a candidate sequence decreases as the candidate deviates from the reference matrix. Detailed information about the first implementations and the mathematical background can be found in Stormo (2000).
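The construction and use of a PWM can be sketched in a few lines of Python. The example below is a simplified illustration, not the implementation used in any of the cited works: it assumes log-odds weights against a uniform background of 0.25 per base and a pseudocount of 1, and the aligned hexamers are toy data. The matrix is stored as one dictionary per position (column) rather than as an explicit four-row array.

```python
import math

BASES = "ACGT"


def build_pwm(aligned_sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites.
    The pseudocount and uniform background are assumptions of this sketch."""
    length = len(aligned_sites[0])
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in aligned_sites]
        total = len(column) + pseudocount * len(BASES)
        pwm.append({
            base: math.log2(
                ((column.count(base) + pseudocount) / total) / background
            )
            for base in BASES
        })
    return pwm


def score(pwm, candidate):
    """Sum the position-specific weights of a candidate sequence."""
    return sum(column[base] for column, base in zip(pwm, candidate))


# Toy alignment of -10-like hexamers (hypothetical training data).
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pwm = build_pwm(sites)

print(score(pwm, "TATAAT"))  # close to the aligned examples
print(score(pwm, "GCGCGC"))  # far from the aligned examples
```

A candidate that matches the preferred base at every position obtains the highest score, and each deviation lowers the total, which is the behavior described above.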

Huerta & Collado-Vides (2003) used a two-stage PWM approach code-named Cover. This approach searches for conserved motifs using multiple sequence alignment methods and generates weight matrices for σ70-dependent promoter sequences. To select the best matrices, the authors added criteria such as the spacer between the -10 and -35 hexamers, the distance between the -10 region and the start codon, the distance between the -10 region and the TSS, statistical analysis and the matrix score. Although this approach reached a predictive capacity of 86%, the accuracy obtained was 53%, which indicates a high number of false positives.

Li & Lin (2006) proposed a variation of PWMs called the Position Correlation Scoring Matrix (PCSM). This approach considers the position-specific weight matrices at ten specific positions of the promoter. One PCSM was computed from the promoter training set and another from the non-promoter training set. To classify a new test sequence, it was scored against both the promoter (positive) and the non-promoter (negative) PCSM, and it was identified as a promoter only if the score from the positive PCSM was higher. The reported sensitivity was 91% and the specificity 81%. When the PCSM was applied to the whole genome, all 683 experimentally identified σ70-dependent promoter sequences were successfully predicted, and a further 1567 predictions were considered probable promoters.
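The decision rule of comparing a positive and a negative model can be illustrated with a short Python sketch. Note that this is not a PCSM: as a loudly stated simplification, each model here is a plain per-position log-frequency table that ignores correlations between positions, and the training sequences are toy data. Only the final comparison of the two scores mirrors the rule described above.

```python
import math


def position_log_freqs(seqs, pseudocount=1.0):
    """Per-position log-frequency table; a simplified stand-in for a PCSM."""
    length = len(seqs[0])
    table = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for seq in seqs:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        table.append({base: math.log(c / total) for base, c in counts.items()})
    return table


def score(table, seq):
    """Sum of log-frequencies of the observed base at each position."""
    return sum(column[base] for column, base in zip(table, seq))


def is_promoter(seq, positive_table, negative_table):
    """Label as promoter only if the positive model scores higher."""
    return score(positive_table, seq) > score(negative_table, seq)


# Hypothetical training sets.
positive = position_log_freqs(["TATAAT", "TATGAT", "TACAAT"])
negative = position_log_freqs(["GCGCGC", "CCGGCA", "GGCCGT"])

print(is_promoter("TATAAT", positive, negative))
print(is_promoter("GCGCGC", positive, negative))
```

Training a separate model on negative examples gives each sequence a relative score, so no absolute cut-off has to be chosen for the positive model alone.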



To predict σ28-dependent promoter sequences of ten gamma-proteobacteria species, Song et al. (2007) used an alternative PWM-based approach named Position Specific Score Matrix (PSSM). The species chosen were *E. coli*, *Bacillus subtilis*, *Campylobacter jejuni*, *Helicobacter pylori*, *Streptomyces coelicolor*, *Corynebacterium glutamicum*, *Vibrio cholerae*, *Shewanella oneidensis*, *Xanthomonas oryzae* and *Xanthomonas campestris*. This approach involved two steps: *(i)* simple pattern matching with the short *E. coli* σ28 promoter consensus sequence (TAAAG-N14-GCCGATAA) to predict σ28 promoters upstream of mobility and chemotaxis genes in the test species; *(ii)* use of these predicted promoters to generate a preliminary PSSM for each species. The total length of DNA analyzed for each bacterium was between $4 \times 10^5$ bp and $7 \times 10^5$ bp. The cut-off values were set to control the false positive rate at 1 in every $5 \times 10^5$ bp of sequence analyzed, using a random DNA sequence of $5 \times 10^7$ bp. Although performance measures were not presented by the authors, this paper is devoted to predicting promoter sequences other than those recognized by σ70, and it shows interesting results about the σ28 consensus promoter sequences.
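The first step, simple pattern matching with the consensus TAAAG-N14-GCCGATAA, can be expressed as a regular expression. The Python sketch below is an illustration only: it assumes exact matches to both conserved blocks, scans a single strand, and reports non-overlapping hits, none of which is stated in the description of the original method. The function name and the test sequence are hypothetical.

```python
import re

# TAAAG, then any 14 nucleotides (N14), then GCCGATAA.
SIGMA28_CONSENSUS = re.compile(r"TAAAG[ACGT]{14}GCCGATAA")


def find_consensus_hits(sequence):
    """Return the 0-based start positions of exact consensus matches
    on the given strand (non-overlapping, left to right)."""
    return [m.start() for m in SIGMA28_CONSENSUS.finditer(sequence.upper())]


# Hypothetical upstream region containing one match starting at index 2.
upstream = "CC" + "TAAAG" + "A" * 14 + "GCCGATAA" + "GG"
hits = find_consensus_hits(upstream)
print(hits)

# A spacer of 13 instead of 14 nucleotides is not reported.
short_spacer = "TAAAG" + "A" * 13 + "GCCGATAA"
print(find_consensus_hits(short_spacer))
```

Because the spacer length is fixed by the pattern, promoters with a different spacing are missed, which is one reason why such matches are used only to seed a more flexible matrix model.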

PWM models are commonly used because they are a simple predictive approach. Moreover, they are a convenient way to account for the fact that some positions are more conserved than others (Stormo, 2000). However, in a large number of sequences the consensus can be insufficiently conserved: the sequences present insertions, deletions or variable spacing between elements, or the elements are difficult to define. In such cases, this approach yields many false predictions (Kalate *et al.*, 2003). Another limitation is the assumption that the occurrence of a given nucleotide at a position is independent of the occurrence of nucleotides at other positions (Stormo, 2000). Additionally, the results are highly influenced by the cut-off value chosen, since low cut-off values lead to a high false-positive rate, while high cut-off values lead to a high false-negative rate (Song et al., 2007).
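The effect of the cut-off value can be seen by applying two thresholds to the same set of matrix scores. In the following Python sketch the scores are invented toy values and the function name is hypothetical; a sequence is predicted as a promoter when its score is greater than or equal to the cut-off.

```python
def error_rates(promoter_scores, background_scores, cutoff):
    """Return (false-positive rate, false-negative rate) for a cut-off.
    A sequence is predicted as promoter when score >= cutoff."""
    false_negatives = sum(1 for s in promoter_scores if s < cutoff)
    false_positives = sum(1 for s in background_scores if s >= cutoff)
    return (
        false_positives / len(background_scores),
        false_negatives / len(promoter_scores),
    )


# Hypothetical PWM scores for known promoters and for background sequences.
promoter_scores = [8.1, 6.4, 5.2, 3.9]
background_scores = [4.5, 3.0, 2.2, 1.1]

low = error_rates(promoter_scores, background_scores, cutoff=2.0)
high = error_rates(promoter_scores, background_scores, cutoff=6.0)
print("low cut-off  (FP rate, FN rate):", low)
print("high cut-off (FP rate, FN rate):", high)
```

With the low cut-off every promoter is recovered but three of the four background sequences are also accepted; with the high cut-off no background sequence passes but half of the promoters are lost.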

## **3.3. Machine Learning**

Machine Learning (ML) concerns the development of computer algorithms that allow a machine to learn from examples. Classification (or pattern recognition) is an important application of ML techniques in bioinformatics because these techniques can capture hidden knowledge from data, even when the underlying relationships are unknown or hard to describe. Additionally, they can recognize complex patterns automatically and distinguish exemplars based on these patterns (Cen *et al.*, 2010; Sivarao *et al.*, 2010).

ML approaches usually split the data set into training and test groups. The classifier learns from the training examples, and the examples that were not exposed to it during training are then used to test the classification model. Among all ML techniques, Support Vector Machine (SVM) and Artificial Neural Network (ANN) applications have produced promising results in the promoter prediction problem. For this reason, the purpose of this section is to explain the basic ideas of these two ML approaches.

*3.3.1. Support vector machines*

SVM has been applied to identify important biological elements, including proteins, promoters and TSS, among others. This technique is used in bioinformatics not only because it can represent complex nonlinear functions but also because it is flexible in modeling diverse sources of data. The approach, introduced by Vapnik and his collaborators in 1992, is usually implemented as a binary classifier and rests on two key concepts: the separation of the data set into two classes by a hyperplane, and the application of supervised learning algorithms known as kernel machines (Ben-Hur et al., 2008; Kapetanovic et al., 2004; Polat & Günes, 2007). In simple terms (Figure 3), SVM classifies the data by *(i)* drawing a straight line that separates the positive examples on one side from the negative examples on the other and *(ii)* computing the similarity of two points with the kernel function (Ben-Hur et al., 2008). The kernel function is crucial for SVM, since the knowledge in the data set is captured only if a suitable kernel is defined (Ben-Hur et al., 2008). Further information and the mathematical background of SVM can be found in Abe (2010), Ben-Hur et al. (2008), and Zhang (2010).

**Figure 3.** Representation of the basic idea of the SVM classification

Several published papers are devoted to promoter prediction using SVM. L. Gordon et al. (2003) applied an SVM with an alignment kernel to two different data sets: promoters and coding regions, and promoters and non-promoter intergenic regions. The average error achieved was 16.5% and 18.6%, respectively. This method is preferable when a sufficient number of promoter regions is known but nothing may be known about their composition (L. Gordon et al., 2003). The tool is available online at http://nostradamus.cs.rhul.ac.uk/~leo/sak\_demo/. Another SVM carried out by J. J. Gordon
