as BacPP, was developed by weighting rules extracted from ANNs trained with promoter sequences known to respond to a specific σ factor. The information obtained from the rules was weighted to optimize promoter prediction and the classification of sequences according to the σ factor that recognizes them. The accuracy results for *E. coli* were 86.9%, 92.8%, 91.5%, 89.3%, 97.0% and 83.6% for σ24-, σ28-, σ32-, σ38-, σ54- and σ70-dependent promoter sequences, respectively. As reported by the authors, sensitivity and specificity showed similar values, indicating that this tool reduces the false positive rate. In contrast to tools previously reported in the literature, BacPP is not only able to identify bacterial promoters in background genome sequence, but is also designed to provide pragmatic classification according to σ factor. By separating the promoter sequences according to the σ factor that recognizes them, the authors demonstrated that the current limits of promoter prediction and classification can be overcome. Moreover, when applied to a set of promoters from diverse enterobacteria, the accuracy of BacPP was 76%, indicating that this tool can be reliably extended beyond the *E. coli* model.

Despite the ability of ANNs to capture imprecise and incomplete patterns, such as individual promoter motifs including mismatches (Cotik et al., 2005), this ML approach presents some intrinsic difficulties. Many decisions related to the choice of ANN structure and parameters are often completely subjective. The final ANN solution may be influenced by a number of factors (e.g., starting weights, number of cases, number of training cycles). In addition, overtraining must be avoided, since it results in an ANN that memorizes the data instead of generalizing from them (Kapetanovic et al., 2004).

**3.4. Other approaches**

The symbolic representation of DNA nucleotides by the letters A, T, G and C has led to many studies aiming to understand its structure through distributions, complexities, redundancy and statistical regularities (Krishnamachari et al., 2004). All of this information has the theoretical potential to be a distinguishing feature of promoter sequences. Some papers are devoted to applying these features, either alone or in combination with other approaches, to improve promoter prediction results.

Kanhere and Bansal (2005b) developed their own promoter recognition approach based on differences in DNA stability between promoter and coding regions. That tool was improved by Rangannan and Bansal (2007) and achieves a sensitivity of 98% but a precision of only 55%. The authors claim that this stability-based approach can be used to annotate entire genome sequences for promoter regions. According to the authors, the low precision could be improved by combining the method with other sequence-based methods. Additionally, they argue that this method can be used to investigate characteristic properties of specific subclasses of promoters, as well as other functional elements that do not exhibit obvious consensus sequences.

Jacques et al. (2006) describe a novel approach based on matrices representing the genomic distribution of hexanucleotide pairs. The principal strategy was based on the observation that the promoters are over-represented in intergenic regions relative to the whole genome.

**4. Conclusions**

A brief survey of current *E. coli* promoter information and of the approaches for promoter recognition and prediction was presented. To improve *in silico* promoter prediction, an understanding of the biological mechanisms of promoter sequences is necessary. In this respect, comprehensive analyses of bacterial promoter sequences have revealed that sequence-dependent properties are important and can be exploited in developing *in silico* tools for promoter prediction.

The currently available approaches described in this paper attempt to reduce the number of false predictions. Recent bioinformatics applications increasingly take DNA structural features into account and incorporate this kind of information into promoter detection tools. Some works show the advantage of using feature selection or extraction as an important part of pattern recognition, since this procedure can decrease the computational cost and increase classification performance (Polat and Günes, 2009). One of the goals of promoter recognition is to locate promoter regions in the genome sequence. Predicting promoters on a genome-wide scale is problematic because of the large number of false positive predictions caused by the large amount of DNA analyzed. It is important to consider that no classification method is universally better than the others, since each method has a class of target functions for which it is best suited (Bradley, 1997).
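The genome-scale false positive problem can be made concrete with a small calculation. The Python sketch below is illustrative only: the function name `expected_scan_outcome`, the window counts and the sensitivity and specificity values are hypothetical numbers chosen for the arithmetic, not results from any of the tools surveyed here. It assumes a classifier with fixed sensitivity and specificity applied first to a balanced test set and then to every candidate window of a genome a few megabases long.

```python
def expected_scan_outcome(n_windows, n_true_promoters, sensitivity, specificity):
    """Expected true positives, false positives and precision for a scan.

    n_windows        -- total candidate sites examined (promoters + background)
    n_true_promoters -- how many of those sites are real promoters
    sensitivity      -- fraction of real promoters that are detected
    specificity      -- fraction of background sites correctly rejected
    """
    n_background = n_windows - n_true_promoters
    true_pos = sensitivity * n_true_promoters
    false_pos = (1.0 - specificity) * n_background
    predicted = true_pos + false_pos
    precision = true_pos / predicted if predicted > 0 else 0.0
    return true_pos, false_pos, precision


# Hypothetical balanced test set: 1,000 promoters and 1,000 background sequences.
balanced = expected_scan_outcome(2_000, 1_000, sensitivity=0.90, specificity=0.95)

# Hypothetical genome scan: the same 1,000 promoters among 4,000,000 windows.
genome = expected_scan_outcome(4_000_000, 1_000, sensitivity=0.90, specificity=0.95)

for label, (tp, fp, prec) in (("balanced set", balanced), ("genome scan", genome)):
    print(f"{label}: TP={tp:.0f}  FP={fp:.0f}  precision={prec:.4f}")
```

Under these assumed values the classifier itself does not change between the two runs, yet precision drops sharply in the genome scan, because the number of false positives grows with the amount of background DNA while the number of true promoters stays fixed.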


Cotik, V.; Zaliz, R. R.; Zwir, I. (2005). A hybrid promoter analysis methodology for prokaryotic genomes. *Fuzzy Sets and Systems*, Vol.152, No.1, (May, 2005), pp. 83-102.

Cowing, D. W.; Bardwell, J. C. A.; Craig, E. A.; Woolford, C.; Hendrix, R. W.; Gross, C. (1985). Consensus sequence for *Escherichia coli* heat-shock gene promoters. *Proc. Natl. Acad. Sci.*, Vol.80, pp. 2679–2683.

de Avila e Silva, S.; Echeverrigaray, S.; Gerhardt, G. J. L. (2011a). BacPP: Bacterial promoter prediction - A tool for accurate sigma-factor specific assignment in enterobacteria. *Journal of Theoretical Biology*, Vol.287, pp. 92-99.

de Avila e Silva, S.; Gerhardt, G. J. L.; Echeverrigaray, S. (2011b). Rules extraction from neural networks applied to the prediction and recognition of prokaryotic promoters. *Genetics and Molecular Biology*, Vol.34, No.2, pp. 353-360.

Demeler, B.; Zhou, G. (1991). Neural network optimization for *E. coli* promoter prediction. *Nucleic Acids Research*, Vol.19, pp. 1593-1599.

Fauteux, F.; Blanchette, M.; Strömvik, M. V. (2008). Seeder: discriminative seeding DNA motif discovery. *Bioinformatics*, Vol.24, No.20, pp. 2303–2307.

Fawcett, T. (2006). An introduction to ROC analysis. *Pattern Recognition Letters*, Vol.27, pp. 861–874.

Gabrielian, A.; Bolshoy, A. (1999). Sequence complexity and DNA curvature. *Computers and Chemistry*, Vol.23, pp. 263–274.

Galas, D. J.; Eggert, M.; Waterman, M. S. (1985). Rigorous pattern-recognition methods for DNA sequences: Analysis of promoter sequences from *Escherichia coli*. *Journal of Molecular Biology*, Vol.186, No.1, (November, 1985), pp. 117–128.

Gordon, J. J.; Towsey, M. W.; Hogan, J. M.; Mathews, S. A.; Timms, P. (2006). Improved prediction of bacterial transcription start sites. *Bioinformatics*, Vol.22, No.2, pp. 142-148.

Gordon, L.; Chervonenkis, A.; Gammerman, A. J.; Shahmuradov, I. A.; Solovyev, V. V. (2003). Sequence alignment for recognition of promoter regions. *Bioinformatics*, Vol.19, No.15, pp. 1964-1971.

Hawley, D. K.; McClure, W. R. (1983). Compilation and analysis of *Escherichia coli* promoter DNA sequences. *Nucleic Acids Research*, Vol.11, No.8, (April, 1983), pp. 2237–2255.

Helmann, J. D.; Chamberlin, M. J. (1987). DNA sequence analysis suggests that expression of flagellar and chemotaxis genes in *Escherichia coli* and *Salmonella typhimurium* is controlled by an alternative sigma factor. *Proc. Natl. Acad. Sci. USA*, Vol.84, pp. 6422–6424.

Hilal, N.; Ogunbiyi, O. O.; Al-Abri, M. (2008). Neural network modeling for separation of bentonite in tubular ceramic membranes. *Desalination*, Vol.228, pp. 175-182.

Hook-Barnard, I.; Johnson, X. B.; Hinton, D. M. (2006). *Escherichia coli* RNA polymerase enzyme of σ70-dependent promoter requiring a -35 DNA element and an extended -10 TGn motif. *Journal of Bacteriology*, Vol.188, pp. 8352-8359.

Holloway, D. T.; Kon, M.; DeLisi, C. (2007). Machine learning for regulatory analysis and transcription factor target prediction in yeast. *Systems and Synthetic Biology*, Vol.1, No.1, pp. 25–47.
