**5. Conclusion**

Clustering groups objects so that similar ones are placed in the same cluster. Its application to biological datasets is important because it can help identify natural groups of biological entities that may give insight into biomarkers. In this chapter, we review some clustering algorithms applied to biological data, as well as ensemble clustering approaches for biological data. Implementations of the K-means, C-means and HC algorithms, and the merging of these algorithms in an ensemble framework, are presented using two different datasets, protein and DLBCL-B. Two cluster validation indices, adjusted Rand and silhouette, are used to compare the partitions from the individual algorithms and from ensemble clustering. From Table 1, we conclude that merging individual partitions improves C-rand values, meaning that the ensemble approach finds partitions similar to the real partitions. The ensemble approach is coded as a Java application and is available upon request.

**Acknowledgement**

The authors thank Dilip Gautam for his contribution to this chapter. The work of Şadi Evren Şeker was supported by the Scientific Research Projects Coordination Unit of Istanbul University, project number YADOP-16728.

**Author details**

Şadi Evren Şeker

*Istanbul University, Turkey*

Harun Pirim

*King Fahd University of Petroleum and Minerals*


© 2012 Wang et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


**Research on Pattern Matching with Wildcards and Length Constraints: Methods and Completeness**

Haiping Wang, Taining Xiang and Xuegang Hu

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/48574

**1. Introduction**

The practical importance of the string matching problem is well established, and an immense amount of work has been done on it for typical word-processing applications. However, with developments in bioinformatics (Cole et al., 2005), information retrieval (Califf et al., 2003), pattern mining (Xie et al., 2010; Ji et al., 2007; He et al., 2007), etc., sequential Pattern Matching with Wildcards and Length constraints (PMWL) has attracted more and more attention. It is not difficult to think of realistic cases where PMWL plays an important role. Gusfield (1997) uses the *transcription factor* to illustrate the concept of a wildcard. A *transcription factor* is a protein that binds to specific locations in DNA and regulates the transcription of the DNA into RNA. In this way, production of the protein that the DNA codes for is regulated. Many transcription factors are known, and they can be separated into families characterized by specific substrings containing wildcards. Gusfield takes the *Zinc Finger*, a common transcription factor, as an example. It has the following signature:

CYS¢¢CYS¢¢¢¢¢¢¢¢¢¢¢¢¢HIS¢¢HIS

where CYS is the amino acid cysteine, HIS is the amino acid histidine, and ¢ marks a wildcard position. Gusfield also concludes that if the number of wildcards is bounded by a fixed constant, the problem can be solved in linear time.

Another representative example concerns the *promoter*. In bioinformatics, a *promoter* helps researchers quickly locate the starting position of an intron within sequences of hundreds of millions of *ACGT* characters. Among promoters, the *TATA* box is a common one (Manber & Baeza-Yates, 1991). It has very loose sequence specificity, so many *TATA* sequences are not
