**Ensemble Clustering for Biological Datasets**

Harun Pirim and Şadi Evren Şeker

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/49956

**1. Introduction**


NONCODE, http://www.bioinfo.org.cn/NONCODE/; NCBI Reference Sequence, ftp://ftp.ncbi.nih.gov/refseq/; UCSC Genome Bioinformatics Site, http://genome.ucsc.edu;

OncoDb HCC, http://oncodb.hcc.ibms.sinica.edu.tw/index.ht

Recent technologies and tools have generated enormous amounts of data in the bioinformatics domain. For example, microarrays measure the expression levels of tens of thousands of genes simultaneously on a single chip. The measurements are relative expression values for each gene, obtained through an image-processing task.

Biological data require both low- and high-level analysis to reveal significant information that sheds light on biological questions such as disease prediction and gene function annotation, and that guides new experiments. In that sense, researchers look for the effect of a treatment or of a change over a time course. For example, they may design a microarray experiment in which a biological organism is treated with a chemical substance, and then compare the observed gene expression values with the values before treatment. Such a treatment or change leads researchers to focus on groups of genes, or other biological molecules, that have significant relationships with each other under similar conditions. Gene class labels are usually unknown, since little information is available about the data. Hence, data analysis using an unsupervised learning technique is required. Clustering is an unsupervised learning technique used in diverse domains, including bioinformatics. Clustering assigns similar objects to the same cluster according to a cluster definition, or criterion, which is a measure of similarity between the objects. The idea is to find the most important groups (cliques) among the many present in the data. Therefore, clustering is widely used to obtain biologically meaningful partitions. However, there is no single best clustering approach for the problem at hand, and clustering algorithms are biased towards certain criteria. In other words, each clustering approach has its own objective and its own assumptions about the data.
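As a small illustration of similarity as a clustering criterion, the sketch below computes the Pearson correlation between gene expression profiles. The choice of Pearson correlation, the gene names and the expression values are assumptions made for illustration only; the chapter does not prescribe a particular similarity measure at this point.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical expression values for three genes across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # same trend as geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # opposite trend to geneA
}

sim_ab = pearson(profiles["geneA"], profiles["geneB"])
sim_ac = pearson(profiles["geneA"], profiles["geneC"])
print(f"geneA vs geneB: {sim_ab:.2f}")
print(f"geneA vs geneC: {sim_ac:.2f}")
```

Under this criterion, geneA and geneB would be candidates for the same cluster, while geneC would not. A different criterion, such as Euclidean distance, could group the same profiles differently, which is one way the bias of an individual algorithm arises.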

©2012 Pirim and Şeker, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The diversity of clustering algorithms can be exploited by merging the partitions they generate individually. Ensemble clustering provides a framework for merging individual partitions from different clustering algorithms, and it may generate more accurate clusters than the individual clustering approaches. Here, an ensemble clustering framework is implemented as described in [10] to aggregate results from K-means, hierarchical clustering and C-means algorithms. We employ C-means instead of the spectral clustering used in [10]. We also use different data sets. Two different biological datasets are used for each algorithm. A comparison of the results is presented. To evaluate the performance of the ensemble clustering approach, one internal and one external cluster validation index are used: Silhouette (S) [31] is the internal validation index and C-rand [23] is the external one. The chapter reviews some clustering algorithms and ensemble clustering methods, and includes implementation and conclusion sections.
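The idea of merging partitions and scoring the result against known labels can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework of [10]: it uses a co-association (pairwise co-occurrence) count with a strict-majority threshold, it assumes that C-rand denotes the corrected (adjusted) Rand index, and the three input partitions are hypothetical label vectors standing in for the outputs of K-means, hierarchical clustering and a hardened C-means result.

```python
from collections import Counter
from math import comb

def co_association_counts(partitions):
    """counts[i][j] = number of partitions that put objects i and j together."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return counts

def consensus(partitions):
    """Link objects that co-occur in a strict majority of partitions and
    return the connected components of that graph as cluster labels."""
    n, m = len(partitions[0]), len(partitions)
    counts = co_association_counts(partitions)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and 2 * counts[i][j] > m:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def adjusted_rand(a, b):
    """Adjusted (corrected-for-chance) Rand index between two labelings."""
    pairs_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(len(a), 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:  # both labelings are trivial
        return 1.0
    return (pairs_both - expected) / (maximum - expected)

# Hypothetical reference labels and three imperfect base partitions.
truth = [0, 0, 0, 1, 1, 1]
base_partitions = [
    [0, 0, 0, 1, 1, 1],  # agrees with the reference
    [1, 1, 0, 0, 0, 0],  # object 2 misplaced (cluster names are arbitrary)
    [1, 0, 0, 1, 1, 1],  # object 0 misplaced
]

merged = consensus(base_partitions)
for labels in base_partitions:
    print("base partition score:", round(adjusted_rand(truth, labels), 3))
print("consensus score:", round(adjusted_rand(truth, merged), 3))
```

In this toy case the errors of the individual partitions do not coincide, so the majority vote recovers the reference grouping. Thresholded connected components can chain unrelated objects together when base partitions share the same error, which is one reason practical frameworks use more careful consensus functions.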

Another method, the Self-organizing Map (SOM), is one of the machine-learning techniques widely used in gene clustering. A recent study is [14]. SOM requires a grid-structured input, which makes it ineffective.

(a) Initial clusters (b) Final clusters

Graph-theoretical clustering techniques exist in which the genomic data are represented by the nodes and edges of a graph. Network methods have been applied to identify and characterize various biological interactions [13]. Identification of clusters using networks is

Hierarchical clustering (HC) algorithms are also widely used and are of two types: agglomerative and divisive. In the agglomerative approach, objects start in separate clusters, which are merged until all objects are in the same cluster, as seen in Figure 4. Two important drawbacks of HC algorithms are that they are not robust and that they have high computational complexity. HC algorithms are "greedy", which often means that the final solution is suboptimal because locally optimal choices made in the initial steps turn out to be poor choices with respect to the global solution. A recent study is [26].
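The agglomerative procedure described for hierarchical clustering can be sketched as below. This is a minimal illustration, assuming single linkage and Euclidean distance (the text does not fix a linkage rule), with hypothetical two-dimensional points standing in for expression profiles.

```python
import math
from itertools import combinations

def single_linkage(a, b, points):
    """Smallest distance between any member of cluster a and any of b."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)

def agglomerate(points):
    """Start with one cluster per object and greedily merge the closest
    pair until a single cluster remains. Returns the merge history as a
    list of (cluster_a, cluster_b, distance) tuples."""
    clusters = [(i,) for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # Greedy step: the locally best merge, never revisited later.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_linkage(pair[0], pair[1], points))
        distance = single_linkage(a, b, points)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        history.append((a, b, distance))
    return history

# Hypothetical data: two tight pairs and one distant object.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (11.0, 0.0)]

history = agglomerate(points)
for a, b, distance in history:
    print(f"merge {a} + {b} at distance {distance:.2f}")
```

Each pass rescans all cluster pairs, so this naive version is far from efficient and illustrates the computational cost mentioned in the text. Because a merge is never undone, an early poor merge propagates to every later level of the hierarchy.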

**Figure 2.** Initial clusters

**Figure 3.** Iteration of K-means
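The steps named in the captions of Figures 2 and 3, an initial assignment followed by repeated K-means iterations, can be sketched as follows. This is a minimal illustration assuming Euclidean distance and user-supplied initial centroids; the data points and starting centroids are hypothetical and are not those shown in the figures.

```python
import math

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda k: math.dist(p, centroids[k]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)  # keep the centroid of an empty cluster
            continue
        new_centroids.append(
            tuple(sum(coord) / len(members) for coord in zip(*members)))
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Alternate update and assignment steps until labels stop changing."""
    labels = assign(points, centroids)  # initial clusters
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical data with two groups; both starting centroids lie in the
# first group, so the initial clusters are poor.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
initial_centroids = [(1.0, 1.0), (2.0, 1.0)]

initial_labels = assign(points, initial_centroids)
labels, centroids = kmeans(points, initial_centroids)
print("initial clusters:", initial_labels)
print("final clusters:  ", labels)
```

The result depends on the starting centroids, so in practice the algorithm is usually run from several initializations.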
