**4. Implementation**


fuzzy t-norms algorithm, and finally, the fuzzy K-means algorithm is applied to the rows of the similarity matrix to obtain the consensus clustering. It is demonstrated that the proposed ensemble approach is competitive with the other ensemble methods.

High-throughput data may be generated by microarray experiments. If the dataset is very large, it is possible to generate an ensemble of clustering solutions, or to partition the data so that clustering may be performed on tractable-sized disjoint subsets [20]. The data can then be distributed at different sites, for which a distributed clustering solution with a final merging of partitions is a natural fit. [20] introduce two new approaches to combining partitions represented by sets of cluster centers. It is stated that these approaches provide a final partition of the data that is comparable to those of the best existing approaches, while being up to 100,000 times faster and using much less memory. The new algorithms are compared with the best existing cluster ensemble approaches that cluster all of the data at once, and with a clustering algorithm designed for very large datasets; fuzzy and hard K-means based clustering algorithms are used for the comparison. It is demonstrated that the centroid-based ensemble merging algorithms presented in the study generate partitions as good as those of the best label-vector method, or of the method that clusters all the data at once. The proposed algorithms are also more efficient in terms of speed.

[11] propose evidence accumulation clustering based on dual-rooted Prim tree cuts (EAC-DC). The proposed algorithm computes the co-association matrix based on a forward algorithm that repeatedly adds edges to Prim's minimum spanning tree (MST) to identify clusters until a stopping criterion is met. A consensus clustering is then generated from the co-association matrix using spectral partitioning. Here, an MST is a connected sub-graph with no cycles that spans all vertices with minimum total edge weight, and a dual-rooted tree is obtained as the union of two sub-trees. They test their approach using the Iris dataset [8], the Wisconsin breast cancer dataset [27] (both obtained from [9]) and synthetic datasets, and present a comparison of their results with other existing ensemble clustering methods.

[22] use a cluster ensemble in gene expression analysis. In the proposed ensemble framework, the partitions generated by each individual clustering algorithm are converted into a distance matrix. The distance matrices are then combined to construct a weighted graph, and a graph partitioning approach is used to generate the final set of clusters. It is reported that the ensemble approach yields better results than the best individual approach on both synthetic and yeast gene expression datasets.

[10] merge multiple partitions using evidence accumulation. Each partition generated by a clustering algorithm is treated as a new piece of knowledge that helps uncover the relationships between objects. For this chapter, we adopt their ensemble approach. Its core idea is to construct a co-association matrix by employing a voting mechanism over the partitions generated by the individual clustering algorithms. The co-association matrix *C* is constructed based upon the formulation below, where *n_ij* is the number of times the object pair (*i*, *j*) is assigned to the same cluster among the *N* different partitions:

*C*(*i*, *j*) = *n_ij* / *N*

After constructing the co-association matrix, [10] use single-linkage hierarchical clustering to obtain the new cluster tree (dendrogram), and then apply a cut-off value corresponding to the maximum lifetime (the difference between merge points where branching starts) on the tree.

We employ the ensemble approach described in [10]. Different sets of base clustering algorithms are chosen and applied to the protein and lymphoma datasets.

The protein dataset consists of 698 objects (corresponding to protein folds) with 125 attributes (samples). The real clusters correspond to the four classes of protein folds: the *α*, *β*, *α*/*β* and *α*+*β* protein classes. DLBCL-B is a 2-channel custom cDNA microarray dataset. It is a B-cell lymphoma dataset with three predefined subtypes [21].

The ensemble clustering algorithm stores each input file in a vector within an array-of-vectors data structure, so that memory can be allocated dynamically, and starts by loading the file contents into the vectors. It then processes the vectors and generates two temporary matrices whose dimensions equal the maximum vector length. The steps of the ensemble clustering algorithm are as follows:

```
Algorithm 1 Ensemble Clustering Algorithm
Require: partitions
Ensure: distance matrix

for i = 0 to max(V[n]) do
  for j = 0 to max(V[n]) do
    for k = 0 to n do
      if V[k].elementAt(i) = V[k].elementAt(j) then
        C[i][j] = C[i][j] + 1/n
      end if
      D[i][j] = 1 - C[i][j]
    end for
  end for
end for
```
Here, *n* is the number of input files (partitions), *V*[*n*] are the vectors holding the content of each file, *max*(*V*[*n*]) is the length of the longest vector, *C*[*i*][*j*] is the co-association matrix and *D*[*i*][*j*] is the distance matrix. The outer nested loops over *i* and *j* visit every element of the matrix, and for each element the innermost loop over *k* processes all the vectors. Whenever objects *i* and *j* fall into the same cluster of partition *k*, the corresponding co-association entry is incremented by 1/*n*. Finally, the distance matrix is computed as *D*[*i*][*j*] = 1 − *C*[*i*][*j*]. After obtaining the distance matrix, hierarchical clustering with complete linkage is used to generate the dendrogram, and the dendrogram is cut at a certain level to obtain the consensus partition.
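As a concrete sketch, Algorithm 1 can be implemented in Java roughly as follows. The class and method names are our own, not taken from the chapter's software, and partitions are assumed to be given as integer label vectors of equal length:

```java
import java.util.Arrays;

public class EnsembleDistance {
    // Computes the distance matrix D = 1 - C from n partitions,
    // where C[i][j] is the fraction of partitions that place
    // objects i and j in the same cluster (Algorithm 1).
    static double[][] distanceMatrix(int[][] partitions) {
        int n = partitions.length;      // number of partitions (files)
        int m = partitions[0].length;   // number of objects
        double[][] d = new double[m][m];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < m; j++) {
                double c = 0.0;
                for (int k = 0; k < n; k++) {
                    if (partitions[k][i] == partitions[k][j]) {
                        c += 1.0 / n;   // co-association vote
                    }
                }
                d[i][j] = 1.0 - c;      // distance = 1 - co-association
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // The two example partitions used later in the chapter.
        int[][] partitions = {
            {1, 1, 2, 1, 3, 3},
            {2, 2, 2, 1, 3, 3}
        };
        for (double[] row : distanceMatrix(partitions)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Run on the two example partitions discussed later in the chapter, this reproduces the distance matrix derived there.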

The ensemble approach is coded as a Java application, which is available upon request. The software allows the addition of multiple partitions to generate the distance matrix of the corresponding ensemble. Files containing the partitions can be added by clicking the "Add File" button, as seen in Figure 6. The distance matrix of the ensemble is generated with the "Calculate" button.



We employ hierarchical clustering (HC), K-means and C-means to obtain base partitions. K-means and hierarchical clustering are implemented using the R base package, and C-means is implemented using the R e1071 package. Silhouette and C-rand indices are utilized to evaluate the performance of the individual and ensemble algorithms; Silhouette and C-rand values are calculated using the R clusterSim and flexclust packages, respectively. Silhouette is an internal measure of compactness and separation of clusters [6]; its values range between -1 and 1, representing the worst and best values. C-rand is an external measure of agreement between two partitions; it has a maximum value of 1 and can take negative values. The Silhouette and C-rand values found by the base and ensemble algorithms are given in Table 1. The ensemble approach improves the clustering result for both the protein and DLBCL-B datasets. For the protein dataset, the ensemble approach finds a better C-rand value (0.157) than those of K-means and C-means (0.127). For DLBCL-B, the ensemble approach finds the best C-rand value (0.135), compared with the values generated by the individual clustering algorithms (0.021, 0.063, 0.098). However, the ensemble approach worsens the S values in most cases.

**Table 1.** Index values for base and ensemble algorithms

| Dataset | Method   | Num. of clusters | S value | C value |
|---------|----------|------------------|---------|---------|
| Protein | HC       | 4                | 0.344   | 0.199   |
| Protein | K-means  | 4                | 0.379   | 0.127   |
| Protein | C-means  | 4                | 0.379   | 0.127   |
| Protein | Ensemble | 4                | 0.078   | 0.157   |
| DLBCL-B | HC       | 3                | -0.034  | 0.021   |
| DLBCL-B | K-means  | 3                | -0.015  | 0.063   |
| DLBCL-B | C-means  | 2                | -0.005  | 0.098   |
| DLBCL-B | Ensemble | 3                | -0.017  | 0.135   |
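For reference, the external C-rand measure is the adjusted Rand index of Hubert and Arabie. A minimal Java sketch is given below; the chapter itself computes it with the R flexclust package, and the class name and the assumption that cluster labels run from 1 to K are ours:

```java
import java.util.Arrays;

public class AdjustedRand {
    // Adjusted Rand index (Hubert & Arabie) between two label vectors.
    // Labels are assumed to be integers in 1..K.
    static double adjustedRand(int[] a, int[] b) {
        int n = a.length;
        int ka = Arrays.stream(a).max().getAsInt();
        int kb = Arrays.stream(b).max().getAsInt();
        long[][] t = new long[ka][kb];          // contingency table
        for (int i = 0; i < n; i++) t[a[i] - 1][b[i] - 1]++;
        long[] rows = new long[ka];
        long[] cols = new long[kb];
        long sumCells = 0;
        for (int i = 0; i < ka; i++) {
            for (int j = 0; j < kb; j++) {
                rows[i] += t[i][j];
                cols[j] += t[i][j];
                sumCells += comb2(t[i][j]);     // sum of C(t_ij, 2)
            }
        }
        long sumRows = 0, sumCols = 0;
        for (long r : rows) sumRows += comb2(r);
        for (long c : cols) sumCols += comb2(c);
        double expected = (double) sumRows * sumCols / comb2(n);
        double maxIndex = (sumRows + sumCols) / 2.0;
        return (sumCells - expected) / (maxIndex - expected);
    }

    static long comb2(long x) { return x * (x - 1) / 2; }

    public static void main(String[] args) {
        // Two relabeled copies of the same partition agree perfectly (1.0).
        System.out.println(adjustedRand(
            new int[]{1, 1, 2, 2, 3, 3}, new int[]{3, 3, 1, 1, 2, 2}));
    }
}
```

Because the index is corrected for chance agreement, random partitions score near zero, which is why negative values are possible.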




**Figure 6.** File input interface

The output is displayed on a separate screen, as demonstrated in Figure 7. The output can be written to a file in csv format by clicking the "Output CSV" button.


Consider two different partitions of a dataset with six objects, (1, 1, 2, 1, 3, 3) and (2, 2, 2, 1, 3, 3); the algorithm's output is the distance matrix:

```
⎛ 0    0    0.5  0.5  1    1  ⎞
⎜ 0    0    0.5  0.5  1    1  ⎟
⎜ 0.5  0.5  0    1    1    1  ⎟
⎜ 0.5  0.5  1    0    1    1  ⎟
⎜ 1    1    1    1    0    0  ⎟
⎝ 1    1    1    1    0    0  ⎠
```
The distance matrix is used in hierarchical clustering with complete linkage, and the following dendrogram is generated. The dendrogram is cut at the level that gives three clusters. The corresponding partition is (1, 1, 1, 2, 3, 3), which is the same as the second partition (2, 2, 2, 1, 3, 3) up to cluster relabeling.

**Figure 8.** Example clusters
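The complete-linkage step itself can be sketched in Java as a naive agglomeration over the distance matrix. The class and method names are ours; note that at the 0.5 merge level there is a tie between joining the third and the fourth object to the first cluster, and the scan-order tie-break used here reproduces the partition above:

```java
import java.util.ArrayList;
import java.util.List;

public class CompleteLinkage {
    // Agglomerative hierarchical clustering with complete linkage,
    // stopped when k clusters remain (i.e. the dendrogram is cut at k).
    // Ties are broken by the first pair found in scan order.
    static List<List<Integer>> cluster(double[][] d, int k) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < d.length; i++) {
            List<Integer> c = new ArrayList<>();
            c.add(i);
            clusters.add(c);
        }
        while (clusters.size() > k) {
            int bestA = 0, bestB = 1;
            double best = Double.MAX_VALUE;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    // complete linkage: maximum pairwise distance
                    double link = 0;
                    for (int i : clusters.get(a))
                        for (int j : clusters.get(b))
                            link = Math.max(link, d[i][j]);
                    if (link < best) { best = link; bestA = a; bestB = b; }
                }
            }
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] d = {
            {0, 0, 0.5, 0.5, 1, 1},
            {0, 0, 0.5, 0.5, 1, 1},
            {0.5, 0.5, 0, 1, 1, 1},
            {0.5, 0.5, 1, 0, 1, 1},
            {1, 1, 1, 1, 0, 0},
            {1, 1, 1, 1, 0, 0}
        };
        // Groups objects 0-5 (0-based) into three clusters.
        System.out.println(cluster(d, 3));
    }
}
```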

