**4. Matching-based clustering algorithm**

As with any clustering algorithm, the main objective of matching-based clustering is to partition the data into relatively similar groups. The algorithm is defined for categorical data only; it can be modified to work with other data types as well, but that is out of the scope of this paper. The main idea is that, while there are still objects without clusters, the algorithm chooses features to drop based on their importance, updates the similarity matrix, and tries to cluster the objects based on the new SM. It uses the similarity matrix in which the similarity measure between two objects is defined by formula (10). We also use either the PGPI or the PPPI measure to choose the features to drop on each iteration. For notational convenience, we define *θ*<sub>*p*</sub> as the count of the remaining features on iteration *p*; the initial value is *θ*<sub>0</sub> = *m*. We consider two objects *i* and *j* to belong to the same cluster if *m*<sub>*i*,*j*</sub> = *θ*<sub>*p*</sub>, that is, if their categories coincide for all the features remaining on iteration *p*.
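As a concrete illustration of this matching criterion, the following minimal sketch (ours, not the authors' reference code; the name `match_count` is hypothetical) computes *m*<sub>*i*,*j*</sub> as the number of features on which two objects take the same category:

```python
def match_count(x, y):
    """Number of features on which two categorical objects agree (m_ij)."""
    return sum(1 for a, b in zip(x, y) if a == b)

# Two hypothetical objects described by five categorical features A..E.
o1 = ["a1", "b2", "c1", "d3", "e4"]
o2 = ["a1", "b2", "c1", "d3", "e1"]

m = match_count(o1, o2)   # 4 matching features
same_cluster = (m == 5)   # grouped only if all remaining features coincide
print(m, same_cluster)    # -> 4 False
```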

The algorithm consists of the following steps:

1. Construct the similarity matrix using formula (10) and set *θ*<sub>0</sub> = *m*.
2. Compute the importance of every feature using the PGPI or PPPI measure.
3. Assign every pair of objects *i*, *j* with *m*<sub>*i*,*j*</sub> = *θ*<sub>*p*</sub> to the same cluster.
4. If all the objects are clustered, or all the remaining features have the same importance, stop.
5. Drop the feature(s) with the lowest importance, update *θ*<sub>*p*</sub> and the similarity matrix, and recompute the importance of the remaining features.
6. If the similarity between two already formed clusters equals the new *θ*<sub>*p*</sub>, set the corresponding entries of the similarity matrix to zero so that these clusters cannot merge.
7. Return to step 3.

The algorithm stops if all the objects are clustered or the importance of the remaining features is the same.
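The following Python sketch is our minimal reading of these steps, not the authors' reference implementation; in particular, `grouping_power` assumes that the PGPI numerator counts the object pairs sharing a feature's category, clusters are taken as transitive groups of pairs with *m*<sub>*i*,*j*</sub> = *θ*<sub>*p*</sub>, and all names and the toy records are hypothetical.

```python
from itertools import combinations

def similarity_matrix(data, features):
    """m_ij: number of remaining features on which objects i and j agree."""
    n = len(data)
    S = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        S[i][j] = S[j][i] = sum(1 for f in features if data[i][f] == data[j][f])
    return S

def grouping_power(data, f):
    """Assumed PGPI numerator: number of object pairs sharing feature f's category."""
    return sum(1 for i, j in combinations(range(len(data)), 2) if data[i][f] == data[j][f])

def group(S, theta):
    """Step 3: transitively merge objects whose similarity equals theta."""
    n = len(S)
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i, j in combinations(range(n), 2):
        if S[i][j] == theta:
            parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def matching_based_clustering(data):
    features = list(data[0].keys())
    S = similarity_matrix(data, features)                          # step 1
    # Step 2: raw grouping power; the PGPI normalisation does not change the ranking.
    importance = {f: grouping_power(data, f) for f in features}
    while True:
        theta = len(features)
        clusters = group(S, theta)                                 # step 3
        all_clustered = all(len(c) > 1 for c in clusters)
        if all_clustered or len(set(importance.values())) == 1:    # step 4: stop conditions
            return clusters
        worst = min(importance.values())
        features = [f for f in features if importance[f] > worst]  # step 5: drop weakest feature(s)
        S = similarity_matrix(data, features)
        importance = {f: grouping_power(data, f) for f in features}
        for a, b in combinations(clusters, 2):                     # step 6: keep formed clusters apart
            if len(a) > 1 and len(b) > 1:
                for i in a:
                    for j in b:
                        if S[i][j] == len(features):
                            S[i][j] = S[j][i] = 0

# Toy usage with hypothetical categorical records (dicts: feature -> category).
toy = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a1", "B": "b1", "C": "c2"},
    {"A": "a1", "B": "b2", "C": "c3"},
    {"A": "a2", "B": "b2", "C": "c3"},
]
print(matching_based_clustering(toy))  # -> [[0, 1, 2], [3]] (toy data, not the chapter's example)
```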



To illustrate how the algorithm works, we apply it to a dataset in which 10 objects are described by five categorical features *A*, *B*, *C*, *D*, and *E* with [*a*<sub>1</sub>, *a*<sub>2</sub>], [*b*<sub>1</sub>, *b*<sub>2</sub>], [*c*<sub>1</sub>, *c*<sub>2</sub>, *c*<sub>3</sub>, *c*<sub>4</sub>], [*d*<sub>1</sub>, *d*<sub>2</sub>, *d*<sub>3</sub>, *d*<sub>4</sub>], and [*e*<sub>1</sub>, *e*<sub>2</sub>, *e*<sub>3</sub>, *e*<sub>4</sub>] unique categories, respectively. Thus, we initialize the algorithm by constructing the similarity matrix:

$$S_{10,10} = \begin{pmatrix} - & 5 & 4 & 4 & 0 & 0 & 1 & 0 & 0 & 1 \\ - & - & 4 & 4 & 0 & 0 & 1 & 0 & 0 & 1 \\ - & - & - & 4 & 0 & 0 & 0 & 1 & 0 & 0 \\ - & - & - & - & 0 & 1 & 0 & 0 & 0 & 0 \\ - & - & - & - & - & 2 & 4 & 2 & 3 & 4 \\ - & - & - & - & - & - & 2 & 4 & 2 & 2 \\ - & - & - & - & - & - & - & 2 & 2 & 5 \\ - & - & - & - & - & - & - & - & 2 & 2 \\ - & - & - & - & - & - & - & - & - & 2 \\ - & - & - & - & - & - & - & - & - & - \end{pmatrix} \tag{13}$$

Then, the importance of each feature is assessed. In this example, we will use the PGPI measure. For instance, *PGPI<sub>A</sub>* will be:

$$PGPI_A = \frac{21}{21 + 21 + 10 + 10 + 9} = 0.296 \tag{14}$$

Respectively, *PGPI<sub>B</sub>* = 0.296, *PGPI<sub>C</sub>* = 0.14, *PGPI<sub>D</sub>* = 0.14, and *PGPI<sub>E</sub>* = 0.127. Then, as *θ*<sub>0</sub> = *m* = 5, all objects *i*, *j* with *m*<sub>*i*,*j*</sub> = 5 are grouped. As we can see, we obtain two clusters, [*O*<sub>1</sub>, *O*<sub>2</sub>] and [*O*<sub>7</sub>, *O*<sub>10</sub>]. As some objects are still left without a cluster allocation, we continue to the next step. In particular, as feature *E* has the lowest *PGPI*, we drop it and the similarity matrix is updated. Also, to avoid merging the already existing clusters, we additionally update the SM according to step 6. As the similarity between the clusters [*O*<sub>1</sub>, *O*<sub>2</sub>] and [*O*<sub>7</sub>, *O*<sub>10</sub>] is not equal to *θ*<sub>1</sub> = 4, we make no additional changes, and the data view and the corresponding similarity matrix are updated accordingly.
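In our notation (the chapter does not spell the update out this way), dropping a feature only decrements the similarity-matrix entries of the pairs that matched on it, so no pass over the raw data is needed; for the removal of *E*:

$$m^{(1)}_{i,j} = m^{(0)}_{i,j} - \mathbb{1}\!\left[x_{i,E} = x_{j,E}\right], \qquad \theta_1 = \theta_0 - 1 = 4,$$

where $\mathbb{1}[\cdot]$ is the indicator function and $x_{i,E}$ denotes the category of object $i$ on feature $E$.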


As *θ*<sub>1</sub> = 4, we obtain the clusters [*O*<sub>1</sub>, *O*<sub>2</sub>, *O*<sub>3</sub>, *O*<sub>4</sub>], [*O*<sub>5</sub>, *O*<sub>7</sub>, *O*<sub>10</sub>], and [*O*<sub>6</sub>, *O*<sub>8</sub>]. However, we still have one more object to assign to a cluster, so we drop *C* and *D* and update the similarity matrix accordingly.


But we also check the condition of step 6 between any pair of the existing clusters. As *θ*<sub>2</sub> = 2, the condition holds for [*O*<sub>5</sub>, *O*<sub>7</sub>, *O*<sub>10</sub>] and [*O*<sub>6</sub>, *O*<sub>8</sub>]. Thus, the values in the corresponding rows and columns that are equal to *θ*<sub>2</sub> = 2 are set to zero. The purpose of this modification is that, since we are dropping the features with the lowest grouping power, existing clusters become more likely to merge, and we may lose important local partitioning of the data points. Thus, the final updated similarity matrix will be:



| Cluster | Objects |
| --- | --- |
| 1 | *O*<sub>1</sub>, *O*<sub>2</sub>, *O*<sub>3</sub>, *O*<sub>4</sub> |
| 2 | *O*<sub>5</sub>, *O*<sub>7</sub>, *O*<sub>10</sub> |
| 3 | *O*<sub>6</sub>, *O*<sub>8</sub> |
| 4 | *O*<sub>9</sub> |

**Table 1.**

*Final form of the clustering.*

$$
\begin{pmatrix}
\end{pmatrix}
\tag{17}
$$

However, the second iteration does not group *O*<sub>9</sub>. At the same time, as the importance of the remaining features *A* and *B* is the same, the algorithm terminates and the object *O*<sub>9</sub> forms the fourth cluster. Thus, the final form of the clustering is presented in **Table 1**.

The algorithm has some unique characteristics worth mentioning. First, to achieve better performance, one can notice that all the changes required at each step are made only on the similarity matrix; there is no need to update the dataset. Second, there is no need for user-defined parameters, although one may be introduced, for instance, the required number of clusters to be created. Third, even though we introduced step 6 to avoid the merging of clusters and thereby achieve higher accuracy, this step can be omitted. In this case, the algorithm creates a tree where each leaf is a possible cluster, as in hierarchical clustering, and, based on a user-defined parameter, it can produce the required number of clusters. For instance, for our example above, the dendrogram is shown in **Figure 1**.
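As an illustration of this third point, the sketch below (ours; the helper name `cut_tree` and the list of levels are hypothetical) treats the partitions obtained at successive values of *θ*<sub>*p*</sub> as dendrogram levels and cuts the tree at a requested number of clusters; the two levels shown are the ones that appear in the example above.

```python
def cut_tree(levels, k):
    """levels: partitions from finest (largest theta) to coarsest.
    Return the first level with at most k clusters, i.e. a cut of the dendrogram."""
    for partition in levels:
        if len(partition) <= k:
            return partition
    return levels[-1]

# Partitions observed in the worked example: theta = 5, then theta = 4.
levels = [
    [["O1", "O2"], ["O7", "O10"], ["O3"], ["O4"], ["O5"], ["O6"], ["O8"], ["O9"]],
    [["O1", "O2", "O3", "O4"], ["O5", "O7", "O10"], ["O6", "O8"], ["O9"]],
]
print(cut_tree(levels, 4))  # -> the four-cluster level
```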

Fourth, as the algorithm is based on either the grouping or the partitioning power of the features, this information can be used to understand the data better. For instance, the algorithm can serve as a subroutine for selecting features for other clustering algorithms. The main disadvantage is that it may create too many clusters.
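For instance, a downstream method could keep only the features whose importance exceeds some threshold. A hypothetical sketch (ours) using the PGPI values computed above, with the uniform baseline 1/*m* as the threshold:

```python
# PGPI values from the example above (feature name -> importance).
pgpi = {"A": 0.296, "B": 0.296, "C": 0.14, "D": 0.14, "E": 0.127}

# Keep the features whose importance is above the uniform baseline 1/m,
# then hand only those columns to another clustering algorithm.
m = len(pgpi)
selected = [f for f, v in sorted(pgpi.items(), key=lambda kv: -kv[1]) if v > 1 / m]
print(selected)  # -> ['A', 'B']
```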
