**3. Categorical data overview**

Data comes in various forms: numeric, categorical, mixed, spatial, and so on. The analysis of each type poses unique challenges, and categorical data is no exception. This type of data is widely used in the political, social, and biomedical sciences. For instance, measures of attitudes and opinions are often assessed with categorical data, and the outcomes of medical treatments can also be categorical. Even though the mentioned fields have had the largest influence on the development of methods for categorical data, this type of data commonly occurs in other areas such as marketing, behavioral science, education, psychology, public health, and engineering.

Generally speaking, categorical data is data in which objects are described by categorical features. Categorical features can be measured on two types of scales: nominal and ordinal. A nominal scale has unordered categories, whereas an ordinal scale has ordered categories, but the intervals between the categories are unknown. In this chapter, we focus only on categorical features with nominal scales.

For the sake of notation, consider a multidimensional dataset **D** containing **n** objects, where each object is described by **m** categorical features and feature $l$ has $k\_l$ unique categories. Thus, the dataset **D** can be viewed as the matrix below:

$$D\_{n,m} = \begin{pmatrix} a\_{1,1} & a\_{1,2} & \cdots & a\_{1,m} \\ a\_{2,1} & a\_{2,2} & \cdots & a\_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ a\_{n,1} & a\_{n,2} & \cdots & a\_{n,m} \end{pmatrix} \tag{1}$$

where each object is described by a set of categories $O\_i = [a\_{i,1}, a\_{i,2}, a\_{i,3}, \dots, a\_{i,m}]$. Since categorical attributes take discrete values with no inherent order, applying distance measures such as the $l\_p$-norm will produce inaccurate results. The most common approach to overcoming this limitation is to apply a data transformation technique; for instance, one can binarize the data and then apply the distance measure. On the other hand, the traditional way of comparing two objects with categorical features is simply to check whether their categories coincide. If the categories of all the features under consideration match, the objects can be viewed as similar. This does not mean they are the same, because they can be distinguished by other features. Thus, researchers have proposed various similarity measures instead of requiring all the features to match. The most popular is the overlap measure, according to which the similarity between two objects $O\_x = [a\_{x,1}, a\_{x,2}, a\_{x,3}, \dots, a\_{x,m}]$ and $O\_y = [a\_{y,1}, a\_{y,2}, a\_{y,3}, \dots, a\_{y,m}]$ is assessed by:

$$\mathrm{Ov}\left(O\_x, O\_y\right) = \frac{1}{m} \sum\_{i=1}^{m} S\_i, \quad \text{where } S\_i = \begin{cases} 1 & \text{if } a\_{x,i} = a\_{y,i} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

It takes values in [0, 1]; the closer the value is to one, the higher the similarity between the objects.
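As a quick illustration, the following minimal Python sketch implements Eq. (2); the function name `overlap` and the list-based object representation are our own assumptions for the example.

```python
def overlap(ox, oy):
    """Overlap similarity (Eq. 2): fraction of features whose categories match."""
    return sum(a == b for a, b in zip(ox, oy)) / len(ox)

# Two objects described by m = 3 categorical features
print(overlap(["red", "small", "round"], ["red", "large", "round"]))  # 0.666...
```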

When applying the overlap measure, one can notice that the probability of finding another object with the same categories rapidly decreases as the number of features and the number of unique categories per feature increase. To illustrate this, one can calculate the probability of finding another object with the same categories as object $O\_x$ using the formula below:

$$P = \prod\_{i=1}^{m} \frac{f(a\_{x,i}) - 1}{k\_i(n-1)} \tag{3}$$

where $f(a\_{x,i})$ is the frequency of category $a\_{x,i}$ in the dataset and $k\_i$ is the number of unique categories of feature $i$.
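Taking Eq. (3) at face value, the sketch below transcribes it directly, reading the denominator's category count as $k\_i$, the number of unique categories of feature $i$; the function name and the list-of-lists data layout are our own assumptions.

```python
from collections import Counter

def prob_same_object(data, x):
    """Direct transcription of Eq. (3) for the object with index x."""
    n = len(data)
    cols = [Counter(col) for col in zip(*data)]  # per-feature category frequencies
    p = 1.0
    for i, counts in enumerate(cols):
        k_i = len(counts)  # unique categories of feature i
        p *= (counts[data[x][i]] - 1) / (k_i * (n - 1))
    return p

data = [["c1", "b1"], ["c2", "b2"], ["c1", "b2"], ["c2", "b1"]]
print(prob_same_object(data, 0))  # ((2 - 1) / (2 * 3)) ** 2 = 1/36
```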

If we consider the number of objects constant, the probability of finding another similar object depends on the number of features and the number of unique categories of each feature. It can be seen from the formula above that, as the number of attributes or the number of categories increases, the probability of finding another matching object rapidly decreases. A further problem is that the overlap measure gives equal weight to all features and does not take into account the importance of each feature in partitioning the data. Researchers have therefore proposed more efficient ways of assessing similarity that take into account the frequency of each category in the dataset. Various similarity measures are based on this concept, for instance Goodall [33] and Lin [34]:

$$\begin{aligned} \text{Goodall}(O\_x, O\_y) &= \frac{1}{m} \sum\_{i=1}^{m} S\_i \\ \text{where} \quad S\_i &= \begin{cases} 1 - \frac{f(a\_{x,i})(f(a\_{x,i}) - 1)}{n(n-1)} & \text{if } a\_{x,i} = a\_{y,i} \\ 0 & \text{otherwise} \end{cases} \end{aligned} \tag{4}$$

$$\text{Lin}(O\_x, O\_y) = \frac{1}{m} \sum\_{i=1}^{m} S\_i, \quad \text{where } S\_i = \begin{cases} 2 \log\left(\frac{f(a\_{x,i})}{n}\right) & \text{if } a\_{x,i} = a\_{y,i} \\ 2 \log\left(\frac{f(a\_{x,i})}{n} + \frac{f(a\_{y,i})}{n}\right) & \text{otherwise} \end{cases} \tag{5}$$
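The frequencies in Eqs. (4) and (5) can be computed directly from the data matrix. The sketch below is one possible reading; the helper `column_freqs`, the list-of-lists layout, and the use of natural logarithms in Lin are our assumptions.

```python
from collections import Counter
from math import log

def column_freqs(data):
    """Frequency of each category, per feature (column) of the data matrix."""
    return [Counter(col) for col in zip(*data)]

def goodall(ox, oy, freqs, n):
    """Goodall similarity (Eq. 4), averaged over the m features."""
    total = 0.0
    for i, (a, b) in enumerate(zip(ox, oy)):
        if a == b:
            f = freqs[i][a]
            total += 1 - f * (f - 1) / (n * (n - 1))
    return total / len(ox)

def lin(ox, oy, freqs, n):
    """Lin similarity (Eq. 5), using relative frequencies and natural logs."""
    total = 0.0
    for i, (a, b) in enumerate(zip(ox, oy)):
        if a == b:
            total += 2 * log(freqs[i][a] / n)
        else:
            total += 2 * log(freqs[i][a] / n + freqs[i][b] / n)
    return total / len(ox)

data = [["c1", "b1"], ["c2", "b2"], ["c1", "b2"], ["c2", "b1"]]
freqs, n = column_freqs(data), len(data)
print(goodall(data[0], data[2], freqs, n))  # 5/12: only the first feature matches
```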

Nevertheless, there are still cases when the use of similarity measures can be misleading. For instance, consider the dataset below with four objects and two categorical attributes with categories $[c\_1, c\_2]$ and $[b\_1, b\_2]$, respectively:

$$D\_{4,2} = \begin{pmatrix} c\_1 & b\_1 \\ c\_2 & b\_2 \\ c\_1 & b\_2 \\ c\_2 & b\_1 \end{pmatrix} \tag{6}$$

According to the measures presented above, the similarity between each unique pair of these objects will be:

| Pair | Overlap | Goodall |
| --- | --- | --- |
| $(O\_1, O\_2)$ | 0 | 0 |
| $(O\_1, O\_3)$ | 0.5 | 5/12 |
| $(O\_1, O\_4)$ | 0.5 | 5/12 |
| $(O\_2, O\_3)$ | 0.5 | 5/12 |
| $(O\_2, O\_4)$ | 0.5 | 5/12 |
| $(O\_3, O\_4)$ | 0 | 0 |

Since every category in this dataset occurs with the same frequency, Lin likewise yields identical values for all one-match pairs.
As one can see, these measures can be misleading: $O\_3$ and $O\_4$ can each be grouped with either $O\_1$ or $O\_2$, since the similarity values coincide. Similarity measures are therefore powerful tools, but they should be used with caution. In this regard, one may consider using a quantitative measure to compare the features and choose the relatively important ones; two objects are then considered similar if the categories of the selected features match. This is the main motivation for our approach.

Therefore, we employ several feature importance measures. We define the partial grouping power of feature $l$ in dataset **D** as the number of unique matching pairs on the feature divided by the total number of unique matching pairs in the dataset. This is based on the notion that a feature with a relatively higher number of matching pairs than the others is more likely to group objects. The $PGPI\_l$ can be assessed by:

$$PGPI\_{l} = \frac{\sum\_{s=1}^{k\_l} \frac{f(c\_s)\left(f(c\_s) - 1\right)}{2}}{\sum\_{i=1}^{m} \sum\_{j=1}^{k\_i} \frac{f(c\_j)\left(f(c\_j) - 1\right)}{2}} \tag{7}$$

where $c\_s$ is a unique category of feature $l$ and $f(c\_s)$ is the frequency of that category in the dataset; in the denominator, $c\_j$ ranges over the $k\_i$ unique categories of feature $i$. This measure takes values in [0, 1]; the closer the value is to one, the higher the importance of the feature in grouping the objects.
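A minimal sketch of Eq. (7), assuming a list-of-lists data matrix; the function names are ours.

```python
from collections import Counter

def matching_pairs(col):
    """Number of unique matching pairs within one feature: sum of f*(f-1)/2."""
    return sum(f * (f - 1) // 2 for f in Counter(col).values())

def pgpi(data, l):
    """Partial grouping power (Eq. 7) of feature l (0-indexed column)."""
    cols = list(zip(*data))
    total = sum(matching_pairs(c) for c in cols)
    return matching_pairs(cols[l]) / total

data = [["c1", "b1"], ["c2", "b2"], ["c1", "b2"], ["c2", "b1"]]
print(pgpi(data, 0))  # both features have 2 matching pairs each -> 0.5
```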

We also define a measure of the partitioning power of a feature. We define the partial partitioning power of feature $l$ in dataset **D** as the number of unique mismatching pairs on the feature divided by the total number of unique mismatching pairs in the dataset. The $PPPI\_l$ can be assessed by:

$$PPPI\_l = \frac{\frac{n(n-1)}{2} - \sum\_{s=1}^{k\_l} \frac{f(c\_s)\left(f(c\_s) - 1\right)}{2}}{\sum\_{i=1}^{m} \left(\frac{n(n-1)}{2} - \sum\_{j=1}^{k\_i} \frac{f(c\_j)\left(f(c\_j) - 1\right)}{2}\right)} \tag{8}$$

This measure takes values in [0, 1]; the closer the value is to one, the higher the importance of the feature in partitioning the objects. Both measures can be used in any analysis. However, as objects can be grouped or separated differently depending on the features under consideration, one of the measures may perform better than the other.
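Analogously, a minimal sketch of Eq. (8), under the same list-of-lists assumptions as before:

```python
from collections import Counter

def mismatching_pairs(col):
    """Unique mismatching pairs within one feature: n*(n-1)/2 minus matching pairs."""
    n = len(col)
    matching = sum(f * (f - 1) // 2 for f in Counter(col).values())
    return n * (n - 1) // 2 - matching

def pppi(data, l):
    """Partial partitioning power (Eq. 8) of feature l (0-indexed column)."""
    cols = list(zip(*data))
    total = sum(mismatching_pairs(c) for c in cols)
    return mismatching_pairs(cols[l]) / total

data = [["c1", "b1"], ["c2", "b2"], ["c1", "b2"], ["c2", "b1"]]
print(pppi(data, 0))  # 4 mismatching pairs out of 8 in total -> 0.5
```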

We also present another measure for assessing feature importance, based on the similarity matrix, which is defined as the matrix below:

$$SM\_{n,n} = \begin{pmatrix} m\_{1,1} & m\_{1,2} & \cdots & m\_{1,n} \\ m\_{2,1} & m\_{2,2} & \cdots & m\_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ m\_{n,1} & m\_{n,2} & \cdots & m\_{n,n} \end{pmatrix} \tag{9}$$

where $m\_{i,j}$ is a similarity measure between objects $i$ and $j$, such as Overlap, Lin, or Goodall. Throughout this chapter, we use the count of matches between two objects as the similarity measure:

$$m\_{i,j} = \sum\_{l=1}^{m} \gamma\_l, \quad \text{where } \gamma\_l = \begin{cases} 1 & \text{if } a\_{i,l} = a\_{j,l} \\ 0 & \text{otherwise} \end{cases} \tag{10}$$
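A minimal sketch of the construction of this matrix from Eq. (10), assuming a list-of-lists data layout; only the upper triangle is filled, anticipating the convention described below.

```python
def match_count(ox, oy):
    """Eq. (10): number of features on which two objects share the same category."""
    return sum(a == b for a, b in zip(ox, oy))

def similarity_matrix(data):
    """Upper-triangular similarity matrix of match counts; diagonal left at zero."""
    n = len(data)
    sm = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sm[i][j] = match_count(data[i], data[j])
    return sm

data = [["c1", "b1"], ["c2", "b2"], ["c1", "b2"], ["c2", "b1"]]
print(similarity_matrix(data))  # [[0,0,1,1],[0,0,1,1],[0,0,0,0],[0,0,0,0]]
```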

This measure is closely related to the Hamming distance, which counts mismatches rather than matches. The similarity matrix is symmetric, so only its upper triangular part is used in the calculations, and the diagonal is also ignored. For another measure of feature importance based on the similarity matrix, we define the general influence matrix as:

$$IM\_{n,n} = \begin{pmatrix} I\_{1,1} & I\_{1,2} & \cdots & I\_{1,n} \\ I\_{2,1} & I\_{2,2} & \cdots & I\_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ I\_{n,1} & I\_{n,2} & \cdots & I\_{n,n} \end{pmatrix}, \quad \text{where } I\_{i,j} = \begin{cases} 1 & \text{if } m\_{i,j} > \alpha \\ 0 & \text{otherwise} \end{cases} \tag{11}$$

where $\alpha$ is a threshold bounded by the range of values the similarity measure can take. In this chapter, we set $\alpha$ to 0. After the construction of this general influence matrix, the feature or subset of features under consideration is dropped and the influence matrix is recomputed; the resulting matrix is called the partial influence matrix of the corresponding feature or subset of features $l$. The partial grouping power of a feature or subset of features is then assessed by dividing the count of ones in the partial influence matrix by the count of ones in the general influence matrix:

$$PGPI2\_{l} = \frac{\eta\_{PIM\_l}}{\eta\_{GIM}} \tag{12}$$

where $\eta\_{PIM\_l}$ is the count of ones in $PIM\_l$ and $\eta\_{GIM}$ is the count of ones in the $GIM$. One can notice that these measures of feature importance depend only on the number of unique matches in the dataset; the number of categories of each feature does not influence them. In the next section, we present the matching-based clustering algorithm, which combines the feature importance measures and the similarity matrix to partition categorical data into homogeneous groups.
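Before moving on, here is a minimal sketch of the influence-matrix construction and Eq. (12), again assuming a list-of-lists data layout; the function names are ours.

```python
def influence_matrix(data, alpha=0, drop=()):
    """General influence matrix (Eq. 11): 1 where the match count of a pair
    exceeds alpha. Features listed in `drop` are excluded first, which
    yields the partial influence matrix of that feature or subset."""
    keep = [f for f in range(len(data[0])) if f not in drop]
    n = len(data)
    im = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; diagonal ignored
            matches = sum(data[i][f] == data[j][f] for f in keep)
            im[i][j] = 1 if matches > alpha else 0
    return im

def ones(mat):
    """Count of ones in a binary matrix."""
    return sum(map(sum, mat))

def pgpi2(data, l, alpha=0):
    """Eq. (12): ones in the partial influence matrix over ones in the GIM."""
    return ones(influence_matrix(data, alpha, drop=(l,))) / ones(influence_matrix(data, alpha))

data = [["c1", "b1"], ["c2", "b2"], ["c1", "b2"], ["c2", "b1"]]
print(pgpi2(data, 0))  # dropping feature 0 leaves 2 of the 4 influential pairs -> 0.5
```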
