**2. Categorical data clustering literature review**

Researchers have proposed various methods and algorithms for clustering categorical data. The most common approach is to transform the data into a binary dataset and then apply standard algorithms, with modifications where required. Nevertheless, scholars have developed a wide variety of algorithms for clustering categorical data in recent years. These algorithms can be grouped into five main classes: model-based, partition-based, density-based, hierarchical, and projection-based [12]. The main differences between these algorithms lie in how the similarity or distance between data points is defined and according to what criteria the clusters are formed.

Model-based clustering rests on the notion that the data come from a mixture model, most commonly a mixture of statistical distributions. Prior models are set up from user-specified parameters, and the algorithm then aims to recover the latent model, refining it on each iteration. The main disadvantage of this type of clustering is its reliance on user-specified parameters: if the assumptions are false, the results will be inaccurate. At the same time, the models may oversimplify the actual structure of the data. Another disadvantage of model-based clustering is that it can be slow on large datasets. Examples of model-based clustering algorithms include AutoClass [13], SVM clustering [14], and BILCOM Empirical Bayesian [15].
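As a concrete illustration of the idea (not any of the cited algorithms), the following is a minimal sketch of expectation-maximization for a mixture of independent categorical attributes; the chunk-based initialization, smoothing constant, and fixed iteration count are illustrative assumptions:

```python
import math

def em_categorical_mixture(data, k, n_iter=50):
    """EM for a mixture of k components, each modeling the attributes
    as independent categorical distributions (illustrative sketch)."""
    n, d = len(data), len(data[0])
    domains = [sorted({row[j] for row in data}) for j in range(d)]
    # crude deterministic initialization: split the data into k chunks
    resp = [[1.0 if (i * k) // n == c else 0.0 for c in range(k)]
            for i in range(n)]
    for _ in range(n_iter):
        # M-step: re-estimate mixing weights and per-attribute value probabilities
        weights = [sum(resp[i][c] for i in range(n)) / n for c in range(k)]
        probs = []
        for c in range(k):
            total = sum(resp[i][c] for i in range(n))
            probs.append([
                {v: (sum(resp[i][c] for i in range(n) if data[i][j] == v)
                     + 1e-6) / (total + 1e-6 * len(domains[j]))  # smoothed
                 for v in domains[j]}
                for j in range(d)])
        # E-step: posterior responsibility of each component for each object
        for i in range(n):
            lik = [weights[c] * math.prod(probs[c][j][data[i][j]]
                                          for j in range(d))
                   for c in range(k)]
            z = sum(lik)
            resp[i] = [l / z for l in lik]
    # hard assignment: most responsible component per object
    return [max(range(k), key=lambda c: resp[i][c]) for i in range(n)]
```

The sketch makes the two weaknesses noted above visible: the number of components `k` is a user-specified parameter, and the independence assumption may oversimplify the data's actual structure.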

Partition-based clustering algorithms are the most common ones. Their main advantage is fast processing on large datasets. The main concept is to define a representative for each cluster, allocate objects to clusters, redefine the representatives, and reassign objects based on dissimilarity measurements; this is repeated until the algorithm converges. The main drawback of this type of algorithm is that the number of clusters must be predefined by the user. Another disadvantage is that several algorithms of this type produce only locally optimal solutions and depend on the structure of the dataset. Partition-based algorithms include K-modes, Fuzzy K-modes [16], Squeezer [17], and COOLCAT [18].
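The loop just described can be sketched in a few lines. The toy K-modes-style implementation below uses simple-matching dissimilarity and a deterministic "first k distinct rows" initialization; both are illustrative simplifications rather than the exact published algorithm:

```python
from collections import Counter

def matching_dissim(a, b):
    """Simple-matching dissimilarity: count of mismatched attributes."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(data, k, max_iter=100):
    # toy deterministic initialization: first k distinct records
    modes = []
    for row in data:
        if row not in modes:
            modes.append(row)
        if len(modes) == k:
            break
    for _ in range(max_iter):
        # assignment step: each object goes to its nearest mode
        clusters = [[] for _ in range(k)]
        for row in data:
            j = min(range(k), key=lambda c: matching_dissim(row, modes[c]))
            clusters[j].append(row)
        # update step: recompute each mode attribute-wise
        new_modes = [
            tuple(Counter(col).most_common(1)[0][0] for col in zip(*cl))
            if cl else modes[j]
            for j, cl in enumerate(clusters)
        ]
        if new_modes == modes:        # representatives stable: converged
            break
        modes = new_modes
    return modes, clusters
```

Note that `k` must be supplied by the user, and a different initialization can converge to a different, locally optimal partition, mirroring the drawbacks discussed above.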

Density-based algorithms define clusters as subspaces in which the objects are dense, separated by subspaces of low density [19]. Implementing density-based algorithms for categorical data is challenging because the attribute values are unordered. Even though these algorithms can be fast, they may fail to cluster data with varying densities [20].
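One way to make the density notion concrete for unordered attribute values is to count neighbors under Hamming (simple-matching) distance. The sketch below is a generic DBSCAN-style procedure, not one of the cited algorithms; `eps` and `min_pts` are the usual user-chosen density parameters:

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def dbscan_categorical(data, eps, min_pts):
    """DBSCAN-style clustering with Hamming distance, usable for
    unordered categorical values. Returns a label per object (-1 = noise)."""
    n = len(data)
    labels = [None] * n
    # precompute eps-neighborhoods (each point neighbors itself)
    neighbors = [[j for j in range(n) if hamming(data[i], data[j]) <= eps]
                 for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1            # not dense enough; may be claimed later
            continue
        # expand a new cluster from this dense core point
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:       # border point: claim, but do not expand
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                frontier.extend(neighbors[j])
        cluster += 1
    return labels
```

A single global `eps` is exactly what makes varying-density data hard: a radius dense enough for one cluster may merge or fragment another.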

Hierarchical algorithms represent the data as a tree of nodes, where each node is a possible grouping of the data. There are two ways of clustering categorical data hierarchically: agglomerative (bottom-up) and divisive (top-down), the latter being less common. The main concept of an agglomerative algorithm is to use a similarity measure to gradually allocate the objects to the nodes of the tree. The main disadvantage of hierarchical clustering is its slow speed. Another problem is that clusters, once merged, cannot be separated again, so these algorithms may distort information. Categorical data hierarchical clustering algorithms include ROCK [21], LIMBO [22], and COBWEB [23].
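The bottom-up merging can be illustrated with a minimal single-linkage sketch over categorical rows; using Hamming distance as the similarity measure is an illustrative choice, not the measure of the cited algorithms:

```python
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def agglomerate(data, target_k):
    """Bottom-up merging: start with singleton clusters and repeatedly
    merge the closest pair (single linkage) until target_k clusters remain."""
    clusters = [[row] for row in data]
    while len(clusters) > target_k:
        # closest pair of clusters under single linkage over Hamming distance
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: min(hamming(a, b)
                              for a in clusters[p[0]]
                              for b in clusters[p[1]]))
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```

The all-pairs distance scan inside the loop is what makes hierarchical methods slow, and each `extend` is irreversible: a bad early merge propagates to the final result.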

Projected clustering algorithms build on the fact that in high-dimensional datasets clusters form over specific attribute subsets. In other words, each cluster is a subspace of the high-dimensional dataset defined by a subset of attributes relevant only to that cluster. The main issue with projected clustering algorithms is that they require user-specified parameters; if these parameters are inaccurate, the clustering will be poor. Projected clustering algorithms include CACTUS [24], CLICKS [25], STIRR [26], CLOPE [27], HIERDENC [28], and MULIC [29].
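To illustrate what a cluster-defining attribute subset might look like, the toy check below flags attributes dominated by a single value within a given cluster; both the 0.8 threshold and this relevance criterion are illustrative assumptions of the sketch, not taken from the cited algorithms:

```python
from collections import Counter

def relevant_attributes(cluster, threshold=0.8):
    """Flag attributes where one value covers >= threshold of the cluster's
    objects -- a toy notion of a cluster-defining subspace."""
    d = len(cluster[0])
    relevant = []
    for j in range(d):
        value, count = Counter(row[j] for row in cluster).most_common(1)[0]
        if count / len(cluster) >= threshold:
            relevant.append((j, value))
    return relevant
```

The hand-picked threshold is a small-scale example of the parameter sensitivity noted above: set it too high or too low and the recovered subspace, and hence the clustering built on it, degrades.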

More detailed presentations and comparisons of existing algorithms can be found in [30–32]. Summarizing, we can conclude that most existing algorithms strike some tradeoff between accuracy and speed. However, given the growing interest in analyzing categorical data in the social, behavioral, and biomedical sciences, we are primarily interested in highly accurate algorithms. Furthermore, the majority of the algorithms use some distance/similarity metric and define cluster representatives as a subroutine, while also requiring user-specified parameters. These factors can be seen as limitations when clustering categorical data. We therefore propose another approach to partitioning categorical data that tries to avoid these features. In the next section, we discuss the main characteristics of categorical data.
