**1. Introduction**

Many publications in the academic literature apply data mining methods to problems arising in blockchain systems. In their detailed review article, Liu et al. [1] divide these tasks into three categories: cryptocurrency price prediction, blockchain address deanonymization, and anomaly detection. This chapter presents a new clustering method for anomaly detection.

Blockchain platforms are often subject to a variety of malicious attacks [2]. Such actions can potentially be detected by analyzing patterns in transactions. Since the number of anomalous transactions is small, this problem has typically been addressed using either empirically derived rules or cluster analysis [3, 4]. As noted by Liu et al. [1], more research is required in this area, since most of the resulting models identified only a low proportion of the anomalies.

Clustering is one of the "super problems" in data mining. Generally speaking, clustering is the partitioning of data points into intuitively similar groups [5]. This definition is simple and does not capture the challenges that arise when cluster analysis is applied to real-world datasets. Nevertheless, this type of analysis is common in areas such as text mining, marketing research, customer behavior analysis, and financial market exploration. Many clustering algorithms have been introduced in the literature, each with its own advantages and disadvantages. Moreover, because data come in different forms, such as text, numeric, categorical, and image data, these algorithms perform differently in different scenarios. In other words, the performance of a particular clustering algorithm depends on the structure of the data under consideration.

Cluster analysis of numeric data is relatively well studied in the literature. Various approaches have been implemented, such as representative-based, hierarchical, density-based, graph-based, and probabilistic methods [6]. Recently, increasing attention has been paid to clustering non-numeric types of data, and the clustering of categorical data is an important topic. The difficulty is that the algorithms for categorical data clustering are mainly modifications of those introduced for numeric data. For instance, the most common algorithm is K-modes [7], which is an adaptation of the K-means [8] algorithm. Several researchers have developed algorithms specifically for categorical data, but there is still much room for new approaches.

The main problem with partitioning categorical data is that the standard operations used in clustering algorithms are not applicable. For instance, the distance between two objects with categorical attributes is not as straightforward to define as it is for numeric attributes. The reason is that categorical data take only discrete values, which, unlike continuous data, have no inherent order. Thus, the definition of distance in the case of categorical data is ambiguous, and researchers have either developed and used similarity measures [9–11] or applied different types of transformation [12]. Another problem is the assessment of cluster representatives, because many mathematical operations are not applicable to categorical data. For instance, it is impossible to compute the mean of a categorical feature. Given these limitations of existing algorithms, one may consider developing an algorithm that neither uses predefined distance or similarity measures as a key concept nor relies on representatives for assigning data points to clusters.
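To make the two workarounds concrete, the following minimal Python sketch shows a simple matching dissimilarity (a count of mismatched attributes) and a per-attribute mode used in place of a mean. It illustrates the general conventions associated with K-modes style methods, not the approach proposed here; the function names and the toy records are assumptions introduced purely for illustration.

```python
from collections import Counter


def simple_matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records disagree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Return the most frequent value of each attribute.

    The mode stands in for the mean, which is undefined for categorical data.
    """
    return [Counter(column).most_common(1)[0][0] for column in zip(*records)]


# Hypothetical toy records: (color, shape, size).
records = [
    ("red", "round", "small"),
    ("red", "oval", "small"),
    ("green", "round", "large"),
]

print(simple_matching_dissimilarity(records[0], records[1]))
print(column_modes(records))
```

Note that the dissimilarity treats every mismatch as equally large, which is one source of the ambiguity discussed above: with no order on the values, there is no natural way to say that one mismatch is "bigger" than another.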

This idea motivated us to develop the matching-based clustering algorithm. In this paper, we are not interested in improving a similarity measure or modifying existing algorithms. The key concept of the proposed algorithm is that two objects with categorical features are similar only if all of their features match. Thus, the algorithm is based on the similarity matrix. In addition, we employ a feature importance framework to choose which features to drop on each iteration until all the objects are clustered. Tests on the soybean disease dataset show that the algorithm is highly accurate and yields much better results than existing approaches.
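The idea of exact matching with iterative feature dropping can be sketched roughly as follows. This is only an illustrative approximation under stated assumptions, not the algorithm as defined in Section 4: the function name `matching_based_clusters` is hypothetical, the feature importance step is replaced by a user-supplied `drop_order` (least important feature first), and records left unmatched in one round are not merged into clusters formed in earlier rounds.

```python
from collections import defaultdict


def matching_based_clusters(records, drop_order):
    """Group records that agree on every active feature, relaxing one feature per round.

    drop_order lists feature indices from least to most important. It is a
    stand-in (assumption) for the feature importance framework.
    """
    active = list(range(len(records[0])))
    unassigned = list(range(len(records)))
    pending_drops = list(drop_order)
    clusters = []

    while unassigned:
        # Records are similar only if all active features match exactly.
        groups = defaultdict(list)
        for i in unassigned:
            key = tuple(records[i][f] for f in active)
            groups[key].append(i)

        # Accept groups of two or more records; once no feature is left to
        # drop, accept the remaining singletons as clusters of their own.
        final_round = not pending_drops or len(active) <= 1
        for members in groups.values():
            if len(members) > 1 or final_round:
                clusters.append(members)

        assigned = {i for cluster in clusters for i in cluster}
        unassigned = [i for i in unassigned if i not in assigned]
        if unassigned:
            active.remove(pending_drops.pop(0))

    return clusters


# Hypothetical toy data with three categorical features.
records = [
    ("a", "x", "p"),
    ("a", "x", "p"),
    ("a", "x", "q"),
    ("b", "y", "q"),
    ("b", "z", "q"),
]
clusters = matching_based_clusters(records, drop_order=[2, 1])
print(clusters)
```

In this toy run the first two records match on all features and are grouped immediately, while the last two are grouped only after the two less important features have been dropped. How the actual method ranks features and handles leftover objects may differ from this sketch.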

The rest of the paper is organized as follows. Section 2 briefly reviews the common categorical data clustering algorithms. Section 3 discusses categorical data and its limitations. Section 4 introduces the general framework of the matching-based clustering algorithm. Section 5 presents the experimental results on the soybean disease dataset. Finally, Section 6 summarizes our work and describes our future plans.
