2.5 K-means algorithm

K-means is an iterative method that partitions a set of n objects into k ≥ 2 clusters, such that the objects in a cluster are similar to each other and different from those in other clusters. The following paragraphs formalize the clustering problem associated with K-means.

Let N = {x1, …, xn} be the set of n objects to be clustered by a similarity criterion, where xi ∈ ℜ<sup>d</sup> for i = 1, …, n and d ≥ 1 is the number of dimensions. Additionally, let k ≥ 2 be an integer and K = {1, …, k}. For a k-partition Ρ = {G(1), …, G(k)} of N, let μj denote the centroid of cluster G(j), for j ∈ K, and let M = {μ1, …, μk} and W = {w11, …, wnk}.

With this notation, the clustering problem can be formulated as an optimization problem [24], which is described by Eq. (3):

$$\mathcal{P}: \text{minimize } z(W, M) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij} \, d\left(\mathbf{x}_{i}, \mu_{j}\right) \tag{3}$$

$$\text{subject to } \sum_{j=1}^{k} w_{ij} = 1, \text{ for } i = 1, \dots, n,$$

$$w_{ij} = 0 \text{ or } 1, \text{ for } i = 1, \dots, n, \text{ and } j = 1, \dots, k,$$

where wij = 1 implies that object xi belongs to cluster G(j), and d(xi, μj) denotes the Euclidean distance between xi and μj, for i = 1, …, n and j = 1, …, k.
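To make the objective of Eq. (3) concrete, the following is a minimal Python sketch that evaluates z(W, M) for given objects, centroids, and assignments. The function name `objective`, the use of NumPy, and the array layout (objects as rows of `X`, centroids as rows of `M`, `W` as an n × k matrix of zeros and ones) are assumptions made for illustration only.

```python
import numpy as np

def objective(X, M, W):
    """Evaluate z(W, M) = sum_i sum_j w_ij * d(x_i, mu_j), with d Euclidean."""
    X = np.asarray(X, dtype=float)   # shape (n, d): one object per row
    M = np.asarray(M, dtype=float)   # shape (k, d): one centroid per row
    W = np.asarray(W)                # shape (n, k): assignment matrix

    # Constraints of problem P: entries are 0 or 1 and each row sums to 1,
    # i.e., every object belongs to exactly one cluster.
    if not np.isin(W, (0, 1)).all() or not (W.sum(axis=1) == 1).all():
        raise ValueError("W must be binary with exactly one 1 per row")

    # D[i, j] is the Euclidean distance between object x_i and centroid mu_j.
    D = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
    return float((W * D).sum())
```

For example, with objects (0, 0), (0, 2), and (10, 0), centroids (0, 1) and (10, 0), and the first two objects assigned to the first cluster, the distances are 1, 1, and 0, so `objective` returns 2.0.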

The standard version of the K-means algorithm consists of four steps, as shown in Figure 1.

The pseudocode of the standard K-means algorithm is shown in Algorithm 1.

Algorithm 1. Standard K-means algorithm
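As an illustration, the standard K-means iteration can be sketched in Python as follows. This is a minimal sketch and not a transcription of Algorithm 1: it assumes that the four steps are the usual initialization, classification, centroid calculation, and convergence test, that the initial centroids are k distinct objects chosen at random, and that an empty cluster keeps its previous centroid. The function name `kmeans` and the parameters `max_iter` and `seed` are hypothetical.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Sketch of standard K-means; returns (centroids, labels, iterations)."""
    X = np.asarray(X, dtype=float)            # shape (n, d)
    n = X.shape[0]
    rng = np.random.default_rng(seed)

    # Initialization: choose k distinct objects as the initial centroids.
    M = X[rng.choice(n, size=k, replace=False)].copy()
    labels = np.full(n, -1)                   # no object is assigned yet

    for r in range(1, max_iter + 1):
        # Classification: assign each object to its nearest centroid,
        # which requires all n * k object-centroid distances.
        D = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
        new_labels = D.argmin(axis=1)

        # Convergence: stop when no object changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Centroid calculation: mean of the objects in each cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:              # assumption: keep old centroid if empty
                M[j] = members.mean(axis=0)

    return M, labels, r
```

Calling `kmeans(X, 2)` on two well-separated groups of points returns one centroid per group. Other stopping rules, such as a threshold on the change of the objective function, are equally common and could replace the test used here.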



Since the pioneering studies conducted by Steinhaus [11], Lloyd [12], MacQueen [6], and Jancey [19], many investigations have been aimed at finding a k-partition of N that solves problem P, defined by Eq. (3).

It has been shown that the clustering problem is NP-hard for k ≥ 2 or d ≥ 2 [25, 26]. Therefore, obtaining an optimal solution even for an instance of moderate size is generally intractable. Consequently, a variety of heuristic algorithms have been proposed to obtain solutions as close as possible to the optimum of P, the most important of which are the K-means-type algorithms [6].
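NP-hardness is a statement about worst-case complexity rather than about the size of the search space alone, but counting the candidate solutions gives some intuition for why exhaustive enumeration is impractical. The number of ways to partition n objects into k non-empty clusters is the Stirling number of the second kind, S(n, k). The sketch below computes it with the standard recurrence; the function name `stirling2` is chosen for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of partitions of n objects into k non-empty clusters."""
    if n == k:
        return 1                      # includes the convention S(0, 0) = 1
    if k == 0 or k > n:
        return 0
    # The n-th object either joins one of the k clusters formed by the
    # other n - 1 objects, or forms a new cluster on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)
```

For instance, `stirling2(10, 3)` is 9330, and the count grows roughly like k^n / k! as n increases, so enumerating all k-partitions is feasible only for very small instances.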

It is important to emphasize that establishing useful bounds on the gap between the optimal solution of problem P and the solution achieved by K-means remains an open research problem.

The computational complexity of K-means is O(nkdr), where r represents the number of iterations [8, 9]. This restricts its use for large instances, because each iteration involves the calculation of all n·k object-centroid distances. Numerous investigations have therefore pursued different strategies to reduce the computational cost of K-means while still minimizing the objective function.
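The O(nkdr) bound can be read as a simple operation count. The following sketch, with the hypothetical helper `kmeans_cost`, counts the object-centroid distances computed over r iterations and the coordinate-level operations they imply. It ignores constant factors and the cost of the centroid update, so it is only an order-of-magnitude estimate.

```python
def kmeans_cost(n, k, d, r):
    """Rough operation counts for r iterations of standard K-means."""
    distances = n * k * r             # object-centroid distances over all iterations
    coordinate_ops = distances * d    # each Euclidean distance touches d coordinates
    return distances, coordinate_ops
```

For an instance with n = 1,000,000 objects, k = 100 clusters, d = 10 dimensions, and r = 50 iterations, `kmeans_cost` returns 5 × 10^9 distance calculations and 5 × 10^10 coordinate operations.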
