## 2.3 MacQueen (1967)

James MacQueen, from the Department of Statistics of the University of California, in his article titled "Some Methods for Classification and Analysis of Multivariate Observations" [6], proposed an algorithm for partitioning a set of instances into clusters such that the variance within each cluster is small. MacQueen coined the term K-means; the same algorithm has also been known under other names: the dynamic clustering method [13–15], iterative minimum-distance clustering [16], nearest centroid sorting [17], and h-means [18], among others.

## 2.4 Jancey (1966)

Jancey, from the Department of Botany, School of Biological Sciences, University of Sydney, in his article titled "Multidimensional Group Analysis" [19], presented a clustering method for characterizing the species *Phyllota phylicoides*; he conducted his research in the field of taxonomy. A variant of this method with similar characteristics was introduced by Forgy in the article "Cluster Analysis of Multivariate Data: Efficiency Versus Interpretability of Classification" [20]. The fundamental difference with respect to Jancey's work lies in the way the initial centroids are selected.
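That difference in initial-centroid selection can be illustrated with a short sketch. This is a common reading of the two approaches, not code taken from either article, and the function names are ours: Forgy draws the initial centroids from the data points themselves, whereas Jancey's variant is usually described as generating synthetic random points within the range of the data.

```python
import random

def forgy_init(points, k, rng=None):
    """Forgy-style initialization: pick k distinct data points as centroids."""
    rng = rng or random.Random(0)
    return rng.sample(points, k)

def jancey_init(points, k, rng=None):
    """Jancey-style initialization (as commonly described): generate k
    synthetic points uniformly within the bounding box of the data."""
    rng = rng or random.Random(0)
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    return [tuple(rng.uniform(lo[d], hi[d]) for d in range(dims))
            for _ in range(k)]

data = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (8.0, 8.0), (1.0, 0.6)]
print(forgy_init(data, 2))   # two of the original points
print(jancey_init(data, 2))  # two synthetic points inside the bounding box
```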

Because the results from Jancey's research will be used as a reference for this chapter, his algorithm will be described in detail. The author stated that the similarity measures are based on results published by the following authors: (a) Pearson, in his article titled "On the Coefficient of Racial Likeness" [21], published in 1926; (b) Rao, in the article titled "The Utilization of Multiple Measurements in Problems of Biological Classification" [22], published in 1948; and (c) Sokal, in his article titled "Distance as a Measure of Taxonomic Similarity" [23], published in 1961.

Pearson [21], in his article "On the Coefficient of Racial Likeness," while studying craniology and physical anthropology, confronted the difficulty of comparing two races in order to determine whether a limited number of individuals belonged to one race, to the other, or to both. As a result, Pearson proposed the coefficient of racial likeness (CRL). To calculate this coefficient, it is first necessary to obtain the mean and variance of each characteristic in each sample, since variability is assumed for each of the characteristics considered. The coefficient measures the dispersion around the mean and the degree of association between two variables.

The article published by Radhakrishna Rao [22] in the Journal of the Royal Statistical Society, titled "The Utilization of Multiple Measurements in Problems of Biological Classification," presented a statistical approach to two types of problems that arise in biological research. The first concerns determining whether an individual is a member of one of the many groups to which it might possibly belong. The second concerns the classification of groups into a system based on the configuration of their different characteristics.

Sokal [23] published his article titled "Distance as a Measure of Taxonomic Similarity," which deals with methods for quantifying the taxonomic classification process, and he points out the importance of having fast methods for processing and calculating data. The purpose of his work is to evaluate the similarities among taxa based on observed characteristic values, rather than on phylogenetic speculations and interpretations.

The similarity among objects is evaluated on the basis of many attributes, all of which are considered of equal taxonomic value; therefore, no attribute is weighted more or less than any other.

For evaluating the similarity between objects, three types of coefficients are used: association, correlation, and distance, of which the last is of interest for this study. The distance coefficient determines the similarity between two objects by means of a distance function in an n-dimensional space in which the coordinates represent the attributes.

A measure of similarity between objects 1 and 2 based on two attributes is the distance between the two objects in a two-dimensional space (i.e., a Cartesian plane). This distance $\delta_{1,2}$ can easily be calculated through the well-known formula from analytic geometry, Eq. (1):

$$\delta_{1,2} = \sqrt{(X_1 - X_2)^2 + (Y_1 - Y_2)^2} \tag{1}$$

where $X_1$ and $Y_1$ are the coordinates of object 1, and $X_2$ and $Y_2$ are the coordinates of object 2.

Similarly, when three attributes are needed for two different objects, the distance calculation must be carried out in a three-dimensional space so that the exact position of the two objects with respect to the three attributes can be represented. The distance between these two objects can be calculated by extending the formula for $\delta_{1,2}$ to three-dimensional space. When more than three dimensions are needed, it is no longer possible to represent the positions of the objects using conventional geometry, so it is necessary to resort to algebraic calculation. However, the distance formula from analytic geometry remains equally valid in an n-dimensional space.

The general formula for calculating the distance between two objects with n attributes is shown in Eq. (2):

$$\delta_{1,2}^{2} = \sum_{i=1}^{n} (X_{i1} - X_{i2})^{2} \tag{2}$$

where $X_{ij}$ is the value of attribute $i$ for object $j$ ($j = 1, 2$); note that Eq. (2) expresses the squared distance.
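Eq. (2) translates directly into code. A minimal illustration in Python (the function name is ours):

```python
def squared_distance(x1, x2):
    """Squared Euclidean distance between two objects with n attributes (Eq. 2)."""
    return sum((a - b) ** 2 for a, b in zip(x1, x2))

# Two objects described by three attributes each:
print(squared_distance((1.0, 2.0, 3.0), (4.0, 6.0, 3.0)))  # 3^2 + 4^2 + 0^2 = 25.0
```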

Once the object classification process is completed, the resulting matrix of similarity coefficients (based on inter-object distances) can be used in the usual clustering analysis methods.
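Such a matrix of distance-based similarity coefficients can be built by evaluating Eq. (1)/(2) over every pair of objects. A small sketch (generic illustration, not the original authors' code):

```python
import math

def distance_matrix(objects):
    """Matrix of pairwise Euclidean distances between all objects,
    usable as a similarity-coefficient matrix for clustering analysis."""
    n = len(objects)
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(objects[j], objects[k])))
             for k in range(n)]
            for j in range(n)]

objs = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
for row in distance_matrix(objs):
    print(row)  # symmetric matrix with zeros on the diagonal
```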

Finally, it is worth emphasizing that the distance between objects of different kinds can feasibly be computed as the summation of the squared differences of their attribute values.

The clustering method proposed by Jancey consists of the following four steps:


The K-Means Algorithm Evolution DOI: http://dx.doi.org/10.5772/intechopen.85447

Figure 1. Standard K-means algorithm.
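The standard K-means procedure of Figure 1 can be sketched as follows. This is a generic illustration under common assumptions (Forgy-style random initialization, squared Euclidean distance), not the original authors' code:

```python
import random

def kmeans(points, k, max_iter=100, rng=None):
    """Standard K-means: assign each point to its nearest centroid,
    recompute each centroid as its cluster mean, repeat until stable."""
    rng = rng or random.Random(0)
    centroids = rng.sample(points, k)  # Forgy-style start: k data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # converged: assignments no longer change
            break
        centroids = new
    return centroids, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
       (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(pts, 2)
print(sorted(centroids))
```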

