4. k-Means clustering using the MDE distance

Based on the derived formulas (the MDE distance and the mean), our aim in this research is to develop k-means clustering algorithms for incomplete datasets [1].

The MDE distance and the mean are general and can be integrated within any algorithm that computes distances or means. In this section, we describe our proposed method for integrating these formulas within the framework of the k-means clustering algorithm.

We developed three different versions of k-means. For simplicity, we assume that all the points come from R^2. There are two ways to view incomplete points. The first considers each incomplete point as a single point; this version is similar to the GMM algorithm described in [14, 15]. The second is to replace each incomplete point with a set of points according to the data distribution (the approach taken by the other two methods). As will be shown in our experiments, these outperform the first algorithm.

The k-means clustering algorithm consists of two basic steps: (1) associate each point with its closest centroid, and then (2) update the centroids based on the new association, using Eq. (1). We are given a dataset D that may contain points with missing values. In the first step, the MDE distance is used to compute the distances between each data point and the centroids, in order to associate each point with its closest centroid. This association step is the same for all three versions. However, there are several possible ways to then compute the new centroids of the clusters. We use Figure 1(a) to illustrate these possibilities. In this example, we see two clusters (C1 was assigned to be the yellow cluster and C2 was assigned to be the brown cluster). Our goal is to calculate the center of each cluster. As an example, we will deal only with C1. If none of the instances contain missing values, the centroid is computed using the Euclidean mean formula, resulting in the magenta star.
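To make these two steps concrete, the following minimal Python sketch runs the association and update steps on points whose missing values are marked with NaN. It is our own illustration, not the chapter's code: it assumes the MDE squared distance adds, for each missing coordinate, the squared difference between the centroid and that coordinate's known-value mean, plus the known-value variance (consistent with the comparison to the MA method below), and it updates centroids by replacing missing values with the known-value mean, as derived for Eq. (3).

```python
import numpy as np

def mde_sq_dist(x, c, mu, var):
    """MDE-style squared distance between a possibly incomplete point x
    (missing values marked with NaN) and a complete centroid c.
    mu/var hold the mean and variance of the known values per coordinate."""
    miss = np.isnan(x)
    d = np.sum((x[~miss] - c[~miss]) ** 2)               # known coordinates: Euclidean
    d += np.sum((mu[miss] - c[miss]) ** 2 + var[miss])   # missing: mean distance + variance
    return d

def kmeans_mde(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.nanmean(X, axis=0)           # per-coordinate mean of known values
    var = np.nanvar(X, axis=0)           # per-coordinate variance of known values
    C = X[rng.choice(len(X), k, replace=False)]
    C = np.where(np.isnan(C), mu, C)     # make the initial centroids complete
    for _ in range(n_iter):
        # Step 1: associate each point with its closest centroid (MDE distance).
        labels = np.array([np.argmin([mde_sq_dist(x, c, mu, var) for c in C])
                           for x in X])
        # Step 2: update each centroid; missing values are replaced by the
        # mean of the known values in that coordinate (see Eq. (3)).
        for j in range(k):
            P = np.where(np.isnan(X[labels == j]), mu, X[labels == j])
            if len(P):
                C[j] = P.mean(axis=0)
    return C, labels
```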

However, when the points associated with a given cluster include incomplete points, it is not clear how to compute the mean. In the given example, let (x_0, ?) (i.e., the red star) be a point with a missing y value and x = x_0. This point was associated with C1's cluster using the MDE distance. It is important to note that we are able to associate incomplete points with the closest centroid, even though their geometric locations are unknown, since we use the MDE distance.

For the association step, using the MDE distance is similar to using the MA method based on the Euclidean distance, in which the point (x_0, ?) is replaced with (x_0, μ_y). It is clear that the difference between the two methods is only the variance of the known values in coordinate y, a fixed value that does not influence the association result.

Before turning to the centroid-update rules, we recall how the mean of a set of n points with missing values is computed. For each coordinate l, let p^l_1, …, p^l_{n_l} be the n_l known values of that coordinate, let μ^l be their mean, and let m_l = n − n_l be the number of missing values. The coordinate x^l of the mean minimizes the sum of the squared distances to the points; differentiating and setting the derivative to zero,

$$f'(x^l) = 2 \sum_{i=1}^{n_l} \left( x^l - p^l_i \right) + 2 \sum_{i=1}^{m_l} \left( x^l - \mu^l \right) = 0,$$

yielding:

$$n x^l = \sum_{i=1}^{n_l} p^l_i + m_l \mu^l \;\Rightarrow\; x^l = \frac{\sum_{i=1}^{n_l} p^l_i + m_l \mu^l}{n} = \frac{n_l + n - n_l}{n}\, \mu^l,$$

since the sum of the known values equals n_l μ^l. Thus, we simply get:

$$x^l = \mu^l.$$

Repeating this for all the coordinates yields x = (μ^1, μ^2, …, μ^k). In other words, each coordinate of the mean is the mean of the known values of that coordinate.

In the same way, we derive a formula for computing the weighted mean for each coordinate l:

$$\bar{x}^l_w = \frac{\sum_{i=1}^{n_l} w_i x^l_i + \sum_{i=n_l+1}^{n} w_i \mu^l}{\sum_{i=1}^{n} w_i}, \tag{3}$$

where w_i is the weight of point x_i. This means that, in order to compute the weighted mean of a set of numbers, some of which are unknown, we must distinguish between known and unknown values: if a value is known, we multiply it by its weight; if it is missing, we replace it with the mean of the known values and then multiply it by the matching weight.

Figure 1. An example for computing the centroids for two clusters in a dataset with missing values. (a) shows the results of the different methods of computing the mean. (b) shows the Voronoi diagram.

The naïve method to compute the new centroid is to replace the point with the missing value with all the possible points

$$(x_0)_{\text{possible}} = \left\{ \left( x_0, y_p \right) \mid y_p \in Y_{\text{possible}} \right\},$$

the set of all the possible points that satisfy x = x_0, where

$$Y_{\text{possible}} = \left\{ y \in \mathbb{R} \mid \exists (x, y) \in D \right\}$$

denotes all the possible values for attribute Y. The mean is then computed according to these points (C1_real and (x_0)_possible), where each point from C1_real has weight one and each point from (x_0)_possible has weight 1/|Y_possible|, and where

$$C1_{\text{real}} = \left\{ (x, y) \in D \mid (x, y) \in C1 \right\}$$

is the set of all the data points without missing values that are associated with the C1 cluster. As a result, the weighted mean of C1 is:

$$\text{mean}(C1) = \frac{\sum_{(x,y) \in C1_{\text{real}}} (x, y) + \left( x_0, \mu_y \right)}{|C1_{\text{real}}| + \sum_{y_p \in Y_{\text{possible}}} \frac{1}{|Y_{\text{possible}}|}}. \tag{4}$$


This is identical to the Euclidean mean when the missing point is replaced with (x_0, μ_y), and is equivalent to the MA method when (x_0, μ_y) is associated with C1. As a result, the real centroid of the cluster (the magenta star) moves to the green star, as described in Figure 1(b), where not all of the blue "+" marks belong to C1.
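This equivalence is easy to check numerically. The sketch below is our own illustration: it computes a coordinate's weighted mean per Eq. (3), and shows that imputing the missing value with the known-value mean μ_y gives the same result as expanding the incomplete point into |Y_possible| points of weight 1/|Y_possible|, as in Eq. (4).

```python
import numpy as np

def weighted_mean(vals, weights):
    """Weighted mean of one coordinate per Eq. (3): known values keep their
    weights; a missing value (NaN) is replaced by the mean of the known
    values before being weighted."""
    vals = np.asarray(vals, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mu = np.nanmean(vals)                        # mean of the known values
    filled = np.where(np.isnan(vals), mu, vals)
    return np.sum(weights * filled) / np.sum(weights)

y_known = [1.0, 2.0, 3.0]
# One incomplete point, imputed with mu_y (the MA view):
a = weighted_mean(y_known + [np.nan], [1, 1, 1, 1])
# The same point expanded into |Y_possible| points of weight 1/|Y_possible|:
b = weighted_mean(y_known + y_known, [1, 1, 1, 1/3, 1/3, 1/3])
print(a, b)  # both print 2.0
```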

As a result, the mean computation must distinguish between two possible methods. The first method (which we call k-means-MDE) takes into account all the possible points whose y coordinates are the y coordinates of the real data points from the yellow cluster, in addition to the real points within the yellow circle. As a result, the mean of this set will be computed based on all the real points C1_real and C1_(x_0)_possible, where

$$C1_{(x_0)_{\text{possible}}} = \left\{ \left( x_0, y_p \right) \in (x_0)_{\text{possible}} \mid \exists (x, y) \in C1_{\text{real}} \wedge y = y_p \right\}.$$

Computing the new centroid using Eq. (3) not only yields the same centroid as using the Euclidean distance, but also preserves the runtime of the standard k-means using the Euclidean distance.

The second method: in this case, we first associate each of the points from (x_0)_possible with its nearest center, and only then compute a weighted mean. This means that, to compute the mean, we take into account all the real points C1_real, in addition to PC1_possible, where

$$PC1_{\text{possible}} = \left\{ \left( x_0, y_p \right) \in (x_0)_{\text{possible}} \mid \left( x_0, y_p \right) \in C1 \right\}.$$

According to this method, we use all the points from (x_0)_possible that are associated with the C1 cluster, and not only the points from (x_0)_possible whose y coordinates come from the real points associated with that cluster. Since the weights are computed using the entire dataset, we cannot use Eq. (3). To this end, our suggested method for implementing the mean computation is simply to replace each point with a missing value with the |Y_possible| points, each with weight 1/|Y_possible|, and run weighted k-means on the new dataset. This method, on the one hand, is simple to implement; on the other hand, its runtime is high, since each point with, for example, a missing y value will be replaced with all |Y_possible| points. As a result, the size of the dataset will be:

$$|D_{\text{real}}| + \left( |D| - |D_{\text{real}}| \right) \cdot |Att_{\text{possible}}|,$$

where D_real is the set of data points that do not contain missing values. In order to reduce the runtime complexity, we use a Voronoi diagram. Based on the Voronoi diagram, the data space is partitioned into k subspaces (as can be seen in Figure 1(b)), and each point is associated with the subspace of the cluster in which it lies.
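As an illustration of the dataset expansion described above, here is a short sketch; the function name expand_incomplete and the NaN convention for missing values are ours, not the chapter's. It replaces each point with a missing y value by all |Y_possible| points, each carrying weight 1/|Y_possible|, ready for a weighted k-means run.

```python
import numpy as np

def expand_incomplete(D):
    """Expand a 2-D dataset D (rows (x, y), missing y marked NaN):
    each incomplete point becomes |Y_possible| weighted points (x0, y_p);
    complete points keep weight 1. Returns (points, weights)."""
    y_possible = np.unique(D[~np.isnan(D[:, 1]), 1])   # all observed y values
    pts, w = [], []
    for x, y in D:
        if np.isnan(y):
            for yp in y_possible:
                pts.append((x, yp))
                w.append(1.0 / len(y_possible))
        else:
            pts.append((x, y))
            w.append(1.0)
    return np.array(pts), np.array(w)
```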

The third possibility is to divide the y value space into several disjoint intervals, where each interval is represented by its mean and the weight of each interval is the ratio between the number of points in the interval and the number of all possible points. We call this method k-means-HistMDE; it approximates the two methods mentioned before that compute the weighted mean.
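A sketch of the interval construction behind k-means-HistMDE might look as follows; the choice of equal-width bins via np.histogram_bin_edges is our assumption, since the chapter does not specify how the intervals are chosen.

```python
import numpy as np

def hist_intervals(D, n_bins=10):
    """Divide the observed y values into disjoint intervals; each interval is
    represented by its mean, weighted by its share of all possible points."""
    y_known = D[~np.isnan(D[:, 1]), 1]
    edges = np.histogram_bin_edges(y_known, bins=n_bins)
    # Map each known value to its bin (values equal to the last edge fall
    # into the last bin).
    idx = np.clip(np.digitize(y_known, edges) - 1, 0, n_bins - 1)
    reps, weights = [], []
    for b in range(n_bins):
        in_bin = y_known[idx == b]
        if in_bin.size:
            reps.append(in_bin.mean())               # interval representative
            weights.append(in_bin.size / y_known.size)  # interval weight
    return np.array(reps), np.array(weights)
```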

In conclusion, we have three methods:


• The naïve method, which is equivalent to the MA method.
• k-means-MDE
• k-means-HistMDE

These methods differ in their performance, efficiency, and the way they work.

5. Mean shift algorithm

In this section, we describe another use case that integrates the derived distance function MDE within the framework of the mean shift clustering algorithm. First, we give a short overview of the mean shift algorithm, and then we describe how we use the MDE distance in this algorithm. Here, we only review some of the results described in [16, 17], which should be consulted for the details. Let x_i ∈ R^d, i = 1, …, n, be the given data points, where each point is associated with a bandwidth value h > 0. The sample point density estimator at point x,

$$\hat{f}(x) = \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right), \tag{5}$$

based on a symmetric kernel K with bounded support satisfying

$$K(x) = c_{k,d}\, k\!\left( \lVert x \rVert^2 \right), \qquad \lVert x \rVert \le 1, \tag{6}$$

is a nonparametric estimator of the density at x in the feature space, where k(x), 0 ≤ x ≤ 1, is the profile of the kernel and the normalization constant c_{k,d} assures that K(x) integrates to one. As a result, the density estimator Eq. (5) can be rewritten as

$$\hat{f}_{h,K}(x) = \frac{c_{k,d}}{nh^d} \sum_{i=1}^{n} k\!\left( \left\lVert \frac{x - x_i}{h} \right\rVert^2 \right).$$
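To ground this notation, here is a small sketch of the profile-form density estimator above; it is our own illustration, the Epanechnikov profile k(u) = 1 − u is our choice for the sake of the example, and the normalization constant c_{k,d} is omitted, which does not affect where the density modes lie.

```python
import numpy as np

def density_estimate(x, X, h):
    """Kernel density estimate at x using the profile form of Eq. (5):
    f(x) ∝ 1/(n h^d) * sum_i k(||(x - x_i)/h||^2), with the Epanechnikov
    profile k(u) = 1 - u for 0 <= u <= 1 (and 0 otherwise)."""
    n, d = X.shape
    u = np.sum(((x - X) / h) ** 2, axis=1)        # ||(x - x_i)/h||^2 per point
    k = np.where(u <= 1.0, 1.0 - u, 0.0)          # bounded-support profile
    return k.sum() / (n * h ** d)                 # c_{k,d} factor omitted
```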
