Recent Applications in Data Clustering

That means, to measure the distance between the known value y<sub>i</sub> and the unknown value, the algorithm computes the expected distance between y<sub>i</sub> and all the possible values of the missing value. These computations do not take into account possible correlations between the missing values and the other known values (missing completely at random, MCAR), and the probability is computed according to the whole dataset. The resulting mean Euclidean distance will be:

$$MD\_E\left(m^i, y\_i\right) = E\left[\left(x - y\_i\right)^2\right] = \int p(x) \left(x - y\_i\right)^2 dx = \left(y\_i - \mu^i\right)^2 + \left(\sigma^i\right)^2, \tag{1}$$

where μ<sup>i</sup> and (σ<sup>i</sup>)<sup>2</sup> are the mean and the variance of all the known values of the attribute.

3. Both values are missing: In this case, in order to measure the distance, we should compute all the distances between each possible pair of values, one for each missing value x<sub>i</sub> and y<sub>i</sub>. Both these values are selected from the distribution χ<sup>i</sup> of the attribute. Then, we compute the expectation of the Euclidean distance between each selected pair, as we did for the one-missing-value problem. As a result, the distance is:

$$MD\_E\left(x\_i, y\_i\right) = \iint p(x)\,p(y) \left(x - y\right)^2 dx\,dy = \left(E[x] - E[y]\right)^2 + \sigma\_x^2 + \sigma\_y^2.$$

As x and y belong to the same attribute, E[x] = E[y] ≔ μ<sup>i</sup> and σ<sub>x</sub> = σ<sub>y</sub> ≔ σ<sup>i</sup>. Thus:

$$MD\_E\left(x\_i, y\_i\right) = 2\left(\sigma^i\right)^2. \tag{2}$$

As we mentioned, all these computations assume that the missing data are MCAR. However, in real-world datasets, the missing data are often MAR. In this case, the probability p(x) depends on the other observed values, and the distance is computed as:

$$MD\_E\left(m^i, y\_i\right) = \int p\left(x \mid x\_{obs}\right) \left(x - y\_i\right)^2 dx = \left(y\_i - \mu^i\_{x \mid x\_{obs}}\right)^2 + \left(\sigma^i\_{x \mid x\_{obs}}\right)^2,$$

where x<sub>obs</sub> denotes the observed attributes of point X, and the conditional mean and variance of the attribute given x<sub>obs</sub> replace μ<sup>i</sup> and (σ<sup>i</sup>)<sup>2</sup>, respectively.

On the other hand, in the case that the missing values are NMAR, the probability p(x) that was used in Eq. (1) will be computed based on this information, and then the distance will be:

$$MD\_E\left(m^i, y\_i\right) = \int p\left(x \mid m\_i\right) \left(x - y\_i\right)^2 dx = \left(y\_i - \mu^i\_{x \mid m\_i}\right)^2 + \left(\sigma^i\_{x \mid m\_i}\right)^2,$$

where p(x ∣ m<sub>i</sub>) is the distribution of x when x is missing.
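Under the MCAR assumption, Eqs. (1) and (2) reduce the per-attribute distance to closed forms that are straightforward to implement. The following Python sketch is our own illustration (the helper name `mde` and the toy values are not from the chapter); it assumes μ and σ² are computed from all known values of the attribute:

```python
def mde(a, b, mu, sigma2):
    """Per-attribute MDE distance; None marks a missing value.

    mu and sigma2 are the mean and variance of the known values of
    this attribute (the MCAR setting of Eqs. (1) and (2)).
    """
    if a is None and b is None:
        return 2.0 * sigma2                 # both missing: Eq. (2)
    if a is None:
        return (b - mu) ** 2 + sigma2       # one missing: Eq. (1)
    if b is None:
        return (a - mu) ** 2 + sigma2
    return (a - b) ** 2                     # both known: squared distance

# mu and sigma2 come from the known values of the attribute.
known = [1.0, 2.0, 3.0, 4.0]
mu = sum(known) / len(known)                             # 2.5
sigma2 = sum((v - mu) ** 2 for v in known) / len(known)  # 1.25

print(mde(2.0, 4.0, mu, sigma2))    # 4.0
print(mde(None, 4.0, mu, sigma2))   # (4 - 2.5)^2 + 1.25 = 3.5
print(mde(None, None, mu, sigma2))  # 2 * 1.25 = 2.5
```

Note that MDE here is a squared quantity; summing it over the coordinates and taking the square root yields the point-to-point distance used below.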

3. Mean computation

Since one of our goals is to develop a k-means clustering algorithm over incomplete datasets, we need to derive a formula to compute the mean of a given set that may contain incomplete points. We derive this formula based on our distance function MDE.

Let A ⊆ R<sup>K</sup> be a set of n points that may contain points with missing values. Then, the mean of this dataset is defined as:

$$\overline{\mathbf{x}} = \underset{\mathbf{x} \in \mathbb{R}^K}{\operatorname{arg\,min}} \sum\_{i=1}^{n} \left( \text{distance}(\mathbf{x}, p\_i) \right)^2,$$

where the minimum is taken over x ∈ R<sup>K</sup>, p<sub>i</sub> ∈ A denotes each point from the set A, and distance() is a distance function.

Let f(x) be the multidimensional function f : R<sup>K</sup> → R defined as:

$$f(\mathbf{x}) = \sum\_{i=1}^{n} \left( \text{distance}(\mathbf{x}, p\_i) \right)^2,$$

In our case, distance() = MDE. Thus,

$$f(\mathbf{x}) = \sum\_{i=1}^{n} \left( \text{distance}(\mathbf{x}, p\_i) \right)^2 = \sum\_{i=1}^{n} \left( \underbrace{\sqrt{\sum\_{j=1}^{K} MD\_E\left(\mathbf{x}^j, p\_i^j\right)}}\_{\text{the } MD\_E \text{ distance}} \right)^2 = \sum\_{i=1}^{n} \sum\_{j=1}^{K} MD\_E\left(\mathbf{x}^j, p\_i^j\right),$$

where x<sup>j</sup> is coordinate j of x and p<sub>i</sub><sup>j</sup> is coordinate j of point p<sub>i</sub>. Since each point p<sub>i</sub> may contain missing attributes, and according to the definition of the MDE distance in the previous section, f(x) will be:

$$f(\mathbf{x}) = \sum\_{j=1}^{K} \left[ \underbrace{\sum\_{i=1}^{n\_j} \left( \mathbf{x}^j - p\_i^j \right)^2}\_{n\_j \text{ known coordinates}} + \underbrace{\sum\_{i=1}^{m\_j} \left( \left( \mathbf{x}^j - \mu^j \right)^2 + \left( \sigma^j \right)^2 \right)}\_{m\_j \text{ missing coordinates}} \right].$$

The mean x̄ is the solution of f′(x) = 0 and, in the multidimensional case, of ∇f = 0, where

$$\nabla f = \left( f'\_{x^1}, f'\_{x^2}, \dots, f'\_{x^K} \right) = 0,$$

is the gradient of the function f. First, we deal with one coordinate and then generalize to the other coordinates.

$$\Rightarrow f'\_{\mathbf{x}^l} = 2\sum\_{i=1}^{n\_l} \left(\mathbf{x}^l - p\_i^l\right) + 2\sum\_{i=1}^{m\_l} \left(\mathbf{x}^l - \mu^l\right) = 0$$

$$\Rightarrow n\mathbf{x}^l = \sum\_{i=1}^{n\_l} p\_i^l + m\_l \mu^l \Rightarrow \mathbf{x}^l = \frac{\sum\_{i=1}^{n\_l} p\_i^l}{n} + \frac{m\_l \mu^l}{n}$$

$$\Rightarrow \mathbf{x}^l = \frac{n\_l}{n} \cdot \frac{\sum\_{i=1}^{n\_l} p\_i^l}{n\_l} + \frac{m\_l}{n} \mu^l = \frac{n\_l}{n} \mu^l + \frac{m\_l}{n} \mu^l = \mu^l,$$

since the mean of the known values of coordinate l is μ<sup>l</sup> by definition, and n<sub>l</sub> + m<sub>l</sub> = n.

Thus, we simply get:

$$\mathbf{x}^l = \mu^l. \tag{3}$$
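Eq. (3) can be checked numerically: minimizing f over one coordinate should return the mean of the known values. The following is a small self-contained check of our own, with made-up values:

```python
# Sanity check of Eq. (3) for one coordinate l:
# f(x) = sum over known values (x - p)^2
#      + sum over m_l missing values ((x - mu)^2 + sigma^2).
known = [2.0, 4.0, 9.0]              # n_l known values of the coordinate
m_l = 2                              # number of missing values
mu = sum(known) / len(known)         # mean of the known values: 5.0
sigma2 = sum((v - mu) ** 2 for v in known) / len(known)

def f(x):
    return sum((x - p) ** 2 for p in known) + m_l * ((x - mu) ** 2 + sigma2)

# Crude grid search for the minimizer of f over [0, 10).
x_star = min((i / 1000.0 for i in range(10000)), key=f)

print(x_star, mu)  # the minimizer coincides with the known-value mean
```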

Repeating this for all the coordinates yields x̄ = (μ<sup>1</sup>, μ<sup>2</sup>, …, μ<sup>K</sup>). In other words, each coordinate of the mean is the mean of the known values of that coordinate.

In the same way, we derive a formula for computing the weighted mean for each coordinate l, yielding:

$$\overline{\mathbf{x}}\_w^l = \frac{\sum\_{i=1}^{n\_l} w\_i x\_i^l + \sum\_{i=n\_l+1}^{n} w\_i \mu^l}{\sum\_{i=1}^{n} w\_i},$$

where w<sub>i</sub> is the weight of point x<sub>i</sub> (the points are ordered so that the first n<sub>l</sub> values of coordinate l are known). That is, in order to compute the weighted mean of a set of numbers, some of which are unknown, we must distinguish between known and unknown values. If a value is known, we multiply it by its weight; if a value is missing, we replace it with the mean of the known values and then multiply it by the matching weight.

The k-means clustering algorithm is constructed from two basic steps: (1) associate each point with its closest centroid and (2) update the centroids based on the new association. Given a dataset D that may contain points with missing values, in the first step the MDE distance is used to compute the distances between each data point and the centroids in order to associate each point with the closest centroid. This association step is common to all three versions. However, there are several possible ways to then compute the new centroids of the clusters. We use Figure 1(a) to illustrate those possibilities. In this example, we see two clusters (i.e., C1 was assigned to be the yellow cluster and C2 was assigned to be the brown cluster). Our goal is to calculate the center of each cluster. As an example, we deal only with C1. If none of the instances contain missing values, the centroid is computed based on the Euclidean mean formula, resulting in the magenta star.

However, when the points associated with a given cluster include incomplete points, it is not clear how to compute the mean. In the given example, let (x<sub>0</sub>, ?) (i.e., the red star) be a point with a missing y value and x = x<sub>0</sub>. This point was associated with C1's cluster using the MDE distance. It is important to note that we are able to associate incomplete points with the closest centroid, even though their geometric locations are unknown, because we use the MDE distance. On the other hand, using the MDE distance is similar to using the MA-method based on the Euclidean distance, where the point (x<sub>0</sub>, ?) is replaced with (x<sub>0</sub>, μ<sub>y</sub>). It is clear that the difference between the two methods is only the variance of the known values in coordinate y, a fixed value that does not influence the association result.

Figure 1. An example of computing the centroids for two clusters in a dataset with missing values. (a) shows the results of the different methods of computing the mean. (b) shows the Voronoi diagram.

Clustering Algorithms for Incomplete Datasets. http://dx.doi.org/10.5772/intechopen.78272

The naïve method to compute the new centroid is to replace the point with the missing value with all the possible points X′<sub>possible</sub> = {(x<sub>0</sub>, y<sub>p</sub>) | y<sub>p</sub> ∈ Y<sub>possible</sub>}, the set of all the possible points that satisfy x = x<sub>0</sub>.
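To make the two k-means steps concrete, here is a minimal sketch of MDE-based k-means on 2-D points, with `None` marking a missing coordinate. It is our own toy illustration, not the chapter's implementation; it uses dataset-wide attribute statistics for the distance and the known-value mean of Eq. (3) for the centroid update:

```python
def attr_stats(data, j):
    """Mean and variance of the known values of attribute j."""
    vals = [p[j] for p in data if p[j] is not None]
    mu = sum(vals) / len(vals)
    return mu, sum((v - mu) ** 2 for v in vals) / len(vals)

def mde_point(a, b, stats):
    """Sum of per-attribute MDE terms between points a and b."""
    d = 0.0
    for j, (mu, s2) in enumerate(stats):
        x, y = a[j], b[j]
        if x is None and y is None:
            d += 2.0 * s2                    # Eq. (2)
        elif x is None:
            d += (y - mu) ** 2 + s2          # Eq. (1)
        elif y is None:
            d += (x - mu) ** 2 + s2
        else:
            d += (x - y) ** 2
    return d

def centroid(cluster, K):
    """Eq. (3): each coordinate is the mean of its known values."""
    return tuple(
        sum(p[j] for p in cluster if p[j] is not None)
        / sum(1 for p in cluster if p[j] is not None)
        for j in range(K)
    )

data = [(0.0, 0.1), (0.2, None), (5.0, 5.2), (None, 5.0)]
stats = [attr_stats(data, j) for j in range(2)]
cents = [(0.0, 0.0), (5.0, 5.0)]

for _ in range(5):                           # a few k-means iterations
    clusters = [[] for _ in cents]
    for p in data:
        idx = min(range(len(cents)), key=lambda i: mde_point(p, cents[i], stats))
        clusters[idx].append(p)
    cents = [centroid(c, 2) for c in clusters]

print(cents)  # the two incomplete points join their nearby clusters
```

Note the design choice: the incomplete points influence only the coordinates they actually carry, exactly as Eq. (3) prescribes.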
