2. Our distance measure

Based on [3–6], there are three main types of missing data:

1. Missing completely at random (MCAR): when the missing value is not related to any other sample;

2. Missing at random (MAR): when the probability that a value is missing may depend on some known values but does not depend on the other missing values;

3. Not missing at random (NMAR): when the probability that a value is missing depends on the value that would have been observed.

There are two basic types of methods for dealing with incomplete datasets. (1) Deletion: methods in this category ignore all the incomplete instances. They may change the distribution of the data by decreasing the volume of the dataset [7]. (2) Imputation: these methods replace each missing value with a known value obtained from a statistical computation. They convert the incomplete data into complete data, so that existing machine learning algorithms can be run as if they were dealing with complete data.

One of the most common approaches in this domain is the mean imputation (MI) method, which replaces each incomplete data point with the mean of the data. This method has several obvious disadvantages: (a) using a fixed instance to replace all the incomplete instances changes the distribution of the original dataset, and (b) ignoring the relationships among attributes biases the performance of subsequent data mining algorithms. These problems arise because every incomplete instance is replaced with the same fixed one. A variant of this method instead replaces each missing value based only on the distribution of its attribute, that is, with the mean of the attribute (MA) rather than the mean of the whole instance [8]. When the values are discrete, the missing value is replaced by the most common value of the attribute (MCA) [9], that is, the unknown values of an attribute are filled with the value that occurs most often for that attribute. All these methods represent the missing value by a single value, ignoring the other possible values of the attribute and their distribution, which is wrong for real-world datasets.

Finally, the k-nearest neighbor imputation method [10, 11] estimates the values to be imputed from the k nearest neighbors, based only on the known values. The main obstacle of this method is its runtime complexity.

We can summarize the main drawbacks of these methods as: (1) inability to approximate the missing value well and (2) inefficiency in computing the suggested value. In our suggested method [12], the distance between two points, which may include missing values, is not only efficient to compute but also takes into account the distribution of each attribute. To do that, in the computation procedure we take into account all the possible values of the missing value together with their probabilities, which are derived from the attribute's distribution. This is in contrast to the MCA and MA methods, which replace each missing value only with the mode or the mean of the attribute.

There are three possible cases for a pair of values: (a) both of them are known, in which case the distance is computed as the Euclidean distance; (b) both of them are missing; and (c) one of them is missing.
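As a concrete illustration of the attribute-based imputation baselines discussed above, the following sketch fills numeric attributes with the attribute mean (MA) and discrete attributes with the most common value (MCA). This is only a minimal illustration of the baselines, not the chapter's method; the function and variable names are our own.

```python
from statistics import mean, mode

def impute_by_attribute(data, discrete_cols=frozenset()):
    """Column-wise imputation; None marks a missing value.
    Numeric attributes receive the attribute mean (MA); attributes
    listed in discrete_cols receive the most common value (MCA)."""
    n_cols = len(data[0])
    fills = []
    for j in range(n_cols):
        # Estimate the fill value from the known entries of attribute j only.
        known = [row[j] for row in data if row[j] is not None]
        fills.append(mode(known) if j in discrete_cols else mean(known))
    # Replace each missing entry with its attribute's fill value.
    return [[fills[j] if row[j] is None else row[j] for j in range(n_cols)]
            for row in data]

X = [[1.0, "a"], [3.0, "b"], [None, "b"], [2.0, None]]
print(impute_by_attribute(X, discrete_cols={1}))
```

Note that every missing entry of an attribute receives the same fill value, which is exactly the limitation described above: the distribution of the attribute's other possible values is discarded.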

First, we give a short overview of the basic distance function, developed by [2], that can compute distances between points with missing values.

Let A ⊆ R<sup>K</sup> be a set of points. For the ith attribute A<sup>i</sup>, the probability distribution χ<sup>i</sup> of the ith coordinate is computed according to the known values of this attribute in A (i.e., A<sup>i</sup> ∼ χ<sup>i</sup>).

Given two sample points X, Y ∈ R<sup>K</sup>, the goal is to compute the distance between them. Let x<sup>i</sup> and y<sup>i</sup> be the ith coordinate values of points X and Y, respectively. There are three possible cases for the values of x<sup>i</sup> and y<sup>i</sup>:


1. Both values are known: the distance between them is computed as in the standard Euclidean distance.

2. One of the values is missing: suppose that x<sup>i</sup> is missing, and denote it by m<sup>i</sup>. We consider all the values that m<sup>i</sup> could take, weighted according to the distribution χ<sup>i</sup>. Therefore, we approximate the mean Euclidean distance (MDE) between y<sup>i</sup> and the missing value m<sup>i</sup> as:


$$MD_E\left(m^i, y^i\right) = E\left[\left(x - y^i\right)^2\right] = \int p(x)\left(x - y^i\right)^2 dx = \left(y^i - \mu^i\right)^2 + \left(\sigma^i\right)^2.$$

That means that, to measure the distance between a known value y<sup>i</sup> and an unknown value, the algorithm computes the expectation of the distance between y<sup>i</sup> and all the possible values of the missing value. This computation does not take into account possible correlations between the missing value and the other known values (i.e., the data are assumed to be missing completely at random, MCAR), and the probability is computed according to the whole dataset. The resulting mean Euclidean distance is:

$$MD_E\left(m^i, y^i\right) = \left(y^i - \mu^i\right)^2 + \left(\sigma^i\right)^2,\tag{1}$$

where μ<sup>i</sup> and (σ<sup>i</sup>)<sup>2</sup> are the mean and the variance of all the known values of the attribute.

3. Both values are missing: in this case, in order to measure the distance, we should compute the distances between all the possible pairs of values, one for each missing value x<sup>i</sup> and y<sup>i</sup>, where both values are drawn from the distribution χ<sup>i</sup>. Then, we compute the expectation of the Euclidean distance over these pairs, as we did for the one-missing-value case. As a result, the distance is:

$$MD_E\left(x^i, y^i\right) = \iint p(x)\,p(y)\,(x - y)^2\, dx\, dy = \left(E[x] - E[y]\right)^2 + \sigma_x^2 + \sigma_y^2.$$

As x and y belong to the same attribute, E[x] = E[y] ≔ μ<sup>i</sup> and σ<sub>x</sub> = σ<sub>y</sub> ≔ σ<sup>i</sup>. Thus:

$$MD_E\left(x^i, y^i\right) = 2\left(\sigma^i\right)^2.\tag{2}$$

As we mentioned, all these computations assume that the missing data are MCAR. However, in real-world datasets the missing data are often MAR. In this case, the probability p(x) depends on the other observed values, and the distance is computed as:

$$MD_E\left(m^i, y^i\right) = \int p\left(x \mid x_{obs}\right)\left(x - y^i\right)^2 dx = \left(y^i - \mu^i_{x \mid x_{obs}}\right)^2 + \left(\sigma^i_{x \mid x_{obs}}\right)^2,$$

where x<sub>obs</sub> denotes the observed attributes of point X, and the μ and σ<sup>2</sup> terms in the equation are the corresponding conditional mean and variance, respectively.

On the other hand, when the missing values are NMAR, the probability p(x) used in Eq. (1) is computed based on the missingness itself, and the distance becomes:

$$MD_E\left(m^i, y^i\right) = \int p\left(x \mid m^i\right)\left(x - y^i\right)^2 dx = \left(y^i - \mu^i_{x \mid m^i}\right)^2 + \left(\sigma^i_{x \mid m^i}\right)^2,$$

where p(x ∣ m<sup>i</sup>) is the distribution of x when x is missing.

3. Mean computation

Since one of our goals is to develop a k-means clustering algorithm over incomplete datasets, we need a formula for the mean of a given set that may contain incomplete points. We derive this formula based on our distance function MDE.

Let A ⊆ R<sup>K</sup> be a set of n points that may contain points with missing values. Then, the mean of this dataset is defined as:

$$\overline{x} = \operatorname*{argmin}_{x \in R^K} \sum_{i=1}^{n} distance\left(x, p_i\right)^2,$$

where p<sub>i</sub> ∈ A denotes each point from the set A, and distance(·) is a distance function. In our case, distance(·) = MD<sub>E</sub>. Thus,

$$distance\left(x, p_i\right) = \sqrt{\sum_{j=1}^{K} MD_E\left(x^j, p_i^j\right)},$$

where x<sup>j</sup> is the coordinate j of x and p<sub>i</sub><sup>j</sup> is the coordinate j of point p<sub>i</sub>.

Let f(x) be a multidimensional function f : R<sup>K</sup> → R defined as:

$$f(x) = \sum_{i=1}^{n} distance\left(x, p_i\right)^2 = \sum_{i=1}^{n} \sum_{j=1}^{K} MD_E\left(x^j, p_i^j\right).$$

Since each point p<sub>i</sub> may contain missing attributes, according to the definition of the MDE distance in the previous section, f(x) will be:

$$f(x) = \sum_{j=1}^{K} \left( \sum_{i=1}^{n_j} \left(x^j - p_i^j\right)^2 + \sum_{i=1}^{m_j} \left(\left(x^j - \mu^j\right)^2 + \left(\sigma^j\right)^2\right) \right),$$

where n<sub>j</sub> and m<sub>j</sub> denote the numbers of points whose jth coordinate is known and missing, respectively. In the one-dimensional case, the mean x̄ is the solution of f′(x) = 0, and in the multidimensional case, x̄ is the solution of ∇f = (f′<sub>x<sup>1</sup></sub>, f′<sub>x<sup>2</sup></sub>, …, f′<sub>x<sup>K</sup></sub>) = 0, where ∇f is the gradient of f. First, we deal with one coordinate, and then we generalize to the other coordinates.
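The MCAR version of the distance and the mean can be sketched as follows. Per coordinate, the squared MDE contribution is (x − y)² when both values are known, (y − μ)² + σ² when one is missing (Eq. (1)), and 2σ² when both are missing (Eq. (2)). Setting f′(x<sup>j</sup>) = 0 in the per-coordinate expression for f gives the closed form x̄<sup>j</sup> = (Σ known j-values + m<sub>j</sub>·μ<sup>j</sup>)/n, which the mean function below implements. The function names are ours; this is a sketch under the MCAR assumption, not the chapter's reference implementation.

```python
from math import sqrt

def mde_sq(x, y, mu, var):
    """Squared MDE contribution of one coordinate; None marks a missing value.
    mu and var are the mean and variance of the attribute's known values."""
    if x is not None and y is not None:
        return (x - y) ** 2              # both known: Euclidean term
    if x is None and y is None:
        return 2 * var                   # Eq. (2): both values missing
    known = y if x is None else x
    return (known - mu) ** 2 + var       # Eq. (1): one value missing

def mde_distance(p, q, mus, variances):
    """MDE distance between two points, given per-attribute statistics."""
    return sqrt(sum(mde_sq(p[j], q[j], mus[j], variances[j])
                    for j in range(len(p))))

def mde_mean(points, mus):
    """Coordinate-wise minimiser of f(x) = sum of squared MDE distances:
    x^j = (sum of known j-values + m_j * mu^j) / n."""
    n = len(points)
    out = []
    for j in range(len(points[0])):
        known = [p[j] for p in points if p[j] is not None]
        out.append((sum(known) + (n - len(known)) * mus[j]) / n)
    return out
```

Here `mus[j]` and `variances[j]` are estimated once from all the known values of attribute j in the dataset, matching the MCAR assumption that the missing value follows the attribute's overall distribution χ<sup>j</sup>.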
