1. Introduction

Missing values in data are common in real-world applications. They can be caused by human error, equipment failure, system-generated errors, and so on.

In this research, we developed versions of two popular clustering algorithms that run on incomplete datasets: (1) the k-means clustering algorithm [1] and (2) the mean shift clustering algorithm [2].

Based on [3–6], there are three main types of missing data:

1. Missing completely at random (MCAR): when the probability that a value is missing does not depend on any other value in the data;

2. Missing at random (MAR): when the probability that a value is missing depends only on the observed values of the other attributes;

3. Not missing at random (NMAR): when the probability that a value is missing depends on the missing value itself.

There are two basic types of methods for dealing with incomplete datasets. (1) Deletion: methods in this category ignore all the incomplete instances; they may change the distribution of the data by decreasing the volume of the dataset [7]. (2) Imputation: in these methods, the missing values are replaced with known values according to a statistical computation. These methods convert the incomplete data to complete data, so that existing machine learning algorithms can be run as if they were dealing with complete data.

One of the most common approaches in this domain is the mean imputation (MI) method, which replaces each incomplete data point with the mean of the data. This method has several obvious disadvantages: (a) using a fixed instance to replace all the incomplete instances changes the distribution of the original dataset, and (b) ignoring the relationships among attributes biases the performance of subsequent data mining algorithms. These problems arise because all the incomplete instances are replaced with a single fixed one. A variant of this method replaces each missing value based only on the distribution of its attribute rather than of the whole instance; that is, the algorithm replaces each missing value with the mean of its attribute (MA) [8]. When the values are discrete, the missing value is replaced by the most common value of the attribute (MCA) [9] (i.e., the unknown values of an attribute are filled with the value that occurs most often for that attribute). All of these methods ignore the other possible values of the attribute and their distribution, and represent the missing value with a single value, which is wrong in real-world datasets.
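As a concrete illustration of the MA and MCA baselines described above, the following is a minimal sketch that fills missing entries, marked here with NaN, with per-attribute means or modes. The function names and the array layout are our own assumptions for the example, not the original implementations.

```python
import numpy as np

def impute_ma(data):
    """Replace each missing value (NaN) with the mean of its attribute (MA)."""
    data = data.copy()
    col_means = np.nanmean(data, axis=0)        # per-attribute means, ignoring NaNs
    rows, cols = np.where(np.isnan(data))
    data[rows, cols] = col_means[cols]
    return data

def impute_mca(data):
    """Replace each missing value with the most common value of its attribute (MCA).

    Intended for discrete-valued attributes."""
    data = data.copy()
    for j in range(data.shape[1]):
        col = data[:, j]                        # view into column j
        known = col[~np.isnan(col)]
        values, counts = np.unique(known, return_counts=True)
        col[np.isnan(col)] = values[np.argmax(counts)]  # mode of the known values
    return data
```

Both routines replace every missing entry in a column with the same single value, which is exactly the behavior criticized above: the rest of the attribute's distribution is discarded.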

Finally, the k-nearest neighbor imputation method [10, 11] estimates the values to be filled in from the k nearest neighbors, based only on the known values. The main obstacle of this method is its runtime complexity.
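The following is a rough sketch of this nearest-neighbor scheme under the same NaN convention; the distance over shared known coordinates and the choice to average the neighbors' values are our own simplifications. Its nested loops make the runtime obstacle mentioned above easy to see.

```python
import numpy as np

def impute_knn(data, k=3):
    """Fill each missing entry with the mean of that attribute over the k nearest
    neighbors; distances use only coordinates known in both rows."""
    filled = data.copy()
    for r in np.where(np.isnan(data).any(axis=1))[0]:
        row = data[r]
        dists = []
        for s in range(data.shape[0]):          # O(n) scan per incomplete row
            if s == r:
                continue
            other = data[s]
            shared = ~np.isnan(row) & ~np.isnan(other)
            if not shared.any():
                continue
            d = np.mean((row[shared] - other[shared]) ** 2)  # mean over shared dims
            dists.append((d, s))
        dists.sort()
        neighbors = [s for _, s in dists[:k]]
        for j in np.where(np.isnan(row))[0]:
            vals = data[neighbors, j]
            vals = vals[~np.isnan(vals)]        # neighbors may miss this attribute too
            if vals.size:
                filled[r, j] = vals.mean()
    return filled
```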

We can summarize the main drawbacks of the methods above as: (1) inability to approximate the missing value well and (2) inefficiency in computing the suggested value. In contrast, our suggested method [12] computes the distance between two points that may contain missing values in a way that is not only efficient but also takes the distribution of each attribute into account.

To do that, the computation considers all the possible values of the missing entry together with their probabilities, which are derived from the attribute's distribution. This is in contrast to the MCA and MA methods, which replace each missing value only with the mode or the mean of its attribute.

There are three possible cases for a pair of coordinate values: (a) both values are known: in this case, the distance is computed as the Euclidean distance; (b) both values are missing; and (c) one value is missing. In the last two cases, the distance is computed based only on the mean and the variance of the attribute. As a result, the runtime of the developed distance is O(1), as for the Euclidean distance.

In this research, we integrated this distance function in order to develop the k-means and the mean shift clustering algorithms for incomplete data. To this end, we derived two more formulas: one for computing the mean (for the k-means algorithm) and one for computing the gradient of the locally estimated density (for the mean shift clustering algorithm).

The developed algorithms yield better results than the other methods and preserve the runtime of the algorithms that deal with complete data, as can be seen in the experiments. We experimented on six standard numerical datasets from different fields from the Speech and Image Processing Unit [13]. Our experiments show that the performance of the developed algorithms using our distance function was superior to that of the other methods.

This chapter is organized as follows. A review of our distance function (MDE) is given in Section 2. The mean computation is presented in Section 3, which also describes several directions for integrating the MDE distance and the computed mean within the k-means clustering algorithm. The mean shift clustering algorithm is presented in Section 4, and Section 4.1 describes how to integrate the MDE distance and the derived mean shift vector within the mean shift clustering algorithm. Experimental results of the developed clustering algorithms are presented in Section 5. Finally, our conclusions and future work are presented in Section 6.

2. Our distance measure

First, we give a short overview of the basic distance function, developed in [12], that is able to compute distances between points with missing values.

Let A ⊆ R^K be a set of points. For the ith attribute A^i, the conditional probability for A^i is computed according to the known values of this attribute in A (i.e., A^i ∼ χ^i), where χ^i is the distribution of the ith coordinate.

Given two sample points X, Y ∈ R^K, the goal is to compute the distance between them. Let x_i and y_i be the ith coordinate values of points X and Y, respectively. There are three possible cases for the values of x_i and y_i:

1. Both values are known: the distance between them is defined as the Euclidean distance.

2. One value is missing: suppose that x_i is missing and the value y_i is given. Since the value of x_i is unknown, we cannot compute the distance using the Euclidean distance equation. Instead, we compute the expectation of all the distances between the given value y_i and all the possible values of attribute i according to its distribution χ^i. Therefore, we approximate the mean Euclidean distance (MDE) between y_i and the missing value m_i as

MDE(y_i, m_i) = E_{x∼χ^i}[(y_i − x)^2] = (y_i − μ_i)^2 + σ_i^2,

where μ_i and σ_i^2 are the mean and the variance of attribute i.

3. Both values are missing: taking the expectation over both unknown values, treated as independent samples from χ^i, gives E[(x − y)^2] = 2σ_i^2, so the distance again depends only on the variance of the attribute.
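To make the three cases concrete, here is a minimal sketch of the per-coordinate MDE computation. The function names, the use of NaN to mark missing values, and the precomputed per-attribute means and variances are our own illustrative choices, not part of the original implementation; the squared-Euclidean form follows the expressions above.

```python
import numpy as np

def mde_coordinate(x_i, y_i, mu_i, var_i):
    """Mean Euclidean distance (MDE) contribution of coordinate i.

    x_i, y_i : coordinate values (np.nan marks a missing value)
    mu_i, var_i : mean and variance of attribute i, estimated from its known values
    """
    x_missing, y_missing = np.isnan(x_i), np.isnan(y_i)
    if not x_missing and not y_missing:
        # Case 1: both values known -- ordinary squared Euclidean term.
        return (x_i - y_i) ** 2
    if x_missing and y_missing:
        # Case 3: both missing -- expectation over two independent draws from chi^i.
        return 2.0 * var_i
    # Case 2: one value missing -- expectation over chi^i for the known value v:
    # E[(v - x)^2] = (v - mu_i)^2 + var_i.
    known = y_i if x_missing else x_i
    return (known - mu_i) ** 2 + var_i

def mde_distance(X, Y, mu, var):
    """Sum the per-coordinate terms, constant work per coordinate as with Euclidean distance."""
    return sum(mde_coordinate(X[i], Y[i], mu[i], var[i]) for i in range(len(X)))
```

Because each case reduces to a closed-form expression in μ_i and σ_i^2, the work per coordinate is constant, which is what preserves the O(1)-per-attribute runtime noted in the introduction.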


