#### 5. Mean shift algorithm

In this section, we describe another use case that integrates the derived distance function MDE within the framework of the mean shift clustering algorithm. We first give a short overview of the mean shift algorithm and then describe how the MDE distance is used within it. Here, we only review some of the results described in [16, 17], which should be consulted for the details. Let $\mathbf{x}\_i \in \mathbb{R}^d$, $i = 1, \dots, n$, be the data points, each associated with a bandwidth value $h > 0$. The sample point density estimator at point $\mathbf{x}$ is

$$\widehat{f}(\mathbf{x}) = \frac{1}{nh^d} \sum\_{i=1}^n K\left(\frac{\mathbf{x} - \mathbf{x}\_i}{h}\right). \tag{5}$$

Based on a symmetric kernel $K$ with bounded support satisfying

$$K(\mathbf{x}) = c\_{k,d} k \left( \|\mathbf{x}\|^2 \right) \qquad \|\mathbf{x}\| \le 1 \tag{6}$$

Eq. (5) is a nonparametric estimator of the density at $\mathbf{x}$ in the feature space, where $k(x)$, $0 \le x \le 1$, is the profile of the kernel and the normalization constant $c\_{k,d}$ ensures that $K(\mathbf{x})$ integrates to one. As a result, the density estimator in Eq. (5) can be rewritten as

$$\widehat{f}\_{h,k}(\mathbf{x}) = \frac{c\_{k,d}}{nh^d} \sum\_{i=1}^n k\left(\left\|\frac{\mathbf{x} - \mathbf{x}\_i}{h}\right\|^2\right). \tag{7}$$
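
To make Eqs. (5)-(7) concrete, the following minimal Python sketch (our own illustration, not code from [16, 17]) evaluates the profile-based density estimator. The Epanechnikov profile $k(x) = 1 - x$ on $[0, 1]$ is assumed as an example of a bounded-support profile, and $c\_{k,d}$ is left as a parameter, since it only rescales the density and does not move its modes; the function names are hypothetical.

```python
import numpy as np

def epanechnikov_profile(x):
    """Profile k(x) = 1 - x on [0, 1], zero outside (bounded support)."""
    return np.where((x >= 0) & (x <= 1), 1.0 - x, 0.0)

def density_estimate(x, data, h, profile=epanechnikov_profile, c_kd=1.0):
    """Eq. (7): f_hat(x) = c_{k,d} / (n h^d) * sum_i k(||(x - x_i) / h||^2)."""
    n, d = data.shape
    sq_norms = np.sum(((x - data) / h) ** 2, axis=1)  # ||(x - x_i)/h||^2, one per point
    return c_kd / (n * h ** d) * np.sum(profile(sq_norms))
```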

The first step in the analysis of a feature space with underlying density $f(\mathbf{x})$ is to find the modes of this density, which are located among the zeros of the gradient, $\nabla f(\mathbf{x}) = 0$; the mean shift procedure is a way to find these zeros without estimating the density itself.

Therefore, the density gradient estimator is obtained as the gradient of the density estimator by capitalizing on the linearity of Eq. (7).

$$\nabla \widehat{f}\_{h,K}(\mathbf{x}) = \frac{2c\_{k,d}}{nh^{d+2}} \sum\_{i=1}^{n} (\mathbf{x} - \mathbf{x}\_i) k' \left( \left\| \frac{\mathbf{x} - \mathbf{x}\_i}{h} \right\|^2 \right). \tag{8}$$

Define $g(x) = -k'(x)$; the kernel $G(\mathbf{x})$ is then defined as:

$$G(\mathbf{x}) = c\_{g,d}\, g\left(\|\mathbf{x}\|^2\right).$$

Introducing $g(x)$ into Eq. (8) yields

$$\begin{split} \nabla \widehat{f}\_{h,K}(\mathbf{x}) &= \frac{2c\_{k,d}}{nh^{d+2}} \sum\_{i=1}^{n} (\mathbf{x}\_{i} - \mathbf{x}) \mathbf{g}\left( \left\| \frac{\mathbf{x} - \mathbf{x}\_{i}}{h} \right\|^{2} \right) \\ &= \frac{2c\_{k,d}}{nh^{d+2}} \left[ \sum\_{i=1}^{n} \mathbf{g}\left( \left\| \frac{\mathbf{x} - \mathbf{x}\_{i}}{h} \right\|^{2} \right) \right] \left[ \frac{\sum\_{i=1}^{n} \mathbf{x}\_{i} \mathbf{g}\left( \left\| \frac{\mathbf{x} - \mathbf{x}\_{i}}{h} \right\|^{2} \right)}{\sum\_{i=1}^{n} \mathbf{g}\left( \left\| \frac{\mathbf{x} - \mathbf{x}\_{i}}{h} \right\|^{2} \right)} - \mathbf{x} \right], \end{split} \tag{9}$$

where $\sum\_{i=1}^n g\left(\left\|\frac{\mathbf{x} - \mathbf{x}\_i}{h}\right\|^2\right)$ is assumed to be a positive number. Both terms of the product in Eq. (9) have special significance. The first term is proportional to the density estimate at $\mathbf{x}$ computed with the kernel $G$. The second term

$$m\_G(\mathbf{x}) = \frac{\sum\_{i=1}^n \mathbf{x}\_i \mathbf{g}\left(\left\|\frac{\mathbf{x} - \mathbf{x}\_i}{h}\right\|^2\right)}{\sum\_{i=1}^n \mathbf{g}\left(\left\|\frac{\mathbf{x} - \mathbf{x}\_i}{h}\right\|^2\right)} - \mathbf{x} \tag{10}$$

is called the mean shift vector. The mean shift vector thus points toward the direction of maximum increase in the density. The implication of the mean shift property is that the iterative procedure

$$\mathbf{y}\_{j+1} = \frac{\sum\_{i=1}^{n} \mathbf{x}\_i\, g\left(\left\|\frac{\mathbf{y}\_j - \mathbf{x}\_i}{h}\right\|^2\right)}{\sum\_{i=1}^{n} g\left(\left\|\frac{\mathbf{y}\_j - \mathbf{x}\_i}{h}\right\|^2\right)} \qquad j = 1, 2, \dots \tag{11}$$

In practice, the convergence points of this iterative procedure are most often the local maxima (modes) of the density. All the points that share the same mode are assigned to the same cluster; therefore, we obtain as many clusters as there are modes.
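
As an illustration, here is a hedged sketch of the procedure in Eq. (11); the function names are ours, and $g(x) = -k'(x)$ is instantiated for the Epanechnikov profile assumed above, for which $g$ is constant on $[0, 1]$.

```python
import numpy as np

def g_uniform(x):
    """g(x) = -k'(x) for the Epanechnikov profile: 1 on [0, 1], 0 outside."""
    return ((x >= 0) & (x <= 1)).astype(float)

def mean_shift_mode(y, data, h, g=g_uniform, tol=1e-6, max_iter=500):
    """Iterate Eq. (11) from y until the shift is negligible; the fixed
    point approximates a mode of the density estimate."""
    for _ in range(max_iter):
        w = g(np.sum(((y - data) / h) ** 2, axis=1))  # g(||(y_j - x_i)/h||^2)
        if w.sum() == 0:                              # no points within bandwidth
            return y
        y_next = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(y_next - y) < tol:
            return y_next
        y = y_next
    return y
```

Running this iteration from every data point and merging the points whose iterations converge to the same fixed point implements the clustering just described.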

#### 5.1. Mean shift computing using the MDE distance

This section describes how to integrate the MDE distance within the framework of the mean shift clustering algorithm. To this end, we first compute the mean shift vector using the MDE distance and then integrate the MDE distance and the derived mean shift vector into the mean shift algorithm.

Using the derived MDE distance, the density estimator in Eq. (7) can be rewritten as:

$$\widehat{f}\_{h,k}(\mathbf{x}) = \frac{c\_{k,d}}{nh^d} \sum\_{i=1}^n k\left(\left\|\frac{\mathbf{x} - \mathbf{x}\_i}{h}\right\|^2\right) = \frac{c\_{k,d}}{nh^d} \sum\_{i=1}^n k\left(\frac{\sum\_{j=1}^d MD\_E\left(\mathbf{x}^j, \mathbf{x}\_i^j\right)^2}{h^2}\right). \tag{12}$$

Since each point $\mathbf{x}\_i$ may contain missing attributes, $\widehat{f}\_{h,k}(\mathbf{x})$ becomes:

$$\widehat{f}\_{h,k}(\mathbf{x}) = \frac{c\_{k,d}}{nh^d} \sum\_{i=1}^n k\left( \underbrace{\frac{\sum\_{j=1}^{kn\_i} MD\_E\left(\mathbf{x}^j, \mathbf{x}\_i^j\right)^2}{h^2}}\_{\text{each } \mathbf{x}\_i \text{ has } kn\_i \text{ known attributes}} + \underbrace{\frac{\sum\_{j=1}^{unkn\_i} MD\_E\left(\mathbf{x}^j, \mathbf{x}\_i^j\right)^2}{h^2}}\_{\text{each } \mathbf{x}\_i \text{ has } unkn\_i \text{ missing attributes}} \right).$$

According to the definition of the MDE distance, we obtain:

$$\widehat{f}\_{h,k}(\mathbf{x}) = \frac{c\_{k,d}}{nh^d} \sum\_{i=1}^n k\left( \frac{\sum\_{j=1}^{kn\_i} \left(\mathbf{x}^j - \mathbf{x}\_i^j\right)^2}{h^2} + \frac{\sum\_{j=1}^{unkn\_i} \left(\mathbf{x}^j - \mu^j\right)^2 + \left(\sigma^j\right)^2}{h^2} \right). \tag{13}$$
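
Before turning to the gradient, a short sketch of Eq. (13) may help. It assumes missing attributes are encoded as NaN and that `mu` and `sigma` hold the per-attribute mean and standard deviation of the observed values; the helper name and encoding are ours, not from [16, 17], and `profile` can be the `epanechnikov_profile` from the earlier sketch.

```python
import numpy as np

def mde_density(x, data, h, mu, sigma, profile, c_kd=1.0):
    """Eq. (13): known attributes contribute (x^j - x_i^j)^2, while missing
    ones (NaN) contribute (x^j - mu^j)^2 + (sigma^j)^2 to the kernel argument."""
    n, d = data.shape
    known = ~np.isnan(data)              # (n, d) mask of known attributes
    total = 0.0
    for i in range(n):
        m = known[i]
        arg = (np.sum((x[m] - data[i, m]) ** 2)                  # known part
               + np.sum((x[~m] - mu[~m]) ** 2 + sigma[~m] ** 2)  # missing part
              ) / h ** 2
        total += profile(arg)
    return c_kd / (n * h ** d) * total
```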

Now, we will compute the gradient of the density estimator in Eq. (13).

$$\begin{split} \nabla\widehat{f}\_{h,k}(\mathbf{x}) &= \frac{c\_{k,d}}{nh^{d+2}} \sum\_{i=1}^n \left[ \sum\_{j=1}^{kn\_i} \left(\mathbf{x}^j - \mathbf{x}\_i^j\right)^2 + \sum\_{j=1}^{unkn\_i} \left(\mathbf{x}^j - \mu^j\right)^2 + \left(\sigma^j\right)^2 \right]' \cdot k'\left( \frac{\sum\_{j=1}^{kn\_i} \left(\mathbf{x}^j - \mathbf{x}\_i^j\right)^2}{h^2} + \frac{\sum\_{j=1}^{unkn\_i} \left(\mathbf{x}^j - \mu^j\right)^2 + \left(\sigma^j\right)^2}{h^2} \right) \\ &= \frac{c\_{k,d}}{nh^{d+2}} \sum\_{i=1}^n \left( \left[ \sum\_{j=1}^{kn\_i} \left(\mathbf{x}^j - \mathbf{x}\_i^j\right)^2 \right]' + \left[ \sum\_{j=1}^{unkn\_i} \left(\mathbf{x}^j - \mu^j\right)^2 + \left(\sigma^j\right)^2 \right]' \right) \cdot k'\left(\cdot\right), \end{split}$$

where $k'(\cdot)$ abbreviates the same argument of $k'$ as in the first line.

In our computation, we first treat a single coordinate $l$ and then generalize the computation to all the other coordinates.

$$\begin{split} \Rightarrow f'\_{\mathbf{x}^l} &= \frac{2c\_{k,d}}{nh^{d+2}} \sum\_{i=1}^{n\_l} \left(\mathbf{x}^l - \mathbf{x}\_i^l\right) \cdot k'\left( \frac{\sum\_{j=1}^{kn\_i} \left(\mathbf{x}^j - \mathbf{x}\_i^j\right)^2}{h^2} + \frac{\sum\_{j=1}^{unkn\_i} \left(\mathbf{x}^j - \mu^j\right)^2 + \left(\sigma^j\right)^2}{h^2} \right) \\ &\quad + \frac{2c\_{k,d}}{nh^{d+2}} \sum\_{i=1}^{m\_l} \left(\mathbf{x}^l - \mu^l\right) \cdot k'\left( \frac{\sum\_{j=1}^{kn\_i} \left(\mathbf{x}^j - \mathbf{x}\_i^j\right)^2}{h^2} + \frac{\sum\_{j=1}^{unkn\_i} \left(\mathbf{x}^j - \mu^j\right)^2 + \left(\sigma^j\right)^2}{h^2} \right) \\ &= \frac{2c\_{k,d}}{nh^{d+2}} \left[ \mathbf{x}^l \cdot \sum\_{i=1}^{n} k'\left(\cdot\right) - \sum\_{i=1}^{n\_l} \mathbf{x}\_i^l \cdot k'\left(\cdot\right) - \sum\_{i=1}^{m\_l} \mu^l \cdot k'\left(\cdot\right) \right], \end{split}$$


where there are $n\_l$ points for which the $\mathbf{x}^l$ coordinate is known and $m\_l$ points for which it is missing. Rewriting the last expression in terms of $g(x) = -k'(x)$ and the MDE distance gives:

$$\begin{split} f'\_{\mathbf{x}^l} &= \frac{2c\_{k,d}}{nh^{d+2}} \cdot \left[ \sum\_{i=1}^{n} g\left( \frac{\sum\_{j=1}^{d} MD\_E\left(\mathbf{x}^j, \mathbf{x}\_i^j\right)^2}{h^2} \right) \right] \\ &\quad \cdot \left[ \frac{\sum\_{i=1}^{n\_l} \mathbf{x}\_i^l \cdot g\left( \frac{\sum\_{j=1}^{d} MD\_E\left(\mathbf{x}^j, \mathbf{x}\_i^j\right)^2}{h^2} \right) + \sum\_{i=1}^{m\_l} \mu^l \cdot g\left( \frac{\sum\_{j=1}^{d} MD\_E\left(\mathbf{x}^j, \mathbf{x}\_i^j\right)^2}{h^2} \right)}{\sum\_{i=1}^{n} g\left( \frac{\sum\_{j=1}^{d} MD\_E\left(\mathbf{x}^j, \mathbf{x}\_i^j\right)^2}{h^2} \right)} - \mathbf{x}^l \right]. \end{split}$$

As a result, the mean shift vector using the MDE distance is defined as:

$$m\_{MD\_E,G}(\mathbf{x}) = \frac{\sum\_{i=1}^{n\_l} \mathbf{x}\_i^l \cdot g\left( \frac{\sum\_{j=1}^{d} MD\_E\left(\mathbf{x}^j, \mathbf{x}\_i^j\right)^2}{h^2} \right) + \sum\_{i=1}^{m\_l} \mu^l \cdot g\left( \frac{\sum\_{j=1}^{d} MD\_E\left(\mathbf{x}^j, \mathbf{x}\_i^j\right)^2}{h^2} \right)}{\sum\_{i=1}^{n} g\left( \frac{\sum\_{j=1}^{d} MD\_E\left(\mathbf{x}^j, \mathbf{x}\_i^j\right)^2}{h^2} \right)} - \mathbf{x}^l. \tag{14}$$

Now, we can use this equation to run the mean shift procedure over datasets with missing values.
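
As a rough sketch of how Eq. (14) can drive the procedure, the following hypothetical step function computes the new location coordinate-wise: known values $\mathbf{x}\_i^l$ and the attribute mean $\mu^l$ enter the numerator, and every point is weighted by $g$ of its squared MDE distance to the current location (missing values again encoded as NaN).

```python
import numpy as np

def mde_mean_shift_step(x, data, h, mu, sigma, g):
    """One mean shift step over incomplete data, following Eq. (14):
    returns the new location y; the mean shift vector is y - x."""
    known = ~np.isnan(data)
    filled = np.where(known, np.nan_to_num(data), mu)   # x_i^l if known, mu^l if not
    # squared MDE distance from x to each x_i, summed over all d coordinates
    d2 = np.where(known, (x - np.nan_to_num(data)) ** 2,
                  (x - mu) ** 2 + sigma ** 2).sum(axis=1)
    w = g(d2 / h ** 2)                                  # one weight per data point
    return (w[:, None] * filled).sum(axis=0) / w.sum()  # new location, all coordinates
```

Iterating this step from every data point and grouping the points that converge to the same mode then yields the clustering, exactly as in the complete-data case.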

#### 6. Experiments on numerical datasets

In order to measure the performance of the developed clustering algorithms (i.e., k-means and mean shift), we compare their performance on complete datasets to their performance on incomplete data, using the suggested distance function and then again using the existing methods (MCA, MA, and MI) within the standard algorithms.

| Dataset | Dataset size | Clusters |
|---|---|---|
| Flame | 240 × 2 | 2 |
| Jain | 373 × 2 | 2 |
| Path-based | 300 × 2 | 3 |
| Spiral | 312 × 2 | 3 |
| Compound | 399 × 2 | 6 |
| Aggregation | 788 × 2 | 7 |

Table 1. Speech and Image Processing Unit dataset properties.

To measure the similarity between two data clusterings, we use the Rand index [18], comparing the results of the original clustering algorithms to the results of the derived algorithms for incomplete datasets.
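
For completeness, a direct $O(n^2)$ sketch of the Rand index (our own helper, adequate for datasets of this size) is:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree: both put the
    pair in the same cluster, or both put it in different clusters."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)
```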

Our experiments use six standard numerical datasets from the Speech and Image Processing Unit [13]; dataset characteristics are shown in Table 1.

We produced the missing data by randomly drawing a set consisting of 10–40% of the points from each dataset. These sets are used as samples of incomplete data, where one attribute of each drawn point was randomly selected and assigned as a missing value. For each dataset, we average the results over 10 different runs.
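
A minimal sketch of this corruption protocol (the helper name and the NaN encoding are our assumptions) could look as follows:

```python
import numpy as np

def make_incomplete(data, frac, seed=None):
    """Mark one randomly chosen attribute as missing (NaN) in a random
    fraction `frac` (here 0.1-0.4) of the points."""
    rng = np.random.default_rng(seed)
    out = data.astype(float).copy()
    n, d = out.shape
    rows = rng.choice(n, size=int(round(frac * n)), replace=False)
    cols = rng.integers(0, d, size=rows.size)  # one attribute per selected point
    out[rows, cols] = np.nan
    return out
```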

#### 6.1. k-Means experiments


For the k-means algorithm, we developed two versions, k-means-MDE and k-means-HistMDE, to cluster the incomplete datasets. We compare the performance of the k-means clustering algorithm (k is fixed for each dataset) on complete data (i.e., without missing values) to its performance on data with missing values, first using the MDE distance measure (k-means-MDE and k-means-HistMDE) and then again using k-means-(MCA, MA, and MI).
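
As an illustration of how the MDE distance plugs into k-means, here is a hedged sketch of the assignment step only; the helper is ours, it assumes the MDE distance of Eq. (13) applied between an incomplete point and a complete cluster center, and k-means-HistMDE is not sketched.

```python
import numpy as np

def assign_mde(data, centers, mu, sigma):
    """k-means assignment with the MDE distance: a missing attribute j
    contributes (c^j - mu^j)^2 + (sigma^j)^2 instead of (x^j - c^j)^2."""
    known = ~np.isnan(data)
    x = np.nan_to_num(data)
    d2 = np.stack([np.where(known, (x - c) ** 2,
                            (c - mu) ** 2 + sigma ** 2).sum(axis=1)
                   for c in centers], axis=1)   # (n, k) squared distances
    return d2.argmin(axis=1)                    # nearest-center index per point
```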

As can be seen in Figure 2, the new algorithms, which are based on the MDE distance, outperformed the other existing algorithms on all the datasets. This occurs because the MA and MCA methods replace the whole distribution of values by the mean or the mode of the distribution of known values, that is, by a fixed value. Our two developed algorithms, in contrast, use the distribution of the observed values in all the computation stages. This additional information, which takes into account not only the mean of the attribute but also its variance, is probably the reason for the improved performance of our methods compared to the known heuristics.

#### 6.2. Mean shift experiments

The mean shift clustering algorithm was tested using bandwidth $h = 4$, because we observed that the standard mean shift worked well for this value.

Figure 2. Results of the k-means clustering algorithm using the different distance functions on the six datasets from the Speech and Image Processing Unit.

For each dataset, a curve of Rand index values was constructed to evaluate how well each algorithm performed.

As can be seen in Figure 3, for all the datasets except the Jain dataset, the curves show that the new mean shift algorithm was superior to the other compared methods for all missing value percentages; for the Jain dataset, its superiority became apparent only when the percentage of missing values exceeded 25%, as can be seen in Figure 3(b). In addition, the MS-MCA method outperforms the MS-MA method for the Flame and Path-based datasets, while MS-MA outperforms MS-MCA for the other datasets, so we cannot decide unequivocally which of these two is better. On the other hand, we can clearly state that MS-MDE outperforms the other methods, especially as the percentage of missing values increases.
