14 Recent Applications in Data Clustering

Clustering Algorithms for Incomplete Datasets http://dx.doi.org/10.5772/intechopen.78272 15

A resulting curve for the Rand index values was constructed for each dataset to evaluate how well the algorithm performed.

Figure 2. Results of k-means clustering algorithm using the different distance functions on the six datasets from the Speech and Image Processing Unit.

As can be seen in Figure 3, for all the datasets except the Jain dataset, the new mean shift algorithm outperformed the other compared methods for all missing value percentages. For the Jain dataset, its superiority became apparent only when the percentage of missing values exceeded 25%, as can be seen in Figure 3(b). In addition, the MS MC method outperforms the MS MA method for the flame and path-based datasets, whereas MS MA outperforms MS MC for the other datasets. As a result, we cannot decide unequivocally which of these two algorithms is better. On the other hand, we can clearly state that MS MDE outperforms the other methods, especially when the percentage of missing values increases.

Figure 3. Results of mean shift clustering algorithm using the different distance functions on the six datasets from the Speech and Image Processing Unit.

7. Conclusions

Missing values in data are common in real-world applications. They can be caused by human error, equipment failure, system-generated errors, and so on. Several methods were developed to deal with this problem, such as filling the missing values with fixed values, ignoring samples with missing values, or handling the missing values by defining a distance function.
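To make the distance-function option concrete, the following is a minimal sketch of a distance over incomplete vectors that uses only the per-attribute mean and variance. It assumes that missing entries are encoded as NaN and that a missing value is treated as a random variable with the attribute's mean and variance, so that the expected squared difference is (x − μ)² + σ² when one value is missing and 2σ² when both are missing. The function name `expected_sq_distance` and the toy data are hypothetical; the exact definition of the MDE distance is the one given in [12], and this sketch is only an illustration in that spirit.

```python
import numpy as np


def expected_sq_distance(x, y, mu, var):
    """Expected squared distance between two vectors that may contain NaN.

    Assumption (not taken from the chapter): a missing coordinate behaves
    like a random variable with the attribute's mean `mu` and variance `var`.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_missing = np.isnan(x)
    y_missing = np.isnan(y)

    # Substitute the attribute mean for missing coordinates ...
    x_filled = np.where(x_missing, mu, x)
    y_filled = np.where(y_missing, mu, y)

    # ... and add one variance term per missing coordinate:
    #   both known   -> (x - y)^2
    #   one missing  -> (known - mu)^2 + var
    #   both missing -> 2 * var
    n_missing = x_missing.astype(float) + y_missing.astype(float)
    per_attribute = (x_filled - y_filled) ** 2 + n_missing * var
    return float(per_attribute.sum())


# Toy data: three samples, two attributes, one missing value.
data = np.array([[1.0, 2.0],
                 [3.0, np.nan],
                 [5.0, 6.0]])

# Per-attribute statistics estimated from the observed values only.
mu = np.nanmean(data, axis=0)
var = np.nanvar(data, axis=0)

d01 = expected_sq_distance(data[0], data[1], mu, var)  # 12.0 for this toy data
```

Because the statistics are computed once per attribute, evaluating the distance costs the same order of work as a Euclidean distance between two complete vectors.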

In this work, we have proposed a new mean shift clustering algorithm and two versions of the k-means clustering algorithm for incomplete datasets, based on the MDE distance presented in [1, 2, 12].

The computational complexity of each developed algorithm is the same as that of the corresponding standard algorithm using the Euclidean distance. The distance is computed from only the mean and variance of the data for each attribute.

We experimented on six standard numerical datasets from different fields. On these datasets, we simulated missing values and compared the performance of the developed algorithms, using our distance and the suggested mean computations, to three other basic methods.

From our experiments, we conclude that the developed methods are more appropriate for measuring the mean, mean shift vector, and weighted mean for objects with missing values, especially when the percentage of missing values is large.

[6] Little RJA, Rubin DB. Statistical Analysis with Missing Data. Hoboken, New Jersey: John Wiley & Sons; 2014

[7] Zhang S, Qin Z, Ling CX, Sheng S. Missing is useful: Missing values in cost-sensitive decision trees. IEEE Transactions on Knowledge and Data Engineering. 2005;17(12):1689-1693

[8] Magnani M. Techniques for dealing with missing data in knowledge discovery tasks. Obtido. 2004;15(01):2007. http://magnanim.web.cs.unibo.it/index.html

[9] Grzymala-Busse J, Hu M. A comparison of several approaches to missing attribute values in data mining. In: Proceedings of Rough Sets and Current Trends in Computing; Springer; 2001. pp. 378-385

[10] Zhang S. Shell-neighbor method and its application in missing data imputation. Applied Intelligence. 2011;35(1):123-133

[11] Batista G, Monard MC. An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence. 2003;17(5–6):519-533

[12] AbdAllah L, Shimshoni I. A distance function for data with missing values and its applications on KNN and k-means algorithms. International Journal Advances in Data Analysis and Classification

[13] University of Eastern Finland, Speech and Image Processing Unit. Clustering dataset, http://cs.joensuu.fi/sipu/datasets/; 2008

[14] Hunt L, Jorgensen M. Mixture model clustering for mixed data with missing information. Computational Statistics and Data Analysis. 2003;41(3):429-440

[15] Ghahramani Z, Jordan M. Learning from incomplete data. Technical Report, MIT AI Lab Memo, (1509), 1995

[16] Comaniciu D, Meer P. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002;24(5):603-619

[17] Georgescu B, Shimshoni I, Meer P. Mean shift based clustering in high dimensions: A texture classification example. In: Proceedings of the 9th International Conference on Computer Vision; 2003. pp. 456-463

[18] Rand WM. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 1971;66(336):846-850
