Preface

Contents

Chapter 10 **Performance Assessment of Unsupervised Clustering Algorithms Combined MDL Index**
Hadeel K. Aljobouri, Hussain A. Jaber and Ilyas Çankaya

Chapter 11 **New Approaches in Multi-View Clustering**
Fanghua Ye, Zitai Chen, Hui Qian, Rui Li, Chuan Chen and Zibin Zheng

Chapter 12 **Collective Solutions on Sets of Stable Clusterings**
Vladimir Vasilevich Ryazanov

We are experiencing a transition from an information age to a wisdom age driven by an explosion in data available for analysis. A major consequence of this transition is an evolution of data mining to become data analytics, a discipline involving engineers, statisticians, and computer scientists together with the diverse realms they serve. The potential value that resides in data is suggested in the famous saying "data is the oil of the age," but suitable theories and tools are needed to tease this value into the open.

Clustering has emerged as one of the more fertile fields within data analytics, widely adopted by companies, research institutions, and educational entities as a tool to describe similar/different groups, communities, patterns, modules, and objects and broadly to predict assignment of certain members to unlabeled groups in an unsupervised fashion. Often classed as an instance of machine learning, clustering has found applications to generate groups consisting of market segments, genes, constellations of stars, movies to recommend, facilities to serve critical functions, and remarkable communities within a society.

The history of clustering dates back to ancient times, manifest in Aristotle's taxonomy of living things, and quite possibly can be traced even earlier. Just as counting is essential to computation, clustering is essential for learning and predicting. Hence, clustering algorithms have been developed in rich abundance.

This book is intended to provide a view of recent contributions to the vast clustering literature that offers useful insights within the context of modern applications for professionals, academics, and students. The book spans the domains of clustering in image analysis, lexical analysis of texts, replacement of missing values in data, temporal clustering in smart cities, comparison of artificial neural network variations, graph theoretical approaches, spectral clustering, multiview clustering, and model-based clustering in an R package. Image, text, face recognition, speech (synthetic and simulated), and smart city datasets are used. The table below is a summary of chapters according to the types of theory and applications they represent:





Some of the distinguishing features of these contributions are as follows:

Loai AbdAllah and Ilan Shimshoni develop a new distance function that computes distances over incomplete datasets. The distances are employed in K-means and mean shift algorithms. The procedure is compared with mean imputation (MI), mean attribute (MA), and most common value (MCA) replacements. Experiments are run on six standard numerical datasets.

Uğurhan Kutbay reviews partitional clustering focusing on K-means, fuzzy C-means, colored fuzzy C-means, and a genetic algorithm. Algorithms are applied to the same image data.

Reda R. Gharieb incorporates the influence of an object's neighborhood employing two Kullback-Leibler (KL) membership divergences for clustering image data. A local membership function is embedded into the objective function of hard C-means, which mitigates the effect of additive noise. Partition coefficient and partition entropy measures are adopted to evaluate the performance of fuzzy clustering algorithms. A synthetic image, a simulated MRI image, and the standard Lena image are used to compare conventional C-means algorithms with two different KL divergence fuzzy C-means algorithms.
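The two validity measures named above have standard closed forms, which the short sketch below illustrates. It assumes a membership matrix `u` of shape (clusters, points) whose columns sum to 1; the function names are illustrative, not taken from the chapter.

```python
import numpy as np

# Fuzzy clustering validity measures (standard definitions).
# u[i, k] is the membership of point k in cluster i; columns sum to 1.
def partition_coefficient(u):
    # PC = (1/n) * sum_ik u_ik^2; closer to 1 means a crisper partition.
    return float(np.sum(u ** 2) / u.shape[1])

def partition_entropy(u):
    # PE = -(1/n) * sum_ik u_ik * log(u_ik); closer to 0 means crisper.
    n = u.shape[1]
    return float(-np.sum(u * np.log(u + 1e-12)) / n)
```

For a completely crisp partition (every membership 0 or 1), the partition coefficient is 1 and the partition entropy is essentially 0; fuzzier partitions move both measures away from these extremes.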

Khaled Abdalgader presents a new version of the original K-means algorithm to cluster small-sized text fragments. This new variation measures the semantic similarity between sentences based on the idea of generating a synonym expansion set to be used in the compared semantic vectors. The algorithm is compared with Spectral Clustering, Affinity Propagation, K-medoids, STC-LE, and K-means (TF-IDF) using the Reuters-21578, Aural Sonar, Protein, Voting, SearchSnippets, StackOverflow, and Biomedical datasets, and the Purity, Entropy, V-measure, Rand Index, and F-measure validation metrics.

Masafumi Nakagawa focuses on region-based point cloud clustering to improve 3D visualization and modeling using massive point clouds, based on a combined point cloud clustering methodology and point cloud filtering on a multilayered panoramic range image. Indoor MMS data and two terrestrial laser scanner datasets are used to test the approach.

Marta presents a clustering algorithm based on the copula function and the R package CoClust. The range (or set) of clusters from which the procedure automatically selects the best one and the sample size to be used to select it can be varied. The algorithm is able to find clusters according to the complex multivariate dependence structure of the data-generating process. The main R commands are used to perform a fully developed clustering of multivariate-dependent data through numerical examples.

| Chapter | Theory | Applications |
|---|---|---|
| 5 | Point cloud clustering, surface, terrestrial laser scanning | Three laser scanner datasets |
| 6 | Model-based clustering | R package, simulated examples |
| 7 | Hidden Markov temporal clustering models | Smart cities, IoT, anomaly detection, City4Age pilot sites |
| 8 | Tree-based clustering, spanning forest | |
| 9 | Spectral clustering, weight matrix construction | Three datasets from UCI, three face recognition datasets |
| 10 | ANNs, neural gas, and two variations | MatlabGUI, six synthetic datasets |
| 11 | Multiview clustering methods | K-means, spectral, matrix factorization, tensor decomposition, deep learning |
| 12 | Ensemble clustering, stability | |


Milan Vukicevic et al. propose a methodology for behavior variation and anomaly detection from acquired sensory data, based on hidden Markov temporal clustering models (HMMs). Data are collected from five prominent European smart cities and Singapore, which aim to become fully "elderly friendly."

Glover and Wang introduce a class of tree-based clustering methods based on a single parameter W and show how to generate the full collection of cluster sets C(W), without duplication, by varying W according to conditions identified automatically during the algorithm's execution. The number of clusters within C(W) for a given W is also determined automatically and provides a wide range of clusters with different structures from the same dataset.

Xiaodong Feng presents robust spectral clustering via sparse representation, proposing two approaches to weight matrix construction according to the similarity of the sparse coefficient vectors. The method is compared with K-means and spectral clustering approaches using Gaussian RBF, SIS, l1-Directed Graph Construction, and Nonnegative SIS using the external metrics clustering accuracy (CA) and normalized mutual information (NMI).
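Of the two external metrics just mentioned, NMI can be computed directly from a pair of label assignments. The sketch below is a generic implementation of the standard definition (mutual information normalized by the geometric mean of the two entropies), not code from the chapter.

```python
from collections import Counter
from math import log, sqrt

def nmi(labels_a, labels_b):
    """Normalized mutual information between two clusterings of the same points."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information from the empirical joint and marginal distributions.
    mi = sum(c / n * log(n * c / (ca[i] * cb[j]))
             for (i, j), c in joint.items())
    # Entropies of each clustering, used for normalization.
    ha = -sum(c / n * log(c / n) for c in ca.values())
    hb = -sum(c / n * log(c / n) for c in cb.values())
    return mi / sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0
```

Because NMI depends only on the partition structure, relabeling clusters leaves it unchanged: `nmi([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0, while two independent splits of the same points score 0.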

Hadeel K. Aljobouri et al. compare the performances of three artificial neural network (ANN) applications: neural gas (NG), growing neural gas (GNG), and robust growing neural gas (RGNG) algorithms.

Chuan Chen et al. summarize K-means, spectral clustering, matrix factorization, tensor decomposition, and deep learning with their multiview learning versions.

V.V. Ryazanov addresses the problem of finding the best committee synthesis for ensemble clustering, formulated as a discrete optimization problem.

I express my sincere congratulations to all the contributing authors for their outstanding innovations and insights. Special thanks go to Fred W. Glover for our fertile interactions and discussions of shared research interests.

**Harun Pirim, Ph.D.**
Assistant Professor, Systems Engineering
King Fahd University of Petroleum and Minerals
Saudi Arabia

**Chapter 1**

**Clustering Algorithms for Incomplete Datasets**

Loai AbdAllah and Ilan Shimshoni

DOI: 10.5772/intechopen.78272

Additional information is available at the end of the chapter

#### Abstract

Many real-world datasets suffer from the problem of missing values. Several methods have been developed to deal with this problem. Many of them fill the missing values with a fixed value based on a statistical computation. In this research, we developed new versions of the k-means and mean shift clustering algorithms that deal with datasets containing missing values without filling in those values. We developed a new distance function that is able to compute distances over incomplete datasets. The distance is computed based only on the mean and variance of the data for each attribute. As a result, the runtime complexity of our computation is O(1). We experimented on six standard numerical datasets from different fields. On these datasets, we simulated missing values and compared the performance of the developed algorithms using our distance and the suggested mean computations with three other basic methods. Our experiments show that the developed algorithms using our distance function outperform the existing k-means and mean shift using other methods for dealing with missing values.

Keywords: missing values, distance metric, weighted Euclidean distance, clustering, mean shift, k-means

#### 1. Introduction

Missing values in data are common in real-world applications. They can be caused by human error, equipment failure, system-generated errors, and so on.

In this research, we developed two popular clustering algorithms to run over incomplete datasets: (1) the k-means clustering algorithm [1] and (2) the mean shift clustering algorithm [2].

> © 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
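The abstract's central idea, a distance over incomplete data computed only from each attribute's mean and variance, can be sketched as follows. This is a minimal illustration rather than the authors' exact formulation: the function name, the use of NaN to mark missing entries, and the expected-squared-distance formulas are assumptions.

```python
import numpy as np

def incomplete_distance_sq(x, y, mean, var):
    """Squared distance between two points that may contain NaNs (missing values).

    One plausible realization: replace each missing coordinate by the attribute's
    distribution and take the expected squared difference, so each coordinate
    needs only the precomputed per-attribute mean and variance, an O(1) lookup.
    """
    total = 0.0
    for a, b, mu, s2 in zip(x, y, mean, var):
        if not np.isnan(a) and not np.isnan(b):
            total += (a - b) ** 2            # both values known: ordinary term
        elif np.isnan(a) and np.isnan(b):
            total += 2 * s2                  # E[(X - Y)^2] = 2*var for i.i.d. X, Y
        else:
            known = b if np.isnan(a) else a
            total += (known - mu) ** 2 + s2  # E[(known - X)^2] = (known - mu)^2 + var
    return total
```

With per-attribute statistics computed once over the dataset, this distance plugs into k-means or mean shift wherever a squared Euclidean distance would normally be used, without imputing the missing entries first.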
