Chapter 9 **Identification of Research Thematic Approaches Based on Keywords Network Analysis in Colombian Social Sciences**

José Hernando Ávila-Toscano, Ivón Catherine Romero-Pérez, Ailed Marenco-Escuderos and Eugenio Saavedra Guajardo

Chapter 10 **Data Privacy for Big Data Publishing Using Newly Enhanced PASS Data Mining Mechanism**

Priyank Jain, Manasi Gyanchandani and Nilay Khare

Preface

> This book on data mining discusses a broad set of ideas and presents some of the advanced research in the field. The book is motivated by pervasive applications that retrieve knowledge from real-world big data. It provides the basics, methods, and tools for processing the large volumes of data available in various applications. The chapters discuss applications and research frontiers in data mining, with algorithms and implementation details for real-world use.

> The inundation of data from fields such as scientific observation, medicine, social science, finance, and business has been an issue in the recent past and is expected to become worse in the future. Manual analysis of this large volume of data is difficult, so there is a need to automate the analysis. Data mining is a set of intelligent supporting techniques for discovering useful information and knowledge from big data. This can be through characterization, classification, discrimination, anomaly detection, association, clustering, trend or evolution prediction, etc., and accordingly, data mining can be either descriptive or predictive.

> Researchers and practitioners in areas such as statistics, pattern recognition, machine learning, artificial intelligence, data analytics, and visualization are contributing to the field of data mining for better utilization of data. Data mining finds applications across the entire spectrum of science and technology, from basic sciences to life sciences and medicine, to social, economic, and cognitive sciences, to engineering and computers.

> This book on data mining consists of ten chapters. The chapters present real-world problems in various fields and propose methods to address each of them. The technologies explored are introduced for the reader in each chapter.

> In Chapter I, the ensemble methods in environmental data mining are discussed. Environmental data mining is the nontrivial process of identifying valid, novel, and potentially useful patterns in data from environmental sciences. This chapter proposes ensemble methods like bagging, random forest, boosting, and voting in environmental data mining that combine the outputs from multiple classification models to obtain better results than could be obtained with a single model. In the experimental studies, ensemble methods are tested on different real-world environmental datasets in various subjects such as air, ecology, rainfall, and soil.

> In order to maintain a good relationship with customers in business or with workers in a production field, there is always the need to collect and analyze the data and interpret the information. Data mining can play an important part in such scenarios as seen in Chapters II and III. Chapter II discusses the estimation of customer lifetime value using machine learning techniques. The chapter includes the passenger network value assessment model for the least resource investment for maximum profit return to help airlines find their significant customers. Chapter III proposes a methodology for determination of crew productivity using data mining methods. The identification of leadership types that will motivate and support employees has great importance in construction businesses where the human element is of significance. For this purpose, the relationship of productivity between the engineers working in construction companies and workers who work with them is examined and analyzed using data mining methods.


Chapter IV discusses mining human-computer interaction (HCI) data for theory of mind induction. HCI data has the potential for understanding a human user's intentions, goals, and desires. Knowing what users want and need is key to intelligent system assistance. The theory of mind concept known from studies in animal behavior is adopted and adapted for expressive user modeling. Theories of mind are hypothetical user models representing, to some extent, a human user's thoughts. Theories of mind are induced by mining HCI data.

Chapter V discusses performance-aware high-performance computing for remote sensing big data analytics. The chapter introduces a novel high-performance computing system on a geo-distributed private cloud for remote sensing applications. It takes advantage of network topology, exploits the utilization and workloads of CPU, storage, and memory resources in a distributed fashion, and optimizes resource allocation to realize big data analytics efficiently.

Data mining techniques are used these days in the medical field as they have the potential to improve health systems. Data mining uses data and analytics to identify the best practices that improve care at reduced cost. Chapter VI proposes a predictive model for early prediction of patient mortality. Experimental evaluation is conducted on patients admitted to ICUs with renal failure. Chapter VII presents a semantic infrastructure for a service environment supporting successful aging. Aging individuals living independently at home need new kinds of services and service environments. Digitalization of services and the data gathered from the individuals create an opportunity for more optimized and punctual services. The data gathered through digital equipment is used in optimizing service processes. However, the service process lacks a common ontology and semantic infrastructure for using the gathered data in service optimization. This chapter introduces the service environment and semantic infrastructure, which could be used in social and health care. Chapter VIII presents an adaptive neural network classifier-based analysis of big data in health care. An FCM-based MapReduce programming model is used for parallel computing using the AANN approach. The FCM-based MapReduce clusters the large medical datasets into smaller groups of certain similarity and assigns each data cluster to one Mapper, where the training of neural networks is done by the optimal selection of the interconnection weights by the Whale Optimization Algorithm (WOA). Finally, the reducer reduces all the AANN classifiers obtained from the Mappers for identifying the normal and abnormal classes of the newer medical records promptly and accurately.

Chapter IX is on the identification of research thematic approaches based on keywords network analysis in Colombian social sciences. This chapter unveils the structure of knowledge of social sciences in Colombia through the analysis of thematic networks and its association with different disciplines' new knowledge production to define scenarios and trends in each. About 2992 articles published in the period 2006–2015 are reviewed in this work, all indexed in Web of Science, Scopus, and other bibliographic databases, applying the social network analysis technique to the keywords of all of them. The analysis includes each discipline's clustering coefficient and group metrics. The results described in this chapter identify how social disciplines in Colombia have mainly focused their research production on topics such as armed conflict, poverty, and human development. Chapter X presents data privacy for big data publishing using a newly enhanced PASS data mining mechanism. The growth in data has made anonymization using conventional processing methods inefficient. This chapter proposes the PASS mechanism in the Hadoop framework to reduce the processing time of anonymization. In this work, the whole program is divided into the map and reduce parts. Moreover, the data types used in Hadoop provide better serialization and transport of data.


The intended audience of this book will mainly consist of students, researchers, practitioners, data analysts, and business professionals who seek information on the various data mining techniques and their applications.

I would like to convey my gratitude to everyone who contributed to this book including the authors of the accepted chapters. My special thanks go to the Publishing Process Manager, Mr. Julian Virag, and other staff of InTech publishing for their support and efforts in bringing the book to fruitful completion.

> **Prof. Ciza Thomas, Professor and Head** College of Engineering Trivandrum, India

**Chapter 1**

### **Ensemble Methods in Environmental Data Mining**

Goksu Tuysuzoglu, Derya Birant and Aysegul Pala

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.74393

**Abstract**

Environmental data mining is the nontrivial process of identifying valid, novel, and potentially useful patterns in data from environmental sciences. This chapter proposes ensemble methods in environmental data mining that combine the outputs from multiple classification models to obtain better results than could be obtained by an individual model. The study presented in this chapter focuses on several ensemble strategies in addition to the standard single classifiers such as decision tree, naive Bayes, support vector machine, and k-nearest neighbor (KNN), popularly used in the literature. This is the first study that compares four ensemble strategies for environmental data mining: (i) *bagging*, (ii) bagging combined with random feature subset selection (the *random forest* algorithm), (iii) *boosting* (the AdaBoost algorithm), and (iv) *voting* of different algorithms. In the experimental studies, ensemble methods are tested on different real-world environmental datasets in various subjects such as air, ecology, rainfall, and soil.

**Keywords:** data mining, classification, ensemble learning, environmental data, bagging, random forest, AdaBoost

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **1. Introduction**

*Environmental data mining* is defined as extracting knowledge from huge sets of environmental data. It is an interdisciplinary area of both computer and environmental sciences, including but not limited to environmental information management systems, decision support systems, recommender systems, environmental data analytics, and so on.

Environmental data mining based on ensemble learning is a rather young research area where a set of learners are trained sequentially on the dataset to better analyze and understand environmental processes and systems. However, it is not yet well known how ensemble methodology can be utilized to improve the performance of a single method. For this purpose, this chapter presents the findings of a systematic survey of what is currently done in the area and aims to investigate the ability of different ensemble strategies for environmental data mining.

Ensemble learning in environmental data mining (ELEDM) can be drawn as a combination of three main areas: data mining (DM), machine learning (ML), and environmental science (**Figure 1**). ML in environmental science is learning-driven, meaning that machines teach themselves to recognize patterns by analyzing environmental data, whereas DM is discovery-driven, meaning that patterns are automatically discovered from environmental data. DM uses many ML methods, including ensemble learning methods.

**Figure 1.** Interdisciplinary structure of ensemble learning in environmental data mining (ELEDM).

The novelty and main contributions of this chapter are as follows. First, it provides a brief survey of ensemble learning used in environmental data mining. Second, it presents how an ensemble of classifiers can be applied on environmental data in order to improve the performance of a single classifier. Third, it is the first study that compares different ensemble strategies on different environmental datasets in terms of classification accuracy.

**2.1. Ensemble-based environmental data mining**

Ensemble classifiers have been applied to different environmental subjects, such as air [1–6], water [7–9], soil [10–12], plant [13], forests [14, 15], climate [16–18], noise [19], rainfall [20], energy [21–23], as well as living organisms [18, 24, 25]. Some of the ensemble-based environmental data mining studies have been compared in **Table 1**. In this table, the scopes of the studies, the year they were performed, the algorithms that were used in the studies, the type of data mining task, the success rate with the validation method, and the ensemble strategy are listed. In addition, if more than one algorithm is presented and compared with each other, the proposed one (the most successful one) is also indicated. As given in the table, ensemble of models for classification or prediction has higher interest than ensemble clustering and anomaly detection [2, 22] in environmental science. Although ensemble clustering has been used in many areas, especially in bioinformatics, only a few studies [4, 25] have been conducted so far in the environmental science.

| Ref. | Year | Type | Description | Data mining task | Ensemble strategy | Algorithms | Validation |
|---|---|---|---|---|---|---|---|
| [22] | 2017 | Energy | Identification of anomalous consumption patterns in building energy consumption | Anomaly detection | 2, 4 | RF, SVR, CCAD-SW using autoencoder and PCA, EAD | TPR = 98.10%, FPR = 1.98% (for EAD model) |
| [18] | 2016 | Climate | Determination of the impact of climate change on the habitat suitability for large brown trout | Prediction | 1, 2 | Generalized additive models, MLP with bagging ensembles, RF, SVM, and fuzzy rule-based systems (TSK) | Threefold cross validation; weighted MSE = 0.18 (MLP with bagging ensembles); overall true skill statistics (TSS) = 0.69 (RF) |
| [11] | 2015 | Soil | Classification of complex land use/land cover categories of desert landscapes using remotely sensed data | Classification | 2, 3 | RF and boosted ANNs | Mean class user's accuracy = 86.7% (for boosted ANN) and 86.6% (for RF ANN) |
| [26] | 2015 | Soil | Solve the problem of rare classes' classification on dust storm forecasting | Classification | 2, 3 | SMOTE with AdaBoost and RF (SARF), SVM, fuzzy ANN | Tenfold cross validation; accuracy = 96.51% (SARF) |
| [4] | 2015 | Air | Forecasting of air pollutant values for the Attica area | Clustering | 2, 4 | SOM for clustering, FFANN and RF ANN for regression, FIS to obtain fuzzy values | Tenfold cross validation; RMSE and R<sup>2</sup> |
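To make the idea of combining several classifiers concrete, the following is a minimal sketch of bagging with majority voting, written in plain Python. It is not the chapter's implementation: the base learner (a one-feature decision stump), the helper names (`majority_vote`, `train_stump`, `bagging_fit`, `bagging_predict`), and the toy air-quality readings in `DATA` are all assumptions made for illustration. Random forest and AdaBoost are not covered by this sketch.

```python
import random
from collections import Counter

# Hypothetical toy data: (PM10 level, temperature) -> air quality label.
DATA = [
    ((12, 20), "good"), ((18, 22), "good"), ((25, 19), "good"), ((30, 25), "good"),
    ((55, 21), "poor"), ((60, 18), "poor"), ((72, 24), "poor"), ((80, 20), "poor"),
]

def majority_vote(labels):
    """Return the most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(rows):
    """Fit a decision stump: the single (feature, threshold) split with fewest errors."""
    default = majority_vote([y for _, y in rows])
    best_errors, best_model = len(rows) + 1, None
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            left = [y for xi, y in rows if xi[f] <= t]   # never empty: x itself is here
            right = [y for xi, y in rows if xi[f] > t]
            l_lab = majority_vote(left)
            r_lab = majority_vote(right) if right else default
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if errors < best_errors:
                best_errors, best_model = errors, (f, t, l_lab, r_lab)
    return best_model

def predict_stump(model, x):
    f, t, l_lab, r_lab = model
    return l_lab if x[f] <= t else r_lab

def bagging_fit(rows, n_models=15, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(rows) for _ in rows]) for _ in range(n_models)]

def bagging_predict(models, x):
    """Combine the individual predictions by majority voting."""
    return majority_vote([predict_stump(m, x) for m in models])

if __name__ == "__main__":
    models = bagging_fit(DATA)
    for reading in [(10, 21), (90, 22)]:
        print(reading, bagging_predict(models, reading))
```

Each stump sees a slightly different resampled dataset, so an unlucky sample affects only one vote. In practice a library implementation would normally be preferred over a hand-written one.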
