Preface

The digitization and subsequent automation of global activities have considerably enhanced our capabilities for both creating and amassing data from various sources. This has resulted in a flood of data in almost every facet of our lives. The explosive growth in warehoused data has generated an urgent need for new techniques and automated tools that can help us convert the available big data into useful information and knowledge. Data mining is a promising, leading-edge technology for mining large volumes of data for knowledge discovery. Data mining algorithms can be used either to describe data clearly or to predict future outcomes from data. This can be accomplished through characterization, summarization, association, clustering, classification, discrimination, anomaly detection, trend or evolution prediction, and much more. Accordingly, data mining can be either descriptive or predictive.

Researchers and practitioners in statistics, pattern recognition, machine learning, artificial intelligence, data analytics, and visualization are contributing to the field of data mining for better utilization of data. Data mining finds applications across the entire spectrum of science and technology, from basic sciences to life sciences and medicine, to social, economic, and cognitive sciences, to engineering and computing. Data mining also finds extensive application in business analytics.

This book discusses the concepts of data mining and presents some of the advanced research in this field. The book provides the fundamentals, techniques, and methods of processing big data for various applications. The chapters discuss the concepts, applications, and research frontiers in data mining with algorithms and implementation details for use in the real world. It includes twelve chapters divided into two sections: "Concepts of Data Mining" and "Applications of Data Mining." The initial seven chapters describe the concepts of data mining, while the remaining five chapters discuss the applications of data mining. The chapters include real-world problems in various fields and propose methods to address them. The first chapter introduces readers to the technologies explored in each of the subsequent chapters.

Chapter 1 provides an overview of the data mining process, including its benefits and drawbacks, and discusses data mining methodologies and tasks. This chapter also discusses data mining techniques in terms of their features, benefits, drawbacks, and application areas.

After the introductory chapter on the concepts of data mining, we look at the various steps in the data mining process. The initial step after acquiring the data is data cleaning, followed by data integration, data reduction, and data transformation. The data is then analyzed and evaluated for knowledge discovery.

Chapter 2 describes the initial step of data cleaning to prepare data for strategic decisions. Because pre-processing is an important step in the data mining process, data cleaning helps in reaching accurate strategic decisions. The presence of incorrect or inconsistent data can significantly distort the results of analyses, often negating the potential benefits of strategic decision-making approaches. Thus, the representation and quality of the data must be addressed before running an analysis. As such, this chapter identifies the sources of data collection in order to remove errors and describes data cleaning in data mining and its methods.

Privacy has become a serious problem, especially in data mining applications that involve the collection and sharing of personal data. The problem of protecting privacy in the context of data mining also differs from traditional data privacy protection, as data mining can act as both friend and foe. Chapter 3 discusses privacy-preserving data mining and its two classes of techniques: those proposed for input data that will be subject to data mining, and those suggested for processed data that is the output of data mining algorithms. This chapter also presents attacks against the privacy of data mining applications. The chapter concludes with a discussion of next-generation privacy-preserving data mining applications at both the individual and organizational levels.

Chapter 4 explains multi-label classification based on graph neural networks. Typical Laplacian embedding focuses on building Laplacian matrices prior to minimizing weights of connected graph components. However, for multi-label problems, it is difficult to determine such Laplacian graphs owing to multiple relations between vertices. Unlike the typical approaches that require pre-computed Laplacian matrices, this chapter presents a new method for automatically constructing Laplacian graphs during Laplacian embedding. By using trace minimization techniques, the topology of the Laplacian graph can be learned from input data, thus creating robust Laplacian embedding and influencing graph convolutional networks. The experimental results show that the method proposed in this work performs better than the baselines, even when the data is contaminated with noise.

In the cyber world, modern-day malware is quite intelligent, able to hide its presence on the network and perform stealthy operations in the background. The advanced persistent threat (APT) is one such kind of malware attack on sensitive corporate and banking networks that can remain undetected for a long time. In real-time corporate networks, identifying the presence of intruders is a challenging task for security experts. Chapter 5 presents a study on data mining and machine learning techniques in APT attribution and detection. In this chapter, the authors shed light on various data mining and machine learning techniques and frameworks used in both attribution and detection of APT malware. Additionally, the chapter highlights gap analysis and the need for a paradigm shift in existing techniques to deal with evolving modern APT malware.

Instagram is one of the world's top ten most popular social networks, and one of its main uses is social media marketing. Chapter 6 focuses on text classification of Instagram captions using support vector machines (SVMs). The proposed SVM approach categorizes 66,171 post captions into organized groups, such as fashion, food and beverage, technology, health and beauty, and lifestyle and travel, to identify what is trending on the platform. The chapter uses the term frequency-inverse document frequency (TF-IDF) method and percentage variations for data separation in this study.
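To make the TF-IDF weighting mentioned above more concrete, the following is a minimal sketch in plain Python. The toy captions, the function name `tf_idf`, and the unsmoothed formula tf × log(N / df) are assumptions made for illustration only; the chapter itself may use a different variant of the weighting and a different tokenization.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical captions, not taken from the chapter's data set.
captions = [
    "new summer dress sale",
    "fresh coffee and cake",
    "summer coffee special",
]
weights = tf_idf(captions)
```

Terms that occur in only one caption, such as "dress", receive a higher weight than terms shared by several captions, such as "summer". Weight vectors of this kind are what a classifier such as an SVM would typically take as input.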

The main challenges in data mining are related to large, multi-dimensional data sets. There is a need to develop algorithms that are precise and efficient enough to deal with big data problems. The simplex algorithm from linear programming is an example of a successful big data problem-solving tool.

According to the fundamental theorem of linear programming, the solution of the optimization problem can be found at one of the vertices in the parameter space. The basis exchange algorithms also search for the optimal solution among a finite number of vertices in the parameter space. Basis exchange algorithms enable the design of complex layers of classifiers or predictive models based on a small number of multivariate data vectors. Chapter 7 discusses computing on vertices in data mining. The chapter considers computational schemes for designing classifiers or prognostic models based on a data set that consists of a small number of high-dimensional feature vectors. It also discusses in detail the concept of a complex layer composed of many linear prognostic models built in low-dimensional features.
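The idea that an optimum can be found at a vertex can be illustrated with a small sketch. The two-variable problem below (maximize x + 2y subject to x + y ≤ 4, x + 3y ≤ 6, x ≥ 0, y ≥ 0) is a hypothetical toy example, and the brute-force enumeration of vertices is used only because it is easy to read. It is not the simplex method or the basis exchange algorithms discussed in Chapter 7, which move between vertices far more efficiently.

```python
from itertools import combinations

# Constraints in the form a*x + b*y <= c (hypothetical toy problem).
constraints = [
    (1.0, 1.0, 4.0),    # x + y  <= 4
    (1.0, 3.0, 6.0),    # x + 3y <= 6
    (-1.0, 0.0, 0.0),   # x >= 0
    (0.0, -1.0, 0.0),   # y >= 0
]

def objective(x, y):
    """Linear objective to maximize."""
    return x + 2.0 * y

def feasible(x, y, tol=1e-9):
    """Check that a point satisfies every constraint (with a small tolerance)."""
    return all(a * x + b * y <= c + tol for a, b, c in constraints)

def vertices():
    """Intersect every pair of constraint boundaries and keep feasible points."""
    points = []
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < 1e-12:      # parallel boundaries never intersect
            continue
        # Cramer's rule for the 2x2 system of the two boundary lines.
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if feasible(x, y):
            points.append((x, y))
    return points

best = max(vertices(), key=lambda p: objective(*p))
```

Here the feasible region has four vertices, and evaluating the objective at each of them is enough to locate the maximum, which lies at the intersection of the two non-trivial constraints.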

Nowadays, the increase in data acquisition and in the complexity of optimization makes it imperative to use artificial intelligence and optimization jointly for devising data-driven and intelligent decision support systems. A decision support system can be successful if large amounts of interactive data are processed quickly to extract useful information and knowledge that help in real-time decision-making. In this context, the data-driven approach has gained prominence because it provides insights for decision-making and is easy to implement. The data-driven approach can discover various database patterns without relying on prior knowledge while also handling flexible objectives and multiple scenarios. Chapter 8 introduces artificial intelligence and its application in data-driven optimization. The chapter reviews recent advances in data-driven optimization, highlighting the promise of data-driven optimization that integrates mathematical programming and machine learning for decision-making. It also presents perspectives on reinforcement learning (RL)-based data-driven optimization and deep RL for solving NP-hard problems. The chapter investigates the application of data-driven optimization in different case studies to demonstrate the improvements in operational performance over conventional optimization methodology. Finally, the chapter includes some managerial implications and provides some future directions.

Chapter 9 is a detailed discussion on the practical application of the clustering algorithm. This chapter surveys clustering, an unsupervised learning technique used in data mining and machine learning. The most popular clustering algorithm is K-means, which requires choosing an appropriate value of K for partitioning the training dataset. It is common to find this value experimentally. Alternatively, the elbow method, a heuristic for determining the number of clusters, can be used. The particulate matter concentration clustering algorithm for particulate matter distribution estimation applies K-means to cluster feature data sets in order to find the observatory location representing the particulate matter distribution.
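As a rough illustration of how the elbow method can guide the choice of K, the sketch below runs a plain K-means on one-dimensional data for several values of K and records the within-cluster sum of squared distances (often called inertia). The function name `kmeans_1d`, the deterministic initialization, and the `readings` values are all assumptions made for this example; they are not the particulate matter data or the implementation used in the chapter.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain K-means on 1-D data; returns (centroids, inertia)."""
    data = sorted(values)
    # Deterministic initialization: spread starting centroids over the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            # Assign each value to its nearest centroid.
            nearest = min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    # Inertia: sum of squared distances from each value to its nearest centroid.
    inertia = sum(min((v - c) ** 2 for c in centroids) for v in data)
    return centroids, inertia

# Hypothetical readings forming three well-separated groups.
readings = [10, 11, 12, 30, 31, 32, 60, 61, 62]
inertias = {k: kmeans_1d(readings, k)[1] for k in range(1, 6)}
```

For these toy readings the inertia falls steeply as K goes from 1 to 3 and barely changes afterwards, so the "elbow" of the curve suggests K = 3. On real data the bend is usually less clear-cut, which is why the method is regarded as a heuristic.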

Chapter 10 looks at the leaching mechanisms of trace elements from coal and host rock using data mining. Coal and host rock, including gangue dump, are important sources of toxic elements that have great potential to contaminate surface and ground water. The leaching and migration of trace elements are controlled mainly by two factors: the occurrence of the trace elements and the surrounding environment. The traditional approach to investigating the occurrence and leaching mechanisms of elements is based on geochemical methods. In this chapter, data mining is applied to discover the relationships and patterns that are concealed in the data matrix. From the geochemical point of view, these patterns correspond to the occurrence and leaching mechanisms of trace elements from coal and host rock. An unsupervised machine learning method using principal component analysis is applied to reduce the dimensions of the data matrix of solid and liquid samples, and the re-calculated data is then clustered using the Gaussian mixture model to find its co-existing patterns.

Chapter 11 introduces the sentiment mining of tourists based on deep learning. Mining the sentiment of users on the Internet from context plays a significant role in uncovering human emotion and determining the exactness of the underlying emotion in that context. An increasing amount of user-generated content (UGC) on social media and online travel platforms has led to the development of data-driven sentiment analysis (SA), and most extant SA in the domain of tourism is conducted using document-based sentiment analysis (DBSA). However, unlike aspect-based sentiment analysis (ABSA), DBSA cannot be used to examine which specific aspects need to be improved or to disclose the unknown dimensions that affect the overall sentiment. ABSA requires accurate identification of the aspects and sentiment orientation in the UGC. In this chapter, the contribution of data mining based on deep learning to sentiment and emotion detection is clearly illustrated.

Chapter 12 explains data mining applied to predicting community satisfaction with rehabilitation and reconstruction projects. Natural disasters can occur anytime and anywhere, especially in areas with high disaster risk. Rehabilitation and reconstruction projects have been implemented to restore and accelerate economic growth in such cases. As such, a study is needed to determine whether the rehabilitation and reconstruction that have been carried out resulted in community satisfaction. The results of further analysis are expected to predict the level of community satisfaction for the implementation of rehabilitation and other reconstruction. This chapter uses predictive modeling with a data mining approach. The analysis results show that the artificial neural network and the SVM with a data mining approach can develop a community satisfaction prediction model to implement rehabilitation and reconstruction after earthquake-tsunami and liquefaction disasters.

This book is for students, researchers, practitioners, data analysts, and business professionals who seek information on the various data mining techniques and their applications.

I would like to convey my gratitude to everyone who contributed to this book, including the authors of the accepted chapters. My special thanks go to the Publishing Process Manager, Ms. Mia Vulovic, and other staff at IntechOpen for their support and efforts in bringing this book to fruitful completion.
