Automatic Systems

**Chapter 4**

**Abstract**

data analytics

**61**

**1. Introduction**

Incomplete Data Analysis

This chapter discusses missing-value problems from the perspective of machine learning. Missing values frequently occur during data acquisition. When a dataset contains missing values, nonvectorial data are generated. This subsequently causes a serious problem in pattern recognition models because nonvectorial data need further data wrangling before models are built. In view of such, this chapter reviews the methodologies of related works and examines their empirical effectiveness. At present, a great deal of effort has been devoted in this field, and those works can be roughly divided into two types — Multiple imputation and single imputation, where the latter can be further classified into subcategories. They include deletion, fixed-value replacement, *K*-Nearest Neighbors, regression, tree-based algorithms, and latent component-based approaches. In this chapter, those approaches are introduced and commented. Finally, numerical examples are provided along with

**Keywords:** data imputation, missing value analysis, missing data, data wrangling,

With recent development of the Internet of Things (IoT), communication technology, and wireless sensor networks, a huge amount of data is generated every day [1, 2]. It becomes easier to collect data in quantities than before, but new challenges subsequently arise. For example, in manufacturing industries, manufacturers often utilize and deploy thousands of sensors in production lines to monitor the quality of products and to detect possible anomaly or abnormal events. Those sensing data are stored in databases for further analysis. Nonetheless, the collected data are not always perfect. Missing-value entries may appear in databases. When a dataset contains missing values, it is referred to as an incomplete dataset. Missing values in manufacturing industries frequently occur due to sensor failure, particle occlusion, and physical/chemical interferences [3–5]. Unfortunately, most of them happen due to unknown causes and unforeseen circumstances. In addition to manufacturing industries, missing value problems also frequently occur in biomedical areas, for instance, microarray profiling — One of the commonly used tools in genomics [6]. Microarrays are well known for rapidly automated measurement at massive scale — High-throughput biotechnologies [7], but microarrays suffered from missing-value problems [8]. During microarray profiling, missing values might occur as a result of different reasons, such as human errors, dust or scratches on the slide [9], spotting problems, poor/failed hybridization [10], insufficient image resolutions, and fabrication errors [11]. Those unpredictable factors thereby increase the opportunity of

*Bo-Wei Chen and Jia-Ching Wang*

recommendations on future development.
