**1. Introduction**

With the recent development of the Internet of Things (IoT), communication technology, and wireless sensor networks, a huge amount of data is generated every day [1, 2]. It has become easier than before to collect data in large quantities, but new challenges subsequently arise. For example, in manufacturing industries, manufacturers often deploy thousands of sensors in production lines to monitor product quality and to detect possible anomalies or abnormal events. The sensing data are stored in databases for further analysis. Nonetheless, the collected data are not always perfect: missing-value entries may appear in the databases. A dataset that contains missing values is referred to as an incomplete dataset. In manufacturing industries, missing values frequently occur due to sensor failure, particle occlusion, and physical/chemical interferences [3–5]; unfortunately, most of them happen due to unknown causes and unforeseen circumstances. In addition to manufacturing industries, missing-value problems also frequently occur in biomedical areas, for instance, microarray profiling, one of the commonly used tools in genomics [6]. Microarrays are well known for rapid, automated measurement at massive scale (high-throughput biotechnologies [7]), but they suffer from missing-value problems [8]. During microarray profiling, missing values may arise for different reasons, such as human errors, dust or scratches on the slide [9], spotting problems, poor/failed hybridization [10], insufficient image resolutions, and fabrication errors [11]. These unpredictable factors thereby increase the likelihood of defective microarrays.

In fact, missing-value problems challenge almost every part of our daily applications, ranging from manufacturing/biotechnology industries that rely on sensors to typical service industries that involve questionnaire-based surveys. Questionnaires are used to collect information from respondents. Nonetheless, respondents occasionally fail to provide answers in a format that fits the response categories [12], subsequently generating unanswered questions or invalid formats/responses (e.g., out-of-range values). There are many reasons for such problems, e.g., respondents refused to answer, respondents chose wrong formats [13, 14], respondents intentionally or unintentionally left blanks, testers provided unclear or confusing choices, designs involved sensitive/private questions, and interviews were interrupted. These factors can result in missing values in questionnaires.

*Incomplete Data Analysis* (*DOI: http://dx.doi.org/10.5772/intechopen.94068*)

The difficulty of processing incomplete data is that, when a dataset contains missing values, the corresponding entries are marked with invalid values. Such a dataset becomes nonvectorial because invalid values, commonly represented as Not-a-Number (NaN), are present. To tackle entries marked as NaN, mathematical operations (e.g., pairwise distance) need further revision under such circumstances, because nonvectorial arithmetic is not well defined.

To handle missing-value problems, data imputation is generally used. Data imputation is a statistical term that describes the process of substituting estimated values for missing ones. Related approaches for data imputation [15] can be classified into two types: multiple imputation and single imputation. The former aims at generating two distributions: one for selecting hyperparameters and another for generating data. Multiple imputation uses a function for generating distributional hyperparameters and takes samples from this function to obtain an averaged distributional hyperparameter set. It then utilizes this averaged set to create a statistical distribution describing the data distribution; finally, data samples are drawn from it to replace the missing values. Popular methods for multiple imputation include Expectation-Maximization (EM) algorithms and Markov Chain Monte Carlo (MCMC) strategies [16, 17]. Single imputation, in contrast, does not draw data samples from an uncertain function to substitute for missing data; in brief, it relies on neither sample drawing nor uncertain functions. At present, a great deal of effort has been devoted to single imputation, for example, hot-deck/cold-deck, deletion, fixed-value replacement (e.g., zeros, means, and medians), *K*-Nearest Neighbors (KNNs) [18], regression [10, 19, 20], tree-based algorithms [21, 22], and latent component-based approaches (including matrix completion) [15, 23, 24]. This chapter focuses on *K*-nearest neighbors, regression, tree-based algorithms, and latent component-based approaches, because the imputation errors of hot-deck/cold-deck and fixed-value replacement are not satisfactory, and deletion could result in the loss of discriminant features or samples. Therefore, the subsequent sections lay emphasis on the other methods.

The rest of this chapter is organized as follows. Section 2 introduces related works and their methods. Section 3 shows the numerical results, and finally the conclusions are drawn in Section 4.

### **2. Imputation methods**

The following subsections introduce data imputation using KNNs, regression, tree-based algorithms, and latent component-based approaches, respectively. For clarity, the description of the dataset uses the following definitions and notations. A nonmissing-value dataset is represented as **X** = {**x**<sub>*n*</sub> | *n* = 1, 2, … , *N*}, where **x**<sub>*n*</sub> ∈ ℝ<sup>*M*</sup>, *N* ∈ ℤ<sup>+</sup> refers to the number of samples, and the dimensionality of a sample is *M* ∈ ℤ<sup>+</sup>. Herein, the dimensionality represents the number of independent variables, predictor variables, features, dimensions, control variables, or explanatory variables; these terms are used interchangeably, depending on the research field, e.g., data science, machine learning, or statistics. Furthermore, **X** is max-min normalized. If supervised learning is required, the label information and the response variable corresponding to each sample **x**<sub>*n*</sub> are defined as *c*<sub>*n*</sub> and *y*<sub>*n*</sub>, respectively. The former belongs to the categorical variables (∈ ℕ after encoding), with **c** = {*c*<sub>*n*</sub> | *n* = 1, 2, … , *N*}; the latter belongs to the numerical variables (∈ ℝ), with **y** = {*y*<sub>*n*</sub> | *n* = 1, 2, … , *N*}. Moreover, the sizes of **X**, **c**, and **y** are *M*-by-*N*, 1-by-*N*, and 1-by-*N*, respectively. For **X**, **c**, and **y** that contain missing values, **X̃**, **c̃**, and **ỹ** are used.

**2.1 Imputation based on** *K***-nearest neighbors**

KNNImpute [9] is a popular imputation tool that leverages KNN algorithms to find the *K* nearest neighbors of a given sample **x̃**<sup>*t*</sup> that contains missing values (if no missing values are present, **x**<sup>*t*</sup> ∈ ℝ<sup>*M*</sup>). The substituted values are generated based on the weighted average of those *K* nearest neighbors. Notably, there is a limitation on the selected *K* nearest neighbors when KNNImpute is executed: the dimensions of those *K* nearest neighbors corresponding to the missing-value entries must contain nonmissing-value data. In the plain version, label information was not used while the *K* nearest neighbors were searched (**Table 1**).

**Algorithm: KNNImpute**
**Input**: **X** and **x̃**<sup>*t*</sup>. **Output**: **x̂**<sup>*t*</sup>.

1. Select *M*′ dimensions (*M*′ < *M* and *M*′ ∈ ℤ<sup>+</sup>) without missing values from **x̃**<sup>*t*</sup>
2. Store the indices of the selected dimensions in an *M*′-by-1 vector **s**
3. Apply the KNN algorithm to (**x̃**<sup>*t*</sup>)<sub>**s**,:</sub> based on the dataset **X**<sub>**s**,:</sub>
4. Store the *K* nearest samples in an *M*′-by-*K* matrix
5. Compute a *K*-by-1 weight vector **Ω** = [1/distance(*k*, *t*)] | *k* = 1, … , *K*
6. (**x̂**<sup>*t*</sup>)<sub>**s**,:</sub> = (**x̃**<sup>*t*</sup>)<sub>**s**,:</sub>, and the missing-value entries of **x̂**<sup>*t*</sup> are filled via ⊕ with the **Ω**-weighted average of the *K* nearest samples

**Table 1.** *KNNImpute.*

In the above-mentioned algorithm, ":" means selecting the entire rows or columns at the given positions, and the operator ⊕ means replacing the missing values with the corresponding generated substituted values. Moreover, distance(·, ·) signifies the distance between two samples, e.g., the Euclidean or Manhattan distance. The substituted values are fixed and unchanged once they are generated. Nonetheless, such substituted values are highly affected by initial conditions, such as the selected subset of the *M* independent variables and the number of nearest neighbors. Iterative *K*-Nearest Neighbor imputation (IKNNimpute) [25] alleviated this drawback by using a loop that iteratively produces substituted values, chooses the subset of the *M* independent variables, and reselects near neighbors. **Table 2** lists a simple version of IKNNimpute, where [**X̃**′]<sub>*j*</sub> represents the matrix whose missing-value entries are filled in with substituted values in the *j*-th iteration, and *J* denotes the number of iterations. Besides, **X̃**′ is formed by horizontally concatenating **X̃** and **X̃**<sup>*t*</sup>. Gray KNNs [26] further proposed Gray Relational Analysis to capture
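For illustration, the steps of Table 1 can be sketched in Python with NumPy. This is a minimal sketch, not the original tool: the function name `knn_impute` is ours, Euclidean distance with inverse-distance weights is assumed, the dataset **X** is assumed complete, and all nonmissing dimensions of **x̃**<sup>*t*</sup> are used as the *M*′ selected dimensions.

```python
import numpy as np

def knn_impute(X, x_t, K=5):
    """Minimal KNNImpute sketch. X is an M-by-N complete dataset
    (max-min normalized); x_t is an M-vector with NaN marking
    missing entries. Returns the imputed vector x_hat."""
    s = ~np.isnan(x_t)                       # dimensions without missing values
    # Distances to every sample, computed on the observed dimensions only
    d = np.sqrt(((X[s, :] - x_t[s, None]) ** 2).sum(axis=0))
    nn = np.argsort(d)[:K]                   # indices of the K nearest neighbors
    w = 1.0 / (d[nn] + 1e-12)                # inverse-distance weights (Omega)
    w /= w.sum()
    x_hat = x_t.copy()
    x_hat[~s] = X[~s][:, nn] @ w             # weighted average fills missing dims
    return x_hat
```

The observed entries are copied unchanged (step 6), while each missing entry becomes the **Ω**-weighted average of the corresponding entries of the *K* nearest samples.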

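In the same spirit, the iterative scheme behind IKNNimpute can be sketched as follows: initialize the missing entries with per-dimension means, then repeat for *J* iterations, reselecting the nearest neighbors from the current estimate and regenerating the substituted values. The function and parameter names are hypothetical, and the published algorithm additionally re-chooses the subset of independent variables in each round, which is omitted here for brevity.

```python
import numpy as np

def iknn_impute(X_tilde, K=5, J=10):
    """Sketch of iterative KNN imputation. X_tilde is an M-by-N matrix
    whose missing entries are marked with NaN; substituted values are
    re-estimated for J iterations instead of being fixed once."""
    miss = np.isnan(X_tilde)
    X_hat = X_tilde.copy()
    dim_means = np.nanmean(X_tilde, axis=1)      # per-dimension means
    X_hat[miss] = dim_means[np.where(miss)[0]]   # initial substitution
    for _ in range(J):                           # iterations j = 1, ..., J
        for n in np.where(miss.any(axis=0))[0]:  # each incomplete sample
            m = miss[:, n]                       # its missing dimensions
            d = np.sqrt(((X_hat - X_hat[:, [n]]) ** 2).sum(axis=0))
            d[n] = np.inf                        # exclude the sample itself
            nn = np.argsort(d)[:K]               # reselect K nearest neighbors
            w = 1.0 / (d[nn] + 1e-12)            # inverse-distance weights
            X_hat[m, n] = X_hat[np.ix_(m, nn)] @ (w / w.sum())
    return X_hat
```

Because the neighbors are recomputed from the current estimate at every pass, the substituted values can drift away from the initial mean fill, which is the drawback of fixed one-shot substitution that IKNNimpute addresses.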
*Incomplete Data Analysis DOI: http://dx.doi.org/10.5772/intechopen.94068*

defect microarrays. In fact, missing-value problems almost challenge every part of our daily applications ranging from manufacturing/biotechnology industries that rely on sensors to typical service industries that involve questionnaire-based surveys. Questionnaires are used to collect information from respondents. Nonetheless, respondents occasionally fail to provide answers that match the format to fit the response categories [12], subsequently generating unanswered questions or invalid formats/responses (e.g., out-of-range values). There are many reasons for such problems, e.g., respondents refused to answer, respondents chose wrong formats [13, 14], respondents intentionally/unintentionally left blanks, testers addressed unclear/confusing choices, designs involved sensitive/private questions, and interviews were interrupted. These factors could result in missing values in

The difficulty of processing incomplete data is that when a dataset contains missing values, the corresponding entries are marked with invalid values. Accordingly, such a dataset becomes nonvectorial because invalid values are present (which are constantly represented as Not-a-Number (NaN)). To tackle those entries with NaN, mathematical operations (e.g., pairwise distance) need further revision under

To handle missing-value problems, data imputation is generally used. Data imputation is a statistical term that describes the process of substituting estimated values for missing ones. Related approaches for data imputation [15] can be classified into two types: Multiple imputation and single imputation. The former is aimed at generating two distributions. One is a distribution for selecting hyperparameters, and the other is a distribution for generating data. Multiple imputation uses a function for generating distributional hyperparameters and takes samples from such a function to obtain an averaged distributional hyperparameter set. Multiple imputation then utilizes this averaged distributional hyperparameter set to create a statistical distribution for describing the data distribution. Finally, data samples are drawn to replace missing values. Popular methods for multiple imputation include

such circumstances because nonvectorial arithmetic is not well defined.

Expectation Maximization (EM) algorithms or Monte Carlo Markov Chain (MCMC) strategies [16, 17]. Regarding single imputation, it does not involve drawing data samples from an uncertain function to substitute for missing data as multiple imputation does. In brief, single imputation relies on neither sample drawing nor uncertain functions. At present, a great deal of effort has been devoted to single imputation, for example, hot-deck/cold-deck, deletion, fixed-value replacement (e.g., zeros, means, and medians), *K*-Nearest Neighbors (KNNs) [18], regression [10, 19, 20], tree-based algorithms [21, 22], and latent component-based approaches (including matrix completion) [15, 23, 24]. This chapter focuses on *K*-Nearest Neighbors, regression, tree-based algorithms, and latent component-based approaches because the imputation errors of deck/cold-deck and fixed-value replacement are not satisfactory. Moreover, deletion could result in loss of discriminant features or samples. Therefore, the subsequent sections lay emphasis on the other methods. The rest of this chapter is organized as follows. Section 2 introduces related works, and their methods are subsequently introduced. Sections 3 shows the numerical results, and finally the conclusions are drawn in Section 4.

The following subsections introduce data imputation using KNNs, regression, tree-based algorithms, and latent component-based approaches, respectively. For clarity, the description on the dataset uses the following definitions and notations.

A nonmissing-value dataset is represented as **X** = {**x***<sup>n</sup>* |*n* **=** 1, 2, … , *N*}, where

questionnaires.

*Applications of Pattern Recognition*
