#### **2.1 Imputation based on** *K***-nearest neighbors**

KNNImpute [9] is a popular imputation tool that leverages KNN algorithms to find the *K*-nearest neighbors of a given sample $\tilde{\mathbf{x}}^{t}$ that contains missing values (if no values were missing, $\mathbf{x}^{t} \in \mathbb{R}^{M}$). The substituted values are generated as the weighted average of those *K*-nearest neighbors. Notably, there is a restriction on the selected *K*-nearest neighbors when KNNImpute is executed: the dimensions of those neighbors corresponding to the missing-value entries must contain nonmissing values. In the plain version, label information was not used while the *K*-nearest neighbors were searched (**Table 1**).
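The following is a minimal sketch of the weighted-neighbor procedure just described, written in NumPy with the feature-by-sample layout used throughout this chapter; the function name, the inverse-distance weighting, and the tie-breaking are illustrative assumptions rather than the reference implementation of [9].

```python
import numpy as np

def knn_impute(X, K=5):
    """Hypothetical sketch of KNNImpute: X is M-by-N (features x samples)
    with np.nan marking missing entries; returns a filled copy."""
    X_hat = X.copy()
    M, N = X.shape
    for t in range(N):
        miss = np.isnan(X[:, t])
        if not miss.any():
            continue                      # complete sample, nothing to impute
        candidates = []
        for n in range(N):
            # a neighbor must be observed on the target's missing dimensions
            if n == t or np.isnan(X[miss, n]).any():
                continue
            shared = ~miss & ~np.isnan(X[:, n])
            if not shared.any():
                continue
            d = np.linalg.norm(X[shared, t] - X[shared, n])   # Euclidean distance
            candidates.append((d, n))
        candidates.sort(key=lambda c: c[0])
        top = candidates[:K]
        if not top:
            continue
        w = np.array([1.0 / (d + 1e-12) for d, _ in top])     # inverse-distance weights
        w /= w.sum()
        nbr_idx = [n for _, n in top]
        # weighted average of the neighbors' values on the missing dimensions
        X_hat[miss, t] = X[np.ix_(np.where(miss)[0], nbr_idx)] @ w
    return X_hat
```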

In the algorithm of **Table 1**, ":" means selecting an entire row or column at the given position, and the operator ⊕ replaces the missing values with the corresponding generated substituted values. Moreover, distance(·, ·) signifies the distance between two samples, e.g., the Euclidean or Manhattan distance. Those substituted values were fixed and unchanged once they were generated. Nonetheless, such substituted values were highly affected by initial conditions, such as the subset of the *M* independent variables and the number of nearest neighbors. Iterative *K*-Nearest Neighbor imputation (IKNNImpute) [25] alleviated this drawback by using a loop that iteratively produced substituted values, chose the subset of the *M* independent variables, and reselected the near neighbors. **Table 2** lists a simple version of IKNNImpute, where $\tilde{\mathbf{X}}^{0}[j]$ represents the matrix whose missing-value entries are filled in with the substituted values of the *j*-th iteration, and *J* denotes the number of iterations. Besides, $\tilde{\mathbf{X}}^{0}$ is formed by horizontally concatenating $\tilde{\mathbf{X}}$ and $\tilde{\mathbf{x}}^{t}$. Gray KNNs [26] further adopted Gray Relational Analysis to capture the pairwise distance between samples, so that near neighbors were more appropriately measured and described.


**Table 1.** *KNNImpute.*



**Algorithm: IKNNImpute**

**Input**: $\tilde{\mathbf{X}}$ and $\tilde{\mathbf{x}}^{t}$. **Output**: $\hat{\mathbf{X}}$ and $\hat{\mathbf{x}}^{t}$.

1. Form $\tilde{\mathbf{X}}^{0}$ by horizontally concatenating $\tilde{\mathbf{X}}$ and $\tilde{\mathbf{x}}^{t}$.
2. Form $\tilde{\mathbf{X}}^{0}[1]$ by filling in the missing-value entries of $\tilde{\mathbf{X}}^{0}$ with initial replacement.
3. **For** *j* = 1:*J*
4. **For** *n* = 1:(*N* + 1)
5. Apply the KNN algorithm to $\tilde{\mathbf{x}}\_{n}[j]$ based on the dataset $\tilde{\mathbf{X}}^{0}[j]$.
6. Store the *K* nearest samples in an *M*-by-*K* matrix.
7. Compute a *K*-by-1 weight vector $\mathbf{\Omega} = \left[\, 1/\mathrm{distance}(k, n) \mid k = 1, \dots, K \,\right]$.
8. $\hat{\mathbf{x}}\_{n}[j+1] = \hat{\mathbf{x}}\_{n}[j] \oplus{}$ (the $\mathbf{\Omega}$-weighted average of the *K* nearest samples).
9. **End**
10. **End**

**Table 2.** *IKNNImpute.*
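To make the iterative refinement of **Table 2** concrete, the sketch below reselects neighbors on the currently filled matrix at every pass; the per-feature-mean initial replacement and the Euclidean distance are assumptions of this illustration, not necessarily the choices of [25].

```python
import numpy as np

def iknn_impute(X, K=5, J=10):
    """Illustrative sketch of the loop in Table 2: X is M-by-N (features x
    samples) with np.nan for missing entries; J refinement passes."""
    miss = np.isnan(X)
    # step 2: initial replacement (here: per-feature means)
    X_hat = np.where(miss, np.nanmean(X, axis=1, keepdims=True), X)
    for _ in range(J):                          # step 3
        X_new = X_hat.copy()
        for n in range(X.shape[1]):             # step 4
            if not miss[:, n].any():
                continue
            # steps 5-7: reselect K nearest neighbors on the filled matrix
            d = np.linalg.norm(X_hat - X_hat[:, [n]], axis=0)
            d[n] = np.inf                       # exclude the sample itself
            nbrs = np.argsort(d)[:K]
            w = 1.0 / (d[nbrs] + 1e-12)
            w /= w.sum()
            # step 8: replace only the missing entries of sample n
            X_new[miss[:, n], n] = X_hat[np.ix_(np.where(miss[:, n])[0], nbrs)] @ w
        X_hat = X_new
    return X_hat
```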

#### **2.2 Imputation based on regression**

The underlying model of this category is primarily based on the well-known Ordinary Least Squares (OLS) method, which focuses on minimizing the least-squares error

$$E\_{\rm OLS} = \left\| \mathbf{y} - \hat{\mathbf{y}} \right\|\_{2}^{2} = \left\| \mathbf{y} - \mathbf{w}^{\top} \mathbf{X} \right\|\_{2}^{2} = \text{Tr} \left( \left( \mathbf{y} - \mathbf{w}^{\top} \mathbf{X} \right)^{\top} \left( \mathbf{y} - \mathbf{w}^{\top} \mathbf{X} \right) \right). \tag{1}$$

Herein, $\| \cdot \|\_2$ is the L2-norm distance, $\mathbf{w} \in \mathbb{R}^{M}$ denotes an unknown weight vector, $\top$ represents matrix transpose, and $\hat{\mathbf{y}} = \mathbf{w}^{\top}\mathbf{X}$. Moreover, $\mathbf{w} = (\mathbf{X}\mathbf{X}^{\top})^{-1}\mathbf{X}\mathbf{y}^{\top}$ is the closed form for finding the weight vector. Given a nonmissing-value sample $\mathbf{x}^{t}$ whose response variable $\tilde{y}^{t}$ is unknown, $\mathbf{w}^{\top}\mathbf{x}^{t}$ can generate the estimated result. Such methods included Least Squares Imputation (LSImpute) [10], Local LSImpute (LLSImpute) [19], Sequential LLSImpute (SLLSImpute) [27], Iterated LLSImpute (ILLSImpute) [20], Weighted ILLSImpute [28], Regularized LLSImpute (RLLSImpute) [29], and so on.
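As a concrete illustration of this closed form, the following sketch fits $\mathbf{w}$ on complete samples and estimates the missing response of a new sample; the `ridge` argument is an added, hypothetical knob that hints at the regularized variants mentioned below and is not part of plain LSImpute.

```python
import numpy as np

def ols_impute_response(X, y, x_t, ridge=0.0):
    """Sketch of regression-based imputation of a missing response:
    X is M-by-N (complete samples), y holds their N responses, and x_t is the
    M-dimensional sample whose response is to be estimated."""
    M = X.shape[0]
    # closed form w = (X X^T)^(-1) X y^T, solved without an explicit inverse;
    # ridge > 0 gives an L2-regularized (RLLSImpute-like) variant
    w = np.linalg.solve(X @ X.T + ridge * np.eye(M), X @ y)
    return w @ x_t
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse, which is the usual numerically preferable choice.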

In LSImpute [10], the authors examined two types of correlations: those among independent variables (i.e., estimating $\tilde{y}^{t}$ based on $\mathbf{X}$ and $\mathbf{y}$) and those among samples (i.e., estimating $\tilde{y}^{t}$ based only on subsets of $\mathbf{y}$). In the latter, $\tilde{y}^{t}$ and the selected subsets from $\mathbf{y}$ should be highly correlated during OLS modeling [30]. A combined model was also derived by taking the weighted average of the estimate $\hat{y}^{t}$ based on independent variables and that based on samples. Unlike LSImpute, LLSImpute [19] was aimed at the correlation among independent variables, but the response variable became multivariate, namely, $\mathbf{Y} = \{Y\_{n} \mid n = 1, 2, \dots, N\}$, where $Y\_{n} \in \mathbb{R}^{L}$, and $\mathbf{Y}$ was an *L*-by-*N* matrix. The fitting weight was converted from $\mathbf{w}$ to an *M*-by-*L* matrix $\mathbf{W}$. For ILLSImpute, different subsets of independent variables in $\mathbf{X}$ were drawn and examined during iterations to estimate an optimal $\hat{y}^{t}$. SLLSImpute [27] adopted multistage data imputation, where the whole set of missing-value entries was divided into multiple groups based on the missing rate of each


independent variable. The groups with lower missing rates were filled in with substituted values first. Then, those recovered groups were utilized for estimating the other groups with higher missing rates during data reconstruction. For other methods like Weighted ILLSImpute [28] and RLLSImpute [29], variants of OLS models were utilized to adapt to data imputation, for example, locally weighted OLS, L2-norm regularized OLS (i.e., ridge regression), L1-norm regularized OLS (e.g., LASSO (Least Absolute Shrinkage and Selection Operator), Group LASSO, and Sparse Group LASSO), and regression based on other norms [31].

#### **2.3 Imputation based on tree-based algorithms**

Since Random Forests (RFs) [32] were proposed, the performance of tree-structured classification and regression algorithms has been significantly enhanced. Random Forests adopted Bagging (i.e., Bootstrap Aggregating) to perform data sampling and ensemble learning. Multiple trees were created, each based on a randomly selected subset of features (where the number of selected features should be smaller than *M*), and each set of data sampled by Bootstrapping was used to train only one tree. A distinctive structure used in Random Forests was the proximity matrix, which recorded the co-occurrence of pairs of samples in leaf nodes, namely, the frequency with which two samples fell into the same leaf node. Such a proximity matrix was symmetric, of size *N*-by-*N*, and it was used as imputation weighting during data recovery.

Based on [22], two basic principles for handling incomplete data can be derived by summarizing the strategies mentioned therein: preimputation [33] and on-the-fly imputation [21]. The former filled in missing-value entries with fixed replacement at the beginning; the replacement was then iteratively updated using proximity matrices, and the RF growing process was repeated until convergence. The latter, on-the-fly imputation, did not fill in missing-value entries with initial substituted values, and it skipped missing-value samples when computing splitting nodes. When missing-value samples reached the leaves, imputation started. Because the assignment of missing-value samples to leaves involved randomness, iterating until convergence improved the imputation.
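A simplified sketch of the preimputation strategy might look as follows; it assumes a complete response vector is available for growing the forest, follows scikit-learn's samples-as-rows convention (the transpose of the layout used elsewhere in this chapter), and is not the reference implementation of [33].

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_proximity_impute(X, y, n_iter=5):
    """Sketch of proximity-based preimputation: X is N-by-M (samples x
    features) with np.nan for missing entries; y is a response used only
    to grow the forest."""
    miss = np.isnan(X)
    X_hat = np.where(miss, np.nanmean(X, axis=0, keepdims=True), X)  # initial fill
    for _ in range(n_iter):
        rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_hat, y)
        leaves = rf.apply(X_hat)                      # N-by-n_estimators leaf ids
        # proximity[i, j] = fraction of trees in which samples i and j share a leaf
        prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
        np.fill_diagonal(prox, 0.0)
        w = prox / np.maximum(prox.sum(axis=1, keepdims=True), 1e-12)
        # proximity-weighted average of the other samples' current values
        X_hat = np.where(miss, w @ X_hat, X_hat)
    return X_hat
```

Because the proximity matrix is *N*-by-*N*, this sketch is only suitable for moderate sample sizes.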

#### **2.4 Imputation based on latent component-based approaches**

This type of method has a general procedure for reconstructing an incomplete data matrix. Firstly, the missing-value entries of a data matrix $\tilde{\mathbf{X}}$ are filled in with replacement (e.g., zeros). Secondly, new matrix or vector factors are initialized by generating random numbers. Typically, two or three matrix/vector factors are used, e.g., $\mathbf{P} \in \mathbb{R}^{M \times D}$ and $\mathbf{T} \in \mathbb{R}^{D \times N}$, and the product of those factors should be close to $\tilde{\mathbf{X}}$. Thirdly, iterations are performed to improve the replacement. Unlike the aforementioned types of methods, this type has a unique characteristic: the number of latent components $D \in \mathbb{Z}^{+}$ must be set before imputation starts.

Popular methods included SVDImpute (i.e., imputation based on Singular Value Decomposition) [34], Nonlinear Iterative Partial Least Squares-Principal Component Analysis (NIPALS-PCA) [35], matrix completion/approximation, and so forth. A commonality of SVDImpute and NIPALS-PCA was that projection matrices (or eigenvectors) were computed, whereas plain NMF did not compute them.

#### *2.4.1 SVDImpute*

For SVDImpute, let $\hat{\mathbf{X}}[\iota]$ represent the data matrix whose missing-value entries are filled in with the substituted values of the *ι*-th iteration. Subsequently, SVD is performed on $\hat{\mathbf{X}}[\iota]$, so that



$$\left(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\right)[\iota] = \mathrm{SVD}\left(\hat{\mathbf{X}}[\iota]\right), \tag{2}$$



where **U** is an *M*-by-*M* unitary matrix, **Σ** represents an *M*-by-*N* diagonal matrix, of which the diagonal terms are nonnegative real numbers sorted in descending order, and **V** denotes an *N*-by-*N* matrix. By selecting the top *D* largest diagonal values from **Σ**, the following two corresponding matrices **P** and **T** are formed

$$\begin{cases} \mathbf{P}[\iota] = \mathbf{U}\_{:,1:D}[\iota] \\\\ \mathbf{T}[\iota] = \left(\boldsymbol{\Sigma}\mathbf{V}^{\top}\right)\_{1:D,:}[\iota] \end{cases}, \tag{3}$$

where ":,1:*D*" means selecting columns ranging from the first one to the *D*-th one, and "1:*D*,:" extracts rows. The process of (3) is the same as Truncated SVD. Subsequently, the reconstructed data matrix becomes

$$
\hat{\mathbf{X}}[\iota+1] = \tilde{\mathbf{X}} \oplus (\mathbf{P}[\iota] \mathbf{T}[\iota]).\tag{4}
$$

Herein, the operator ⊕ means to replace the missing values of **X**~ with the corresponding generated values by **P**[*ι*]**T**[*ι*]. Eqs. (2)–(4) iterate until convergence.
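A compact NumPy sketch of the loop formed by Eqs. (2)–(4) is given below; the zero initial replacement, the rank *D*, and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

def svd_impute(X, D=2, n_iter=50):
    """Sketch of SVDImpute: X is M-by-N with np.nan for missing entries."""
    miss = np.isnan(X)
    X_hat = np.where(miss, 0.0, X)                             # initial replacement
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)   # Eq. (2)
        P = U[:, :D]                                           # Eq. (3): top-D components
        T = s[:D, None] * Vt[:D, :]
        X_hat = np.where(miss, P @ T, X_hat)                   # Eq. (4): refill missing only
    return X_hat
```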

#### *2.4.2 NIPALS-PCA*

As for NIPALS-PCA (abbreviated as NP below), it minimizes the reconstruction error of

$$\begin{split} E\_{\text{NP}} &= \sum\_{m=1}^{M} \sum\_{n=1}^{N} \left( \mathbf{H} \odot \left( \hat{\mathbf{X}} - \mathbf{PT} \right) \right)\_{m,n}^{2} \\\\ &= \sum\_{n=1}^{N} \left( \mathbf{H}\_{:,n} \odot \left( \hat{\mathbf{X}}\_{:,n} - \left( \mathbf{PT} \right)\_{:,n} \right) \right)^{\top} \left( \mathbf{H}\_{:,n} \odot \left( \hat{\mathbf{X}}\_{:,n} - \left( \mathbf{PT} \right)\_{:,n} \right) \right), \end{split} \tag{5}$$

where ⊙ is the elementwise multiplication, and $\mathbf{H}$ denotes an *M*-by-*N* indicator matrix of the nonmissing-value entries in $\tilde{\mathbf{X}}$. That is, if the entry $\tilde{\mathbf{X}}\_{m,n}$ is nonmissing, then $\mathbf{H}\_{m,n}$ is one; otherwise, it is zero. Let $\boldsymbol{\Phi} = \mathrm{diag}(\mathbf{H}\_{:,n})$ and $\boldsymbol{\Psi} = \mathrm{diag}(\mathbf{H}\_{m,:})$, where *n* = 1, 2, …, *N*, and *m* = 1, 2, …, *M*. Eq. (5) becomes convex if either $\mathbf{P}$ or $\mathbf{T}$ is fixed. Then, a solution can be obtained based on Alternating Least Squares (ALS). Taking the derivative of (5) with respect to $\mathbf{T}\_{:,n}$ and zeroing the result yields

$$\mathbf{T}\_{:,n} = \left(\mathbf{P}^{\top}\boldsymbol{\Phi}\mathbf{P}\right)^{-1}\mathbf{P}^{\top}\boldsymbol{\Phi}\hat{\mathbf{X}}\_{:,n}.\tag{6}$$

Likewise, taking the derivative of (5) with respect to $\mathbf{P}\_{m,:}$ and rearranging the result generates

$$\mathbf{P}\_{m,:} = \hat{\mathbf{X}}\_{m,:} \boldsymbol{\Psi} \mathbf{T}^{\top} \left(\mathbf{T} \boldsymbol{\Psi} \mathbf{T}^{\top}\right)^{-1}. \tag{7}$$

At the beginning, NIPALS-PCA utilizes the SVD result in the first iteration (see (2)) and extracts the top *D* components from $\mathbf{U}$ and $\mathbf{V}$ as $\mathbf{P}$ and $\mathbf{T}$ (see (3)). Subsequently, alternating computation among $\mathbf{P}$, $\mathbf{T}$, and $\tilde{\mathbf{X}} \oplus (\mathbf{PT})$ until convergence generates the solution.
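The masked ALS updates of Eqs. (6) and (7) can be sketched as follows; the small diagonal term added before each solve is a numerical safeguard of this sketch and is not part of the original equations.

```python
import numpy as np

def nipals_pca_impute(X, D=2, n_iter=50):
    """Sketch of masked ALS (Eqs. (5)-(7)): X is M-by-N with np.nan for
    missing entries; H is the nonmissing indicator matrix."""
    H = (~np.isnan(X)).astype(float)
    X_hat = np.where(np.isnan(X), 0.0, X)
    # initialization from the truncated SVD of the zero-filled matrix (cf. Eqs. (2)-(3))
    U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
    P = U[:, :D]
    T = s[:D, None] * Vt[:D, :]
    M, N = X.shape
    for _ in range(n_iter):
        for n in range(N):                              # Eq. (6): column-wise update of T
            Phi = np.diag(H[:, n])
            T[:, n] = np.linalg.solve(P.T @ Phi @ P + 1e-12 * np.eye(D),
                                      P.T @ Phi @ X_hat[:, n])
        for m in range(M):                              # Eq. (7): row-wise update of P
            Psi = np.diag(H[m, :])
            P[m, :] = np.linalg.solve(T @ Psi @ T.T + 1e-12 * np.eye(D),
                                      T @ Psi @ X_hat[m, :])
        X_hat = np.where(np.isnan(X), P @ T, X_hat)     # refill missing entries only
    return X_hat
```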

#### *2.4.3 NMFImpute*

Alternating Least Squares has been widely applied to many models, especially matrix completion/approximation. Nonnegative matrix factorization (NMF) is an important topic in matrix completion/approximation, and it became a highlight when it was applied to recommendation systems in the Netflix contest [36–38]. At present, NMF has developed into multiple variants, including (i) regularization based on L1-norms, L2-norms [39], L2,1-norms, nuclear norms, mixed norms, and graphs, (ii) different loss functions like the Huber loss, the correntropy-induced metric [40, 41], Cauchy functions [42], and Truncated Cauchy functions [43, 44], and (iii) many more, such as projected gradient NMF [45], projective NMF [46, 47], and orthogonal NMF [48]. The following uses the plain version of NMF as an example to elaborate on the details. NMF minimizes the Least Squares error of

$$E\_{\text{NMF}} = \left\| \hat{\mathbf{X}} - \mathbf{PT} \right\|\_{2}^{2} = \text{Tr} \left\{ \left( \hat{\mathbf{X}} - \mathbf{PT} \right)^{\top} \left( \hat{\mathbf{X}} - \mathbf{PT} \right) \right\},\tag{8}$$

where $\mathrm{Tr}(\cdot)$ is the trace operator. Eq. (8) is nonconvex and hard to solve, but it becomes convex when one variable is fixed, so one can use ALS or Coordinate Descent to find the solution. Differentiating (8) with respect to $\mathbf{P}\_{m,d}$ and $\mathbf{T}\_{d,n}$ (where *d* = 1, 2, …, *D*), respectively, yields

$$\begin{cases} \frac{\partial E\_{\text{NMF}}}{\partial \mathbf{P}\_{m,d}} = 2 \left( \mathbf{PTT}^{\top} - \hat{\mathbf{X}} \mathbf{T}^{\top} \right)\_{m,d} \\\\ \frac{\partial E\_{\text{NMF}}}{\partial \mathbf{T}\_{d,n}} = 2 \left( \mathbf{P}^{\top} \mathbf{PT} - \mathbf{P}^{\top} \hat{\mathbf{X}} \right)\_{d,n} \end{cases}. \tag{9}$$

The multiplicative update rules for (9) are, respectively,

$$\mathbf{P}\_{m,d} = \max\left(\boldsymbol{\varepsilon}, \mathbf{P}\_{m,d} \odot \left(\frac{\hat{\mathbf{X}} \mathbf{T}^{\top}}{\mathbf{P} \mathbf{T} \mathbf{T}^{\top}}\right)\_{m,d}\right) \tag{10}$$

and

$$\mathbf{T}\_{d,n} = \max\left(\boldsymbol{\varepsilon}, \mathbf{T}\_{d,n} \odot \left(\frac{\mathbf{P}^{\top}\hat{\mathbf{X}}}{\mathbf{P}^{\top}\mathbf{P}\mathbf{T}}\right)\_{d,n}\right),\tag{11}$$

where *ε* is an extremely small positive number, and division is elementwise. Eqs. (10), (11), and (4) iterate until convergence.
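Putting Eqs. (10), (11), and (4) together yields the following sketch; the random nonnegative initialization, the mean-based initial fill, and the small constant added to the denominators are assumptions of this illustration.

```python
import numpy as np

def nmf_impute(X, D=2, n_iter=200, eps=1e-9):
    """Sketch of NMF-based imputation: X is M-by-N, assumed nonnegative,
    with np.nan for missing entries."""
    rng = np.random.default_rng(0)
    miss = np.isnan(X)
    X_hat = np.where(miss, np.nanmean(X), X)            # initial replacement
    M, N = X.shape
    P = rng.random((M, D))                              # random nonnegative factors
    T = rng.random((D, N))
    for _ in range(n_iter):
        P = np.maximum(eps, P * (X_hat @ T.T) / (P @ T @ T.T + eps))   # Eq. (10)
        T = np.maximum(eps, T * (P.T @ X_hat) / (P.T @ P @ T + eps))   # Eq. (11)
        X_hat = np.where(miss, P @ T, X_hat)            # Eq. (4): refill missing entries
    return X_hat
```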
**3. Experimental results**

To show the imputation performance of the above-mentioned methods, experiments on open datasets were conducted. The datasets included Abalone (Aba), Scene (SCN), White Wine (WW), and Indian Pines (IP). The numbers of samples were 4177, 2407, 4898, and 21025, respectively. Furthermore, the dimensionality was, respectively, 561, 294, 11, and 200. The imputation approaches included KNN Regression Imputation (KNRImpute), KNNImpute with *K* = 5, Regression Tree Imputation (RTImpute), Random Forest Imputation (RFImpute), and NIPALS-PCA Imputation (PCAImpute) with only one component. All of them were found in open sources.

To generate missing values for each dataset, this study used a random generator to decide the missing-value entries. KNRImpute, KNNImpute, RTImpute, and RFImpute required that missing values not be uniformly distributed over the data; otherwise, imputation could not be performed. Thus, not every independent variable was chosen to contain missing values. Missing-value rates ranged from 3.00% to 9.00%.
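A sketch of the missing-value generation described above might look as follows; the fraction of affected variables, the helper name, and the seed are hypothetical.

```python
import numpy as np

def inject_missing(X, rate=0.05, frac_vars=0.5, seed=0):
    """Randomly mark entries of X (M-by-N, features x samples) as missing,
    restricted to a randomly chosen subset of the independent variables."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    X_miss = X.astype(float).copy()
    chosen = rng.choice(M, size=max(1, int(frac_vars * M)), replace=False)
    n_missing = int(rate * M * N)
    rows = rng.choice(chosen, size=n_missing)   # only the chosen variables
    cols = rng.integers(0, N, size=n_missing)
    X_miss[rows, cols] = np.nan                 # duplicate picks lower the realized rate slightly
    return X_miss
```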