**5. Variable selection in nonlinear PCA: Modified PCA approach**

In the analysis of data with large numbers of variables, a common objective is to reduce the dimensionality of the data set. PCA is a popular dimension-reducing tool that replaces the variables in the data set by a smaller number of derived variables. However, when PCA is applied to a data set with a large number of variables, the result may be difficult to interpret. One way to give a simple interpretation of principal components is to select a subset of variables that best approximates all the variables. Various variable selection criteria in PCA have been proposed by Jolliffe [Jolliffe, 1972], McCabe [McCabe, 1984], Robert and Escoufier [Robert and Escoufier, 1976], and Krzanowski [Krzanowski, 1987]. Al-Kandari et al. [Al-Kandari et al., 2001; Al-Kandari et al., 2005] gave guidelines as to the types of data for which each variable selection criterion is useful. Cadima et al. [Cadima et al., 2004] reported computational experiments carried out with several heuristic algorithms for the optimization problems resulting from the variable selection criteria in PCA found in the above literature.

Tanaka and Mori [Tanaka and Mori, 1997] proposed modified PCA (M.PCA) for deriving principal components which are computed by using only a selected subset of variables but which represent all the variables, including those not selected. Since M.PCA includes variable selection procedures in the analysis, its criteria can be used directly to find a reasonable subset of variables. Mori et al. [Mori et al., 1997] extended M.PCA to qualitative data and provided variable selection procedures, in which the alternating least squares (ALS) algorithm is utilized.

#### **5.1 Formulation of modified PCA**

M.PCA derives principal components which are computed as linear combinations of a subset of variables but which can reproduce all the variables very well. Let $\mathbf{X}$ be decomposed into an $n \times q$ submatrix $\mathbf{X}\_{V\_1}$ and an $n \times (p-q)$ remaining submatrix $\mathbf{X}\_{V\_2}$. Then M.PCA finds $r$ linear combinations $\mathbf{Z} = \mathbf{X}\_{V\_1}\mathbf{A}$. The matrix $\mathbf{A}$ consists of the eigenvectors associated with the largest $r$ eigenvalues $\lambda\_1 \geq \lambda\_2 \geq \cdots \geq \lambda\_r$ and is obtained by solving the eigenvalue problem:

$$[(\mathbf{S}\_{11}^2 + \mathbf{S}\_{12}\mathbf{S}\_{21}) - \mathbf{D}\mathbf{S}\_{11}]\mathbf{A} = 0,\tag{8}$$

where $\mathbf{S} = \begin{pmatrix} \mathbf{S}\_{11} & \mathbf{S}\_{12} \\\\ \mathbf{S}\_{21} & \mathbf{S}\_{22} \end{pmatrix}$ is the covariance matrix of $\mathbf{X} = (\mathbf{X}\_{V\_1}, \mathbf{X}\_{V\_2})$ and $\mathbf{D}$ is a $q \times q$ diagonal matrix of eigenvalues. A best subset of $q$ variables has the largest value of the proportion $P = \sum\_{j=1}^{r} \lambda\_j / \mathrm{tr}(\mathbf{S})$ or the $RV$-coefficient $RV = \left\{ \sum\_{j=1}^{r} \lambda\_j^2 / \mathrm{tr}(\mathbf{S}^2) \right\}^{1/2}$. Here we use $P$ as the variable selection criterion.
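As a concrete illustration, the eigenvalue problem (8) can be solved numerically with NumPy. The sketch below (the helper name `mpca` and its interface are our own, not the authors' code) partitions the covariance matrix into the blocks $\mathbf{S}\_{11}$, $\mathbf{S}\_{12}$, reduces the generalized eigenproblem to a symmetric standard one by a Cholesky whitening of $\mathbf{S}\_{11}$, and returns the components $\mathbf{Z}$, the loadings $\mathbf{A}$, and the proportion $P$:

```python
import numpy as np

def mpca(X, idx_sel, r):
    """Sketch of M.PCA: components computed from the selected columns
    idx_sel that best reproduce all p variables, via eigenproblem (8).
    Assumes S11 is positive definite (selected variables not collinear)."""
    n, p = X.shape
    idx_sel = list(idx_sel)
    idx_rem = [j for j in range(p) if j not in idx_sel]
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)                # covariance matrix of X
    S11 = S[np.ix_(idx_sel, idx_sel)]           # selected-variable block
    S12 = S[np.ix_(idx_sel, idx_rem)]           # cross-covariance block
    M = S11 @ S11 + S12 @ S12.T                 # S11^2 + S12 S21
    # Generalized problem  M a = lambda S11 a,  solved via Cholesky
    # whitening: with S11 = L L^T and a = L^{-T} w, it becomes the
    # symmetric standard problem  L^{-1} M L^{-T} w = lambda w.
    L = np.linalg.cholesky(S11)
    Li = np.linalg.inv(L)
    lam, W = np.linalg.eigh(Li @ M @ Li.T)
    order = np.argsort(lam)[::-1][:r]           # largest r eigenvalues
    lam_r = lam[order]
    A = Li.T @ W[:, order]                      # back-transform eigenvectors
    P = lam_r.sum() / np.trace(S)               # proportion criterion
    Z = Xc[:, idx_sel] @ A                      # r principal components
    return Z, A, lam_r, P
```

When the selected subset is the full variable set, $\mathbf{S}\_{12}$ is empty and the problem reduces to ordinary PCA of $\mathbf{S}$.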

#### **5.2 Variable selection procedures**

In order to find a subset of *q* variables, we employ the backward elimination and forward selection procedures of Mori et al. [Mori et al., 1998; Mori et al., 2006] as cost-saving stepwise selection procedures, in which only one variable is removed or added at each step.
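These stepwise procedures can be sketched as greedy loops over the criterion $P$ of Section 5.1. The helper names below (`criterion_P`, `backward_elimination`, `forward_selection`) are hypothetical, and the initialization of forward selection from the empty set is a simplifying assumption; this is an illustrative sketch of one-variable-at-a-time stepwise search, not the authors' implementation:

```python
import numpy as np

def criterion_P(S, sel, r):
    """Proportion P for candidate subset sel, via the eigenproblem (8).
    Assumes the selected block S11 is positive definite."""
    sel = list(sel)
    rem = [j for j in range(S.shape[0]) if j not in sel]
    S11 = S[np.ix_(sel, sel)]
    S12 = S[np.ix_(sel, rem)]
    M = S11 @ S11 + S12 @ S12.T
    L = np.linalg.cholesky(S11)                 # whiten S11 as in Sec. 5.1
    Li = np.linalg.inv(L)
    lam = np.linalg.eigvalsh(Li @ M @ Li.T)
    return np.sort(lam)[::-1][:r].sum() / np.trace(S)

def backward_elimination(S, r, q):
    """Start from all p variables; repeatedly drop the one variable whose
    removal leaves the largest P, until q variables remain."""
    sel = list(range(S.shape[0]))
    while len(sel) > q:
        sel = max(([v for v in sel if v != drop] for drop in sel),
                  key=lambda cand: criterion_P(S, cand, r))
    return sel

def forward_selection(S, r, q):
    """Start from the empty set; repeatedly add the one variable that
    gives the largest P, until q variables are selected."""
    sel = []
    while len(sel) < q:
        rest = [v for v in range(S.shape[0]) if v not in sel]
        sel = max((sel + [v] for v in rest),
                  key=lambda cand: criterion_P(S, cand, r))
    return sel
```

Each step evaluates at most $p$ candidate subsets, so the whole search costs $O(p^2)$ eigenproblem solves rather than the $\binom{p}{q}$ of exhaustive enumeration.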
